Melusine Detectors
The MelusineDetector component aims at standardizing how detection is performed in a MelusinePipeline.
Tip
Projects that run over several years (such as email automation) may accumulate technical debt. Standardizing code practices can limit that debt and ease the onboarding of new developers.
The MelusineDetector class splits detection into three steps:
- pre_detect: Select/combine the inputs needed for detection. For example, select the text parts tagged as BODY and combine them with the text of the email header.
- detect: Use regular expressions, ML models or heuristics to run detection on the input text.
- post_detect: Run detection rules such as thresholding or combine results from multiple models.
The transform method is defined by the base class MelusineDetector and calls the pre_detect, detect and post_detect methods in turn (Template pattern).
# Instantiate Detector
detector = MyDetector()
# Run pre_detect, detect and post_detect on input data
data_with_detection = detector.transform(data)
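The Template pattern mentioned above can be illustrated with a small standalone sketch. It does not use Melusine: SketchDetector and LengthDetector are hypothetical classes written only for this illustration, and rows are plain dictionaries. The base class fixes the order of the steps, and the subclass only fills them in.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class SketchDetector:
    """Hypothetical stand-in for a detector base class (not the Melusine API)."""

    def pre_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def detect(self, row: Row) -> Row:
        raise NotImplementedError

    def post_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def transform(self, rows: List[Row]) -> List[Row]:
        # Template method: the order of the steps is fixed here,
        # subclasses only implement the steps themselves.
        out = []
        for row in rows:
            row = dict(row)  # work on a copy of the input row
            for step in (self.pre_detect, self.detect, self.post_detect):
                row = step(row)
            out.append(row)
        return out


class LengthDetector(SketchDetector):
    """Toy subclass: flags texts longer than 10 characters."""

    def pre_detect(self, row: Row) -> Row:
        row["text"] = row["body"].strip()
        return row

    def detect(self, row: Row) -> Row:
        row["length"] = len(row["text"])
        return row

    def post_detect(self, row: Row) -> Row:
        row["is_long"] = row["length"] > 10
        return row


result = LengthDetector().transform([{"body": " hi "}, {"body": "a much longer text"}])
print([r["is_long"] for r in result])
```

Callers only ever invoke transform, so every detector built this way exposes the same entry point regardless of what its three steps do.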
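Here is the code of a minimal custom detector. It subclasses BaseMelusineDetector and declares its own transform methods (prepare and run) through the transform_methods property. The next sections break down the parts of a detector that flags emails related to viruses.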
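from typing import Callable, List

from melusine.base import BaseMelusineDetector  # import path assumed


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        # Methods applied in turn by transform
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        row[self.output_columns[0]] = "12345"
        return row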
Here, the detector is run on a simple pandas DataFrame:
import pandas as pd

df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)
detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")
df = detector.transform(df)
For the virus detector described in the next sections, the output is a DataFrame with a new virus_result column.
|   | body | header | virus_result |
|---|---|---|---|
| 0 | This is a dangerous virus | test | True |
| 1 | test | test | False |
| 2 | test | viruses are dangerous | True |
| 3 | corona virus is annoying | test | False |
Tip
Columns that are not declared in output_columns are dropped automatically.
Detector init
In the init method, you should call the superclass init and provide:
- A name for the detector
- Input columns
- Output columns
Tip
If the init method of the superclass is enough (parameters name, input_columns and output_columns), you may skip the init method entirely when defining your MelusineDetector.
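As a minimal sketch of such an init method, the example below uses StandInDetector, a hypothetical stand-in for the MelusineDetector base class that only stores the three parameters listed above. The attribute names body_column, header_column and OUTPUT_RESULT_COLUMN follow the snippets in this tutorial. The class name VirusDetector and the values "virus" and "virus_result" are assumptions made for the illustration.

```python
from typing import List


class StandInDetector:
    """Hypothetical stand-in for the MelusineDetector base class."""

    def __init__(self, name: str, input_columns: List[str], output_columns: List[str]):
        self.name = name
        self.input_columns = input_columns
        self.output_columns = output_columns


class VirusDetector(StandInDetector):
    # Name of the column holding the final detection result (assumed value)
    OUTPUT_RESULT_COLUMN = "virus_result"

    def __init__(self, body_column: str, header_column: str):
        # Keep the column names so the detection steps can refer to them
        self.body_column = body_column
        self.header_column = header_column

        # Call the superclass init with a name, input columns and output columns
        super().__init__(
            name="virus",
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
        )


detector = VirusDetector(body_column="body", header_column="header")
print(detector.name, detector.input_columns, detector.output_columns)
```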
Detector pre_detect
The pre_detect method simply combines the header text and the body text (separated by a line break).
def pre_detect(self, df, debug_mode=False):
    # Assemble the text columns into a single column
    df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
    return df
Detector detect
The detect method applies two regexes to the selected text:
- A positive regex to catch mentions of viruses.
- A negative regex to avoid false positive detections.
def detect(self, df, debug_mode=False):
    text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
    positive_regex = r"(virus)"
    negative_regex = r"(corona[ _]virus)"
    # Pandas str.extract method on columns
    df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
    df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
    return df
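To see what the two regexes catch, here is a small sketch that applies them with the standard re module instead of pandas. The helper regex_flags and the key names positive_match and negative_match are hypothetical. The sample texts are the header and body values from the result table, joined by a line break.

```python
import re

POSITIVE_REGEX = re.compile(r"(virus)")
NEGATIVE_REGEX = re.compile(r"(corona[ _]virus)")


def regex_flags(text: str) -> dict:
    """Return whether each regex matches anywhere in the text."""
    return {
        "positive_match": POSITIVE_REGEX.search(text) is not None,
        "negative_match": NEGATIVE_REGEX.search(text) is not None,
    }


# Header and body joined by a line break, header first
texts = [
    "test\nThis is a dangerous virus",
    "test\ntest",
    "viruses are dangerous\ntest",
    "test\ncorona virus is annoying",
]
flags = [regex_flags(text) for text in texts]
for text, flag in zip(texts, flags):
    print(repr(text), flag)
```

The last text matches both regexes, which is why a separate negative regex is needed to discard it later.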
Detector post_detect
The post_detect method combines the regex detection results to determine the final result.
def post_detect(self, df, debug_mode=False):
    # Boolean operation on pandas column
    df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
    return df
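The same combination rule can be sketched on plain dictionaries, one per email. The column names and the helper drop_temporary are hypothetical. The helper only emulates the removal of columns that are not declared as outputs and is not how Melusine implements it. The regex flags are given as inputs here, with the values expected for the four rows of the result table.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical column names, chosen for this sketch
POSITIVE = "tmp_positive_match"
NEGATIVE = "tmp_negative_match"
RESULT = "virus_result"


def post_detect(row: Row) -> Row:
    # Same rule as `positive & ~negative` on pandas columns
    row[RESULT] = row[POSITIVE] and not row[NEGATIVE]
    return row


def drop_temporary(row: Row, keep: List[str]) -> Row:
    # Emulates dropping the columns that are not declared as outputs
    return {key: value for key, value in row.items() if key in keep}


rows = [
    {"body": "This is a dangerous virus", "header": "test", POSITIVE: True, NEGATIVE: False},
    {"body": "test", "header": "test", POSITIVE: False, NEGATIVE: False},
    {"body": "test", "header": "viruses are dangerous", POSITIVE: True, NEGATIVE: False},
    {"body": "corona virus is annoying", "header": "test", POSITIVE: True, NEGATIVE: True},
]
final = [drop_temporary(post_detect(dict(row)), keep=["body", "header", RESULT]) for row in rows]
print([row[RESULT] for row in final])
```

Only the input columns and the result column remain in each row, and the result values line up with the virus_result column of the table.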
Are MelusineDetectors mandatory for melusine?
No.
You can use any scikit-learn compatible component in your MelusinePipeline.
However, we recommend using the MelusineDetector (and MelusineTransformer) classes to benefit from:
- Code standardization
- Input columns validation
- Interchangeable DataFrame backends. Today, native Python dictionaries and the pandas backend are supported, but more backends may be added (e.g. polars)
- Debug mode
- Multiprocessing
Check out the next tutorial to discover advanced features of the MelusineDetector class.