Melusine Detectors
The MelusineDetector component standardizes how detection is performed in a MelusinePipeline.
Tip
Projects that run for several years (such as email automation) tend to accumulate technical debt. Standardizing coding practices limits that debt and eases the onboarding of new developers.
The MelusineDetector class splits detection into three steps:
- pre_detect: Select/combine the inputs needed for detection. For example, select the text parts tagged as BODY and combine them with the text of the email header.
- detect: Use regular expressions, ML models or heuristics to run detection on the input text.
- post_detect: Run detection rules such as thresholding or combine results from multiple models.
The transform method is defined by the MelusineDetector base class and calls the pre_detect, detect and post_detect methods in turn (template method pattern).
# Instantiate Detector
detector = MyDetector()
# Run pre_detect, detect and post_detect on input data
data_with_detection = detector.transform(data)
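The template method pattern mentioned above can be sketched in plain Python. This is only an illustration, not Melusine's implementation: TemplateDetectorSketch and TracingDetector are hypothetical names, and the sketch works on a single dictionary row rather than on a full dataset.

```python
from typing import Any, Dict

Row = Dict[str, Any]


class TemplateDetectorSketch:
    """Hypothetical stand-in for a detector base class (not the Melusine API)."""

    def pre_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def detect(self, row: Row) -> Row:
        raise NotImplementedError

    def post_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def transform(self, row: Row) -> Row:
        # The base class fixes the order of the steps; subclasses only fill them in.
        for step in (self.pre_detect, self.detect, self.post_detect):
            row = step(row)
        return row


class TracingDetector(TemplateDetectorSketch):
    """Records the name of each step so the call order is visible."""

    def pre_detect(self, row: Row) -> Row:
        row["steps"] = ["pre_detect"]
        return row

    def detect(self, row: Row) -> Row:
        row["steps"].append("detect")
        return row

    def post_detect(self, row: Row) -> Row:
        row["steps"].append("post_detect")
        return row


result = TracingDetector().transform({"body": "test"})
print(result["steps"])
```

The subclass never calls the three steps itself: it only implements them, and the inherited transform decides when each one runs.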
Here is the full code of a MelusineDetector to detect emails related to viruses. The next sections break down the different parts of the code.
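from typing import Callable, List

from melusine.base import BaseMelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        # Methods called in turn by transform
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        # Write a constant value to the first output column
        row[self.output_columns[0]] = "12345"
        return row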
Here, the detector is run on a simple pandas DataFrame:
import pandas as pd

# Two-row input with a single text column
df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)

detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")
df = detector.transform(df)
The output is a DataFrame with a new virus_result column.
| | body | header | virus_result |
|---|---|---|---|
| 0 | This is a dangerous virus | test | True |
| 1 | test | test | False |
| 2 | test | viruses are dangerous | True |
| 3 | corona virus is annoying | test | False |
Tip
Columns that are not declared in the output_columns are dropped automatically.
Detector init
In the init method, you should call the superclass init and provide:
- A name for the detector
- Input columns
- Output columns
Tip
If the init method of the superclass is sufficient (it takes the parameters name, input_columns and output_columns),
you may skip the init method entirely when defining your MelusineDetector.
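As a minimal sketch of the init step, the example below passes a name, input columns and output columns to the superclass. SketchBaseDetector is a hypothetical stand-in used to keep the example self-contained (it is not the Melusine base class), and VirusDetectorSketch, body_column and header_column are assumed names chosen to match the sections that follow.

```python
from typing import List


class SketchBaseDetector:
    """Hypothetical stand-in for the Melusine base class (not the real API)."""

    def __init__(self, name: str, input_columns: List[str], output_columns: List[str]) -> None:
        self.name = name
        self.input_columns = input_columns
        self.output_columns = output_columns


class VirusDetectorSketch(SketchBaseDetector):
    # Assumed name for the result column
    OUTPUT_RESULT_COLUMN = "virus_result"

    def __init__(self, body_column: str, header_column: str) -> None:
        # Keep the column names so that later steps can read from them
        self.body_column = body_column
        self.header_column = header_column

        # Call the superclass init with the three expected parameters
        super().__init__(
            name="virus",
            input_columns=[body_column, header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
        )


detector = VirusDetectorSketch(body_column="body", header_column="header")
print(detector.name, detector.input_columns, detector.output_columns)
```

Declaring the columns in one place makes it easy to see which columns a detector reads and which ones it produces.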
Detector pre_detect
The pre_detect method simply combines the header text and the body text (separated by a line break).
def pre_detect(self, df, debug_mode=False):
    # Assemble the text columns into a single column
    df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
    return df
Detector detect
The detect method applies two regexes to the selected text:
- A positive regex to catch mentions of viruses.
- A negative regex to avoid false positive detections.
def detect(self, df, debug_mode=False):
    text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
    positive_regex = r"(virus)"
    negative_regex = r"(corona[ _]virus)"

    # Pandas str.extract method on columns
    df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
    df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
    return df
Detector post_detect
The post_detect method combines the regex detection results to determine the final result.
def post_detect(self, df, debug_mode=False):
    # Boolean operation on pandas column
    df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
    return df
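To check the detection logic in isolation, the three steps can be replayed on plain dictionaries with the standard re module. This is a simplified sketch, not the Melusine code path: detect_virus is a hypothetical helper, and re.search is assumed to be an adequate replacement for the pandas str.extract calls. The rows are the ones shown in the output table above.

```python
import re
from typing import Dict

POSITIVE_REGEX = re.compile(r"(virus)")
NEGATIVE_REGEX = re.compile(r"(corona[ _]virus)")


def detect_virus(row: Dict[str, str]) -> bool:
    # pre_detect: join the header and the body with a line break
    text = row["header"] + "\n" + row["body"]

    # detect: search for each pattern anywhere in the text
    positive_match = POSITIVE_REGEX.search(text) is not None
    negative_match = NEGATIVE_REGEX.search(text) is not None

    # post_detect: keep positive matches that are not cancelled by the negative regex
    return positive_match and not negative_match


rows = [
    {"body": "This is a dangerous virus", "header": "test"},
    {"body": "test", "header": "test"},
    {"body": "test", "header": "viruses are dangerous"},
    {"body": "corona virus is annoying", "header": "test"},
]
results = [detect_virus(row) for row in rows]
print(results)  # [True, False, True, False]
```

Note that with this rule a single negative match cancels the whole text: an email mentioning both "corona virus" and another virus would not be flagged.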
Are MelusineDetectors mandatory for melusine?
No.
You can use any scikit-learn compatible component in your MelusinePipeline.
However, we recommend using the MelusineDetector (and MelusineTransformer) classes to benefit from:
- Code standardization
- Input columns validation
- Dataframe backend abstraction. Today, native Python dictionaries and the pandas backend are supported, but more backends may be added (e.g. polars)
- Debug mode
- Multiprocessing
Check out the next tutorial to discover advanced features of the MelusineDetector class.