Melusine Detectors

The MelusineDetector component aims at standardizing how detection is performed in a MelusinePipeline.

Tip

Projects that run over several years (such as email automation) tend to accumulate technical debt. Standardizing code practices limits that debt and eases the onboarding of new developers.

The MelusineDetector class splits detection into three steps:

  • pre_detect: Select/combine the inputs needed for detection. For example, select the text parts tagged as BODY and combine them with the text of the email header.
  • detect: Use regular expressions, ML models or heuristics to run detection on the input text.
  • post_detect: Run detection rules such as thresholding or combine results from multiple models.

The transform method is defined by the MelusineDetector base class and calls the pre_detect, detect and post_detect methods in turn (template method pattern).

# Instantiate Detector
detector = MyDetector()

# Run pre_detect, detect and post_detect on input data
data_with_detection = detector.transform(data)
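To make the template method pattern concrete, here is a minimal pure-Python sketch. It is not melusine code: SketchDetector, LengthDetector and the tmp_text, tmp_length and is_long keys are hypothetical names, and rows are plain dictionaries. The only point is that transform fixes the order of the steps while subclasses supply the steps themselves.

```python
from typing import Any, Dict

Row = Dict[str, Any]


class SketchDetector:
    """Hypothetical stand-in for a detector base class (not melusine code)."""

    def pre_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def detect(self, row: Row) -> Row:
        raise NotImplementedError

    def post_detect(self, row: Row) -> Row:
        raise NotImplementedError

    def transform(self, row: Row) -> Row:
        # Template method: the order of the steps is fixed here,
        # subclasses only implement the steps.
        for step in (self.pre_detect, self.detect, self.post_detect):
            row = step(row)
        return row


class LengthDetector(SketchDetector):
    """Toy subclass that flags long emails."""

    def pre_detect(self, row: Row) -> Row:
        # Combine the inputs needed for detection
        row["tmp_text"] = row["header"] + "\n" + row["body"]
        return row

    def detect(self, row: Row) -> Row:
        # Run the (trivial) detection on the combined text
        row["tmp_length"] = len(row["tmp_text"])
        return row

    def post_detect(self, row: Row) -> Row:
        # Apply a threshold to get the final result
        row["is_long"] = row["tmp_length"] > 20
        return row


result = LengthDetector().transform(
    {"header": "hello", "body": "a fairly long body text"}
)
print(result["is_long"])
```

The caller only ever uses transform, so every detector built this way exposes the same entry point regardless of what its three steps do.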

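The running example in this tutorial is a MelusineDetector that detects emails related to viruses; the next sections break down its pre_detect, detect and post_detect methods. The class below is a minimal custom detector built on BaseMelusineDetector, which declares its own sequence of transform methods (prepare and run) instead of the three standard steps.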

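from typing import Callable, List

from melusine.base import BaseMelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        # Methods called in turn by transform
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        # Nothing to prepare in this minimal example
        return row

    def run(self, row, debug_mode=False):
        # Write a constant value to the first declared output column
        row[self.output_columns[0]] = "12345"
        return row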

Here, the detector is run on a simple pandas DataFrame:

import pandas as pd

df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)

detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")

df = detector.transform(df)

For the virus detector, the output is a DataFrame with a new virus_result column:

   body                       header                 virus_result
0  This is a dangerous virus  test                   True
1  test                       test                   False
2  test                       viruses are dangerous  True
3  corona virus is annoying   test                   False

Tip

Columns that are not declared in the output_columns are dropped automatically.

Detector init

In the init method, you should call the superclass init and provide:

  • A name for the detector
  • Input columns
  • Output columns

Tip

If the init method of the superclass is enough (parameters name, input_columns and output_columns), you may skip the init method entirely when defining your MelusineDetector.
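As an illustration, an init method following these guidelines could look like the sketch below. To keep it runnable without melusine installed, DetectorBaseStandIn is a hypothetical stand-in that only mirrors the three parameters named above; in real code the parent would be MelusineDetector. The detector name "virus" and the class name VirusDetectorSketch are also assumptions.

```python
from typing import List


class DetectorBaseStandIn:
    """Hypothetical stand-in mirroring name, input_columns and output_columns."""

    def __init__(
        self, name: str, input_columns: List[str], output_columns: List[str]
    ) -> None:
        self.name = name
        self.input_columns = input_columns
        self.output_columns = output_columns


class VirusDetectorSketch(DetectorBaseStandIn):
    # Name of the column holding the final detection result
    OUTPUT_RESULT_COLUMN = "virus_result"

    def __init__(self, body_column: str, header_column: str) -> None:
        # Keep the column names on the instance so that later steps can use them
        self.body_column = body_column
        self.header_column = header_column

        # Call the superclass init with a name, input columns and output columns
        super().__init__(
            name="virus",
            input_columns=[body_column, header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
        )


detector = VirusDetectorSketch(body_column="body", header_column="header")
print(detector.input_columns, detector.output_columns)
```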

Detector pre_detect

The pre_detect method simply combines the header text and the body text (separated by a line break).

def pre_detect(self, df, debug_mode=False):
    # Assemble the text columns into a single column
    df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]

    return df

Detector detect

The detect method applies two regexes to the selected text:

  • A positive regex to catch mentions of viruses.
  • A negative regex to avoid false positive detections (here, mentions of the coronavirus).

def detect(self, df, debug_mode=False):
    text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
    positive_regex = r"(virus)"
    negative_regex = r"(corona[ _]virus)"

    # str.extract returns the captured group (NaN when there is no match);
    # notna() turns it into a boolean flag
    df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex, expand=False).notna()
    df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex, expand=False).notna()

    return df

Detector post_detect

The post_detect method combines the two regex results to determine the final result: an email is flagged when the positive regex matches and the negative regex does not.

def post_detect(self, df, debug_mode=False):
    # Boolean operation on pandas column
    df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
    return df
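The three steps can be checked by hand on the four example emails. The sketch below uses the standard re module on plain dictionaries instead of pandas, so it is an illustration of the logic rather than the detector itself. The split of each example row into body and header is an assumed reading of the result table, and detect_virus is a hypothetical helper name.

```python
import re

POSITIVE_REGEX = r"(virus)"
NEGATIVE_REGEX = r"(corona[ _]virus)"

# Example emails (assumed split between body and header)
rows = [
    {"body": "This is a dangerous virus", "header": "test"},
    {"body": "test", "header": "test"},
    {"body": "test", "header": "viruses are dangerous"},
    {"body": "corona virus is annoying", "header": "test"},
]


def detect_virus(row):
    # pre_detect: join header and body with a line break
    text = row["header"] + "\n" + row["body"]

    # detect: run the positive and negative regexes
    positive = re.search(POSITIVE_REGEX, text) is not None
    negative = re.search(NEGATIVE_REGEX, text) is not None

    # post_detect: positive match that is not cancelled by a negative match
    return positive and not negative


results = [detect_virus(row) for row in rows]
print(results)  # [True, False, True, False]
```

The last email matches the positive regex but is excluded by the negative one, which is why its result is False.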

Are MelusineDetectors mandatory for melusine?

No.

You can use any scikit-learn compatible component in your MelusinePipeline. However, we recommend using the MelusineDetector (and MelusineTransformer) classes to benefit from:

  • Code standardization
  • Input columns validation
  • DataFrame backend abstraction: native Python dictionaries and pandas are supported today, and more backends (e.g. polars) may be added
  • Debug mode
  • Multiprocessing
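For comparison, the sketch below shows the general shape of a component that follows the scikit-learn fit/transform convention without any of the features listed above. LowercaseText is a hypothetical example working on a list of dictionaries; scikit-learn estimators usually also inherit from BaseEstimator and TransformerMixin, and the data format a MelusinePipeline passes to such a component is not covered here.

```python
class LowercaseText:
    """Hypothetical transformer following the fit/transform convention."""

    def __init__(self, column="body"):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; return self so that calls can be chained
        return self

    def transform(self, X):
        # X is assumed to be a list of dict rows in this sketch;
        # each row is copied so the input is left untouched
        return [{**row, self.column: row[self.column].lower()} for row in X]


emails = [{"body": "Hello WORLD"}]
print(LowercaseText().fit(emails).transform(emails))
```

Input validation, debug mode and backend handling all have to be written by hand in such a class, which is what the MelusineDetector and MelusineTransformer base classes provide.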

Check out the next tutorial to discover advanced features of the MelusineDetector class.