Advanced Melusine Detectors

This tutorial presents the advanced features of the MelusineDetector class:

Debug mode
Row wise methods vs DataFrame wise methods
Custom transform methods

Debug mode

MelusineDetector objects are designed to be easy to debug. For that purpose, the pre_detect/detect/post_detect methods all take a debug_mode argument. Debug mode is activated by setting the debug attribute of a DataFrame to True.

import pandas as pd
df = pd.DataFrame({"bla": [1, 2, 3]})
df.debug = True

Warning

Debug mode activation is backend dependent. With a DictBackend, you should use my_dict["debug"] = True instead.

When debug mode is activated, a column named debug_DETECTOR_NAME (debug_virus in the example below) containing an empty dictionary is automatically created. Populating this dictionary with debug information is left to the user.
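As a rough illustration of that behavior, the sketch below mimics the initialization step on a plain dictionary, such as a row handled by a DictBackend. The helper init_debug_dict is hypothetical and is not part of Melusine's API; the column name is assumed from the debug_virus column shown in the example output of this section.

```python
from typing import Any, Dict


def init_debug_dict(row: Dict[str, Any], detector_name: str) -> Dict[str, Any]:
    """Hypothetical helper: add an empty debug dict when the row is flagged for debug."""
    if row.get("debug", False):
        # Column name assumed from the "debug_virus" column in the example output
        row.setdefault(f"debug_{detector_name}", {})
    return row


email = {"header": "Virus alert", "body": "Hello", "debug": True}
email = init_debug_dict(email, "virus")

# The detector methods can now store whatever helps with debugging
email["debug_virus"]["detection_input"] = email["header"] + "\n" + email["body"]
```

When the debug key is absent or False, the helper leaves the row untouched, which matches the idea that debug data is only collected on request.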

Example of a detector with debug data:

import re

from melusine.base import MelusineDetector


class MyVirusDetector(MelusineDetector):
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column

        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, row, debug_mode=False):
        effective_text = row[self.header_column] + "\n" + row[self.body_column]
        row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text

        if debug_mode:
            row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}

        return row

    def detect(self, row, debug_mode=False):
        text = row[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"virus"
        negative_regex = r"corona[ _]virus"

        positive_match = re.search(positive_regex, text)
        negative_match = re.search(negative_regex, text)

        row[self.TMP_POSITIVE_REGEX_MATCH] = bool(positive_match)
        row[self.TMP_NEGATIVE_REGEX_MATCH] = bool(negative_match)

        if debug_mode:
            positive_match_text = (
                positive_match.string[positive_match.start() : positive_match.end()] if positive_match else None
            )
            negative_match_text = (
                negative_match.string[negative_match.start() : negative_match.end()] if negative_match else None
            )
            row[self.debug_dict_col].update(
                {
                    "positive_match_data": {"result": bool(positive_match), "match_text": positive_match_text},
                    "negative_match_data": {"result": bool(negative_match), "match_text": negative_match_text},
                }
            )

        return row

    def post_detect(self, row, debug_mode=False):
        if row[self.TMP_POSITIVE_REGEX_MATCH] and not row[self.TMP_NEGATIVE_REGEX_MATCH]:
            row[self.OUTPUT_RESULT_COLUMN] = True
        else:
            row[self.OUTPUT_RESULT_COLUMN] = False

        return row

In the end, an extra column is created containing debug data:

	virus_result	debug_virus
0	True	{'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
1	False	{'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}}
2	True	{'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
3	False	{'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}}
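The regex logic behind these results can be checked in isolation, without a detector or a DataFrame. The sketch below reuses the two patterns from the detect method; the helper names match_data and explain, as well as the sample texts, are hypothetical and are not the inputs behind the table above.

```python
import re
from typing import Dict, Optional


def match_data(pattern: str, text: str) -> Dict[str, Optional[object]]:
    """Build the same result/match_text pair that the detector stores in its debug dict."""
    match = re.search(pattern, text)
    return {"result": bool(match), "match_text": match.group(0) if match else None}


def explain(text: str) -> Dict[str, object]:
    """Combine positive and negative matches the way post_detect does."""
    positive = match_data(r"virus", text)
    negative = match_data(r"corona[ _]virus", text)
    return {
        "positive_match_data": positive,
        "negative_match_data": negative,
        "virus_result": bool(positive["result"] and not negative["result"]),
    }


# Hypothetical inputs, chosen only to exercise both branches
flagged = explain("This email contains a virus")
ignored = explain("News about the corona virus")
```

Both patterns are case sensitive as written, so a text containing only "Virus" would not match unless re.IGNORECASE is passed to re.search.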

Row Methods VS Dataframe Methods

There are two ways to write the pre_detect/detect/post_detect methods:

Row wise: The method works on a single row of a DataFrame. In that case, a map-like method is used to apply it on an entire dataframe (typically pandas.DataFrame.apply is used with the PandasBackend).
Dataframe wise: The method works directly on the entire DataFrame.

Tip

Using row wise methods makes your code backend independent. You may switch from a PandasBackend to a DictBackend at any time. The PandasBackend also supports multiprocessing for row wise methods.

To use row wise methods, you just need to name the first parameter "row". Otherwise, dataframe wise transformations are used.
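One way such a naming convention can be detected is by inspecting the method signature. The sketch below is an assumption about how the check could work, not Melusine's actual implementation; is_row_wise and the Demo class are hypothetical names.

```python
import inspect
from typing import Callable


def is_row_wise(method: Callable) -> bool:
    """Hypothetical check: is the first parameter of `method` named "row"?"""
    parameters = list(inspect.signature(method).parameters)
    return bool(parameters) and parameters[0] == "row"


def pre_detect_row(row, debug_mode=False):
    return row


def pre_detect_df(df, debug_mode=False):
    return df


class Demo:
    def detect(self, row, debug_mode=False):
        return row


row_wise = is_row_wise(pre_detect_row)
df_wise = is_row_wise(pre_detect_df)
# For a bound method, inspect.signature leaves out "self", so "row" comes first
bound_row_wise = is_row_wise(Demo().detect)
```

A convention based on parameter names keeps the detector code free of extra configuration, at the cost of making the name "row" significant.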

Example of a detector with dataframe wise methods (works with a PandasBackend only):

class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions a virus.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column

        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]

        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"

        # Pandas str.extract on a column; expand=False returns a Series for a single group
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex, expand=False).notna()
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex, expand=False).notna()

        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas column
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df

Custom Transform Methods

If you are not happy with the pre_detect/detect/post_detect transform methods, you may:

Use custom template methods.
Use regular pipeline steps (not inheriting from the MelusineDetector class).

In this example, the prepare/run custom transform methods are used instead of the default pre_detect/detect/post_detect.

from typing import Callable, List

from melusine.base import BaseMelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        row[self.output_columns[0]] = "12345"
        return row

To configure custom transform methods you need to:

Inherit from the melusine.base.BaseMelusineDetector class.
Define the transform_methods property.

The transform method will now call prepare and run.
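The mechanism can be pictured as a template method that loops over transform_methods in order. The sketch below is a simplified stand-in working on a single dictionary; SketchDetector and PrepareRunDetector are hypothetical classes and do not reproduce Melusine's base class, which also handles backends and debug mode.

```python
from typing import Any, Callable, Dict, List


class SketchDetector:
    """Hypothetical stand-in for a detector base class, not Melusine's implementation."""

    @property
    def transform_methods(self) -> List[Callable]:
        raise NotImplementedError

    def transform(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Apply each transform method in order, feeding its output to the next one
        for method in self.transform_methods:
            row = method(row)
        return row


class PrepareRunDetector(SketchDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        row["prepared"] = True
        return row

    def run(self, row, debug_mode=False):
        row["output_col"] = "12345"
        return row


result = PrepareRunDetector().transform({"input_col": "test1"})
```

Because the list is exposed as a property, a subclass can change the number, names, and order of the steps without overriding transform itself.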

df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)

detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")

df = detector.transform(df)

We can check that the run method was indeed called.

	input_col	output_col
0	test1	12345
1	test2	12345