Advanced Melusine Detectors
This tutorial presents the advanced features of the MelusineDetector
class:
- Debug mode
- Row wise methods vs DataFrame wise methods
- Custom transform methods
Debug mode
MelusineDetector objects are designed to be easily debugged. For that purpose, the pre_detect/detect/post_detect methods all have a debug_mode argument. Debug mode is activated by setting the debug attribute of a dataframe to True.
Warning
Debug mode activation is backend dependent. With a DictBackend, you should use my_dict["debug"] = True.
When debug mode is activated, a column named debug_DETECTOR_NAME (available as self.debug_dict_col, debug_virus in the example below) containing an empty dictionary is automatically created.
Populating this debug dictionary with debug information is left to the user.
Example of a detector with debug data:
import re

from melusine.base import MelusineDetector


class MyVirusDetector(MelusineDetector):
    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column

        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, row, debug_mode=False):
        # Assemble the header and the body into a single text
        effective_text = row[self.header_column] + "\n" + row[self.body_column]
        row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text

        if debug_mode:
            row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}

        return row

    def detect(self, row, debug_mode=False):
        text = row[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"virus"
        negative_regex = r"corona[ _]virus"

        positive_match = re.search(positive_regex, text)
        negative_match = re.search(negative_regex, text)

        row[self.TMP_POSITIVE_REGEX_MATCH] = bool(positive_match)
        row[self.TMP_NEGATIVE_REGEX_MATCH] = bool(negative_match)

        if debug_mode:
            # Store the matched text (or None) for each regex
            positive_match_text = positive_match.group(0) if positive_match else None
            negative_match_text = negative_match.group(0) if negative_match else None
            row[self.debug_dict_col].update(
                {
                    "positive_match_data": {"result": bool(positive_match), "match_text": positive_match_text},
                    "negative_match_data": {"result": bool(negative_match), "match_text": negative_match_text},
                }
            )

        return row

    def post_detect(self, row, debug_mode=False):
        # Positive result only if the positive match is not cancelled by a negative match
        if row[self.TMP_POSITIVE_REGEX_MATCH] and not row[self.TMP_NEGATIVE_REGEX_MATCH]:
            row[self.OUTPUT_RESULT_COLUMN] = True
        else:
            row[self.OUTPUT_RESULT_COLUMN] = False

        return row
In the end, an extra column is created containing debug data:
| | virus_result | debug_virus |
---|---|---|
0 | True | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}} |
1 | False | {'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}} |
2 | True | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}} |
3 | False | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}} |
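To illustrate how the debug dictionary can be read once it is populated, here is a minimal sketch in plain Python. The helper explain_virus_debug is hypothetical (it is not part of Melusine); it only assumes the dictionary layout shown in the table above.

```python
from typing import Any, Dict


def explain_virus_debug(debug: Dict[str, Any]) -> str:
    """Summarize why the detector returned its result (hypothetical helper)."""
    positive = debug["positive_match_data"]
    negative = debug["negative_match_data"]

    if not positive["result"]:
        return "no positive match"
    if negative["result"]:
        return f"positive match cancelled by negative match {negative['match_text']!r}"
    return f"positive match {positive['match_text']!r}"


# Debug dictionary copied from row 3 of the table above
row_3_debug = {
    "detection_input": "...",
    "positive_match_data": {"result": True, "match_text": "virus"},
    "negative_match_data": {"result": True, "match_text": "corona virus"},
}

print(explain_virus_debug(row_3_debug))
```

A helper of this kind keeps the explanation logic outside the detector, so the detector itself only has to fill the debug dictionary.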
Row Methods VS Dataframe Methods
There are two ways to use the pre_detect/detect/post_detect methods:
- Row wise: The method works on a single row of a DataFrame. In that case, a map-like method is used to apply it to an entire dataframe (typically, pandas.DataFrame.apply is used with the PandasBackend).
- Dataframe wise: The method works directly on the entire DataFrame.
Tip
Using row wise methods makes your code backend independent. You may switch from a PandasBackend to a DictBackend at any time.
The PandasBackend also supports multiprocessing for row wise methods.
To use row wise methods, you just need to name the first parameter of the method "row". Otherwise, dataframe wise transformations are used.
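The naming convention can be pictured with a small sketch based on the standard inspect module. This only illustrates the idea and is not Melusine's actual implementation; the helper is_row_wise and the two sample functions are hypothetical.

```python
import inspect
from typing import Callable


def is_row_wise(method: Callable) -> bool:
    """Return True if the first parameter of `method` is named "row" (hypothetical helper)."""
    parameters = list(inspect.signature(method).parameters)
    return bool(parameters) and parameters[0] == "row"


def detect_row(row, debug_mode=False):
    # First parameter named "row": treated as a row wise method
    return row


def detect_df(df, debug_mode=False):
    # Any other name: treated as a dataframe wise method
    return df


print(is_row_wise(detect_row))  # True
print(is_row_wise(detect_df))  # False
```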
Example of a detector with dataframe wise methods (works with a PandasBackend only):
import pandas as pd

from melusine.base import MelusineDetector


class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions a virus.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column

        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]

        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"

        # Pandas str.extract method on columns (expand=False returns a Series)
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex, expand=False).notna()
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex, expand=False).notna()

        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas columns
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df
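The decision rule shared by the row wise and dataframe wise versions of the detector (a positive match that is not cancelled by a negative match) can be checked in isolation with the standard re module. This is a minimal sketch; the function name virus_result and the sample texts are made up for illustration.

```python
import re

POSITIVE_REGEX = r"virus"
NEGATIVE_REGEX = r"corona[ _]virus"


def virus_result(text: str) -> bool:
    """Positive match that is not cancelled by a negative match."""
    positive = bool(re.search(POSITIVE_REGEX, text))
    negative = bool(re.search(NEGATIVE_REGEX, text))
    return positive and not negative


# Hypothetical sample texts (header and body already joined)
samples = [
    "Alert\nThis file contains a virus",
    "Hello\nSee you tomorrow",
    "News\nLatest corona virus figures",
    "News\nLatest corona_virus figures",
]

for sample in samples:
    print(repr(sample), virus_result(sample))
```

Testing the rule on plain strings first makes it easier to tell a regex problem apart from a dataframe manipulation problem.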
Custom Transform Methods
If you are not happy with the pre_detect/detect/post_detect transform methods, you may:
- Use custom template methods.
- Use regular pipeline steps (not inheriting from the MelusineDetector class).
In this example, the prepare/run custom transform methods are used instead of the default pre_detect/detect/post_detect.
from typing import Callable, List

from melusine.base import BaseMelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        # Methods called (in order) by the transform method
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        row[self.output_columns[0]] = "12345"
        return row
To configure custom transform methods you need to:
- Inherit from the melusine.base.BaseMelusineDetector class.
- Define the transform_methods property.
The transform method will now call prepare and run.
import pandas as pd

df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)

detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")
df = detector.transform(df)
We can check that the run method was indeed called.
| | input_col | output_col |
---|---|---|
0 | test1 | 12345 |
1 | test2 | 12345 |
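The chaining of custom transform methods can be pictured with a simplified stand-in written in plain Python. SketchDetector is hypothetical and is not the Melusine base class: it works on a list of dictionaries and ignores backends and debug mode. It only shows the idea of applying each entry of transform_methods in order.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class SketchDetector:
    """Simplified stand-in for a detector base class (NOT the Melusine implementation)."""

    def __init__(self, output_columns: List[str]):
        self.output_columns = output_columns

    @property
    def transform_methods(self) -> List[Callable[[Row], Row]]:
        return [self.prepare, self.run]

    def prepare(self, row: Row) -> Row:
        return row

    def run(self, row: Row) -> Row:
        row[self.output_columns[0]] = "12345"
        return row

    def transform(self, rows: List[Row]) -> List[Row]:
        # Apply every transform method, in order, to every row
        for method in self.transform_methods:
            rows = [method(row) for row in rows]
        return rows


rows = [{"input_col": "test1"}, {"input_col": "test2"}]
result = SketchDetector(output_columns=["output_col"]).transform(rows)
print(result)
```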