# Getting started with Melusine

Let's run emergency detection with Melusine:

- Load a fake email dataset
- Load a demonstration pipeline
- Run the pipeline
    - Apply email cleaning transformations
    - Apply emergency detection
## Input data
Email datasets typically contain information about:
- Email sender
- Email recipients
- Email subject/header
- Email body
- Attachments data
The present tutorial only makes use of the body and header data.
|   | body | header |
|---|------|--------|
| 0 | This is an ëmèrgénçy | Help |
| 1 | How is life ? | Hey ! |
| 2 | Urgent update about Mr. Annoying | Latest news |
| 3 | Please call me now | URGENT |
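For reference, a hand-built equivalent of this toy dataset is just a plain pandas DataFrame (a sketch for illustration only; the tutorial itself loads the data with `load_email_data`):

```python
import pandas as pd

# Hand-built equivalent of the toy dataset shown above (illustration only;
# the tutorial loads it with melusine.data.load_email_data).
df = pd.DataFrame(
    {
        "body": [
            "This is an ëmèrgénçy",
            "How is life ?",
            "Urgent update about Mr. Annoying",
            "Please call me now",
        ],
        "header": ["Help", "Hey !", "Latest news", "URGENT"],
    }
)
```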
## Code
Typical code for a Melusine-based application looks like this:
```python
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
```
## Output Data
The pipeline created extra columns in the dataset. Some columns are temporary variables required by detectors (e.g. `normalized_body`) and some are detection results with direct business value (e.g. `emergency_result`).
|   | body | header | normalized_body | emergency_result |
|---|------|--------|-----------------|------------------|
| 0 | This is an ëmèrgénçy | Help | This is an emergency | True |
| 1 | How is life ? | Hey ! | How is life ? | False |
| 2 | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False |
| 3 | Please call me now | URGENT | Please call me now | True |
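Since the output is a regular DataFrame, the detection column can feed downstream logic directly. For instance, filtering with plain pandas (not a Melusine API):

```python
# Keep only the emails flagged as urgent by the pipeline.
urgent_emails = df[df["emergency_result"]]
print(urgent_emails[["header", "body"]])
```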
## Pipeline Steps
Illustration of the pipeline used in the present tutorial (a standalone sketch of the first two steps follows the list):

- `Cleaner`: Cleaning transformations such as uniformization of line breaks (`\r\n` -> `\n`).
- `Normalizer`: Text normalisation to delete/replace non-ASCII characters (`éöà` -> `eoa`).
- `EmergencyDetector`: Detection of urgent emails.
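The snippet below is not Melusine code; it is a minimal standalone sketch of what the first two steps conceptually do, using only the Python standard library:

```python
import unicodedata


def uniformize_line_breaks(text: str) -> str:
    """Sketch of the Cleaner step: uniformize line breaks (\\r\\n -> \\n)."""
    return text.replace("\r\n", "\n")


def normalize_text(text: str) -> str:
    """Sketch of the Normalizer step: strip accents (éöà -> eoa)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(normalize_text(uniformize_line_breaks("This is an ëmèrgénçy\r\n")))
# "This is an emergency\n"
```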
**Info**

This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:

- Email segmentation: split an email conversation into unitary messages
- ContentTagging: associate tags (SIGNATURE, FOOTER, BODY) with parts of messages
- Appointment detection: for example, detect "construction work will take place on 01/01/2024" as an appointment email
- More on preprocessing in the MelusineTransformers tutorial
- More on detectors in the MelusineDetector tutorial
## Debug Mode
End users typically want to know what led Melusine to a specific detection result. The debug mode generates additional explainability info.
```python
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
```
A new column `debug_emergency` is created.
|   | ... | emergency_result | debug_emergency |
|---|-----|------------------|-----------------|
| 0 | ... | True | [details_below] |
| 1 | ... | False | [details_below] |
| 2 | ... | False | [details_below] |
| 3 | ... | True | [details_below] |
Inspecting the debug data gives a lot of info:

- `text`: Effective text considered for detection.
- `EmergencyRegex`: Melusine used an `EmergencyRegex` object to run the detection.
- `match_result`: The `EmergencyRegex` did not match the text.
- `positive_match_data`: The `EmergencyRegex` matched positively on the text pattern "Urgent" (required condition).
- `negative_match_data`: The `EmergencyRegex` matched negatively on the text pattern "Mr. Annoying" (forbidden condition).
- `BLACKLIST`: Detection groups can be defined to easily link a matching pattern to the corresponding regex. `DEFAULT` is used if no detection group is specified.
```python
# print(df.iloc[2]["debug_emergency"])
{
    'text': 'Latest news\nUrgent update about Mr. Annoying',
    'EmergencyRegex': {
        'match_result': False,
        'negative_match_data': {
            'BLACKLIST': [
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
            ]
        },
        'neutral_match_data': {},
        'positive_match_data': {
            'DEFAULT': [
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
            ]
        }
    }
}
```
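As a rough sketch (assuming the dict layout printed above, and not a Melusine API), the `start`/`stop` offsets can be used to recover the matched substrings from the effective text:

```python
# Hypothetical helper that pulls matched substrings out of a debug entry
# shaped like the dict above (illustration only).
def extract_matches(debug_entry: dict) -> list[str]:
    text = debug_entry["text"]
    regex_info = debug_entry["EmergencyRegex"]
    matches = []
    for key in ("positive_match_data", "negative_match_data"):
        for group_matches in regex_info.get(key, {}).values():
            for match in group_matches:
                matches.append(text[match["start"]:match["stop"]])
    return matches

# extract_matches(df.iloc[2]["debug_emergency"]) -> ['Urgent', 'Mr. Annoying']
```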