Getting started with Melusine
Let's run emergency detection with Melusine:
- Load a fake email dataset
- Load a demonstration pipeline
- Run the pipeline
    - Apply email cleaning transformations
    - Apply emergency detection
Input data
Email datasets typically contain information about:
- Email sender
- Email recipients
- Email subject/header
- Email body
- Attachments data
The present tutorial only makes use of the body and header data.
| | body | header |
|---|---|---|
| 0 | This is an ëmèrgénçy | Help |
| 1 | How is life ? | Hey ! |
| 2 | Urgent update about Mr. Annoying | Latest news |
| 3 | Please call me now | URGENT |
Code
A typical Melusine-based application looks like this:
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline
# Load an email dataset
df = load_email_data()
# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline") # (1)!
# Run the pipeline
df = pipeline.transform(df)
1. This tutorial uses one of the default pipeline configurations, demo_pipeline. Melusine users will typically define their own pipeline configuration. See more in the Configurations tutorial.
Output Data
The pipeline created extra columns in the dataset.
Some columns are temporary variables required by detectors (e.g. normalized_body) and some are detection results with direct business value (e.g. emergency_result).
| | body | header | normalized_body | emergency_result |
|---|---|---|---|---|
| 0 | This is an ëmèrgénçy | Help | This is an emergency | True |
| 1 | How is life ? | Hey ! | How is life ? | False |
| 2 | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False |
| 3 | Please call me now | URGENT | Please call me now | True |
Pipeline Steps
Illustration of the pipeline used in the present tutorial:
---
title: Demonstration pipeline
---
flowchart LR
Input[[Email]] --> A(Cleaner)
A(Cleaner) --> C(Normalizer)
C --> F(Emergency\nDetector)
F --> Output[[Qualified Email]]
- Cleaner: Cleaning transformations such as uniformization of line breaks (\r\n -> \n).
- Normalizer: Text normalization to delete or replace non-ASCII characters such as accented letters (éöà -> eoa).
- EmergencyDetector: Detection of urgent emails.
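To make the three steps concrete, here is a minimal standard-library sketch of the same flow. It does not use Melusine and is not how Melusine implements these steps: the function names (clean, normalize, detect_emergency) and the two regex patterns are assumptions, chosen only so that the results match the tutorial's output table.

```python
import re
import unicodedata


def clean(text: str) -> str:
    # Cleaner step: uniformize line breaks (CRLF and lone CR become LF).
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize(text: str) -> str:
    # Normalizer step: decompose accented characters (NFKD),
    # then drop the combining marks so that "é" becomes "e".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical patterns (assumptions, not Melusine's actual configuration).
REQUIRED = re.compile(r"urgent|emergency", re.IGNORECASE)
FORBIDDEN = re.compile(r"Mr\. Annoying")


def detect_emergency(header: str, body: str) -> bool:
    # Detector step: a required pattern must match and no forbidden one may.
    text = normalize(clean(f"{header}\n{body}"))
    return bool(REQUIRED.search(text)) and not FORBIDDEN.search(text)


# The four (header, body) pairs from the tutorial's input table.
emails = [
    ("Help", "This is an ëmèrgénçy"),
    ("Hey !", "How is life ?"),
    ("Latest news", "Urgent update about Mr. Annoying"),
    ("URGENT", "Please call me now"),
]
results = [detect_emergency(header, body) for header, body in emails]
print(results)  # [True, False, False, True]
```

Email 2 contains "Urgent" but is rejected by the forbidden pattern, which is why its emergency_result is False in the output table.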
Info
This demonstration pipeline is kept minimal but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:
- Email Segmentation: Split an email conversation into individual messages
- ContentTagging: Associate tags (SIGNATURE, FOOTER, BODY) with parts of messages
- Appointment detection: For example, detect "construction work will take place on 01/01/2024" as an appointment email.
- More on preprocessing in the MelusineTransformers tutorial
- More on detectors in the MelusineDetector tutorial
Debug Mode
End users typically want to know what led Melusine to a specific detection result. The debug mode generates additional explainability information.
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline
# Load an email dataset
df = load_email_data()
# Activate debug mode
df.debug = True
# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")
# Run the pipeline
df = pipeline.transform(df)
A new column debug_emergency is created.
| | ... | emergency_result | debug_emergency |
|---|---|---|---|
| 0 | ... | True | [details_below] |
| 1 | ... | False | [details_below] |
| 2 | ... | False | [details_below] |
| 3 | ... | True | [details_below] |
Inspecting the debug data gives a lot of info:
- text: Effective text considered for detection.
- EmergencyRegex: Melusine used an EmergencyRegex object to run detection.
- match_result: The EmergencyRegex did not match the text.
- positive_match_data: The EmergencyRegex positively matched the text pattern "Urgent" (required condition).
- negative_match_data: The EmergencyRegex negatively matched the text pattern "Mr. Annoying" (forbidden condition).
- BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.
# print(df.iloc[2]["debug_emergency"])
{
    'text': 'Latest news\nUrgent update about Mr. Annoying',
    'EmergencyRegex': {
        'match_result': False,
        'negative_match_data': {
            'BLACKLIST': [
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
            ]
        },
        'neutral_match_data': {},
        'positive_match_data': {
            'DEFAULT': [
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
            ]
        }
    }
}
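To show where the start and stop offsets come from, the sketch below rebuilds similar match data with Python's re module alone. It is not Melusine's EmergencyRegex: the helper names (collect_matches, explain) and the pattern groups are assumptions made for illustration, reusing the group names shown in the debug output.

```python
import re

# Hypothetical pattern groups (assumptions mirroring the debug output).
POSITIVE = {"DEFAULT": r"urgent|emergency"}
NEGATIVE = {"BLACKLIST": r"Mr\. Annoying"}


def collect_matches(groups: dict, text: str) -> dict:
    # Return {group: [match dicts]} for every group with at least one match.
    data = {}
    for group, pattern in groups.items():
        matches = [
            {"match_text": m.group(), "start": m.start(), "stop": m.end()}
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)
        ]
        if matches:
            data[group] = matches
    return data


def explain(text: str) -> dict:
    positive = collect_matches(POSITIVE, text)
    negative = collect_matches(NEGATIVE, text)
    return {
        # Positive overall only if a required pattern matched
        # and no forbidden pattern matched.
        "match_result": bool(positive) and not negative,
        "negative_match_data": negative,
        "neutral_match_data": {},
        "positive_match_data": positive,
    }


debug = explain("Latest news\nUrgent update about Mr. Annoying")
print(debug["match_result"])         # False
print(debug["positive_match_data"])  # "Urgent" found at offsets 12 to 18
print(debug["negative_match_data"])  # "Mr. Annoying" found at offsets 32 to 44
```

The offsets index into the effective text (header, line break, body), so text[12:18] is "Urgent" and text[32:44] is "Mr. Annoying".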