# Getting started with Melusine

Let's run emergency detection with Melusine:

- Load a fake email dataset
- Load a demonstration pipeline
- Run the pipeline
    - Apply email cleaning transformations
    - Apply emergency detection
## Input data
Email datasets typically contain information about:
- Email sender
- Email recipients
- Email subject/header
- Email body
- Attachments data
The present tutorial only makes use of the body and header data.
|   | body | header |
|---|------|--------|
| 0 | This is an ëmèrgénçy | Help |
| 1 | How is life ? | Hey ! |
| 2 | Urgent update about Mr. Annoying | Latest news |
| 3 | Please call me now | URGENT |
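For reference, a hand-built equivalent of this toy dataset is just a plain pandas DataFrame (a sketch for illustration only; the tutorial itself loads the data with `load_email_data`):

```python
import pandas as pd

# Hand-built equivalent of the toy dataset shown above (illustration only;
# the tutorial loads it with melusine.data.load_email_data).
df = pd.DataFrame(
    {
        "body": [
            "This is an ëmèrgénçy",
            "How is life ?",
            "Urgent update about Mr. Annoying",
            "Please call me now",
        ],
        "header": ["Help", "Hey !", "Latest news", "URGENT"],
    }
)
```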
## Code
Typical code for a Melusine-based application looks like this:
```python
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
```
## Output Data
The pipeline created extra columns in the dataset. Some columns are temporary variables required by detectors (e.g. `normalized_body`) and some are detection results with direct business value (e.g. `emergency_result`).
|   | body | header | normalized_body | emergency_result |
|---|------|--------|-----------------|------------------|
| 0 | This is an ëmèrgénçy | Help | This is an emergency | True |
| 1 | How is life ? | Hey ! | How is life ? | False |
| 2 | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False |
| 3 | Please call me now | URGENT | Please call me now | True |
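Since the output is a regular DataFrame, the detection column can feed downstream logic directly. For instance, filtering with plain pandas (not a Melusine API):

```python
# Keep only the emails flagged as urgent by the pipeline.
urgent_emails = df[df["emergency_result"]]
print(urgent_emails[["header", "body"]])
```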
## Pipeline Steps
Illustration of the pipeline used in the present tutorial (a standalone sketch of the first two steps follows the list):

- `Cleaner`: Cleaning transformations such as uniformization of line breaks (`\r\n` -> `\n`).
- `Normalizer`: Text normalisation to delete/replace non-ASCII characters (`éöà` -> `eoa`).
- `EmergencyDetector`: Detection of urgent emails.
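The snippet below is not Melusine code; it is a minimal standalone sketch of what the first two steps conceptually do, using only the Python standard library:

```python
import unicodedata


def uniformize_line_breaks(text: str) -> str:
    """Sketch of the Cleaner step: uniformize line breaks (\\r\\n -> \\n)."""
    return text.replace("\r\n", "\n")


def normalize_text(text: str) -> str:
    """Sketch of the Normalizer step: strip accents (éöà -> eoa)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(normalize_text(uniformize_line_breaks("This is an ëmèrgénçy\r\n")))
# "This is an emergency\n"
```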
**Info**

This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:

- Email segmentation: split an email conversation into unitary messages
- ContentTagging: associate tags (SIGNATURE, FOOTER, BODY) with parts of messages
- Appointment detection: for example, detect "construction work will take place on 01/01/2024" as an appointment email
- More on preprocessing in the MelusineTransformers tutorial
- More on detectors in the MelusineDetector tutorial
## Debug Mode
End users typically want to know what led Melusine to a specific detection result. The debug mode generates additional explainability info.
```python
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
```
A new column `debug_emergency` is created.
|   | ... | emergency_result | debug_emergency |
|---|-----|------------------|-----------------|
| 0 | ... | True | [details_below] |
| 1 | ... | False | [details_below] |
| 2 | ... | False | [details_below] |
| 3 | ... | True | [details_below] |
Inspecting the debug data gives a lot of info:

- `text`: Effective text considered for detection.
- `EmergencyRegex`: Melusine used an `EmergencyRegex` object to run the detection.
- `match_result`: The `EmergencyRegex` did not match the text.
- `positive_match_data`: The `EmergencyRegex` matched positively on the text pattern "Urgent" (required condition).
- `negative_match_data`: The `EmergencyRegex` matched negatively on the text pattern "Mr. Annoying" (forbidden condition).
- `BLACKLIST`: Detection groups can be defined to easily link a matching pattern to the corresponding regex. `DEFAULT` is used if no detection group is specified.
```python
# print(df.iloc[2]["debug_emergency"])
{
    'text': 'Latest news\nUrgent update about Mr. Annoying',
    'EmergencyRegex': {
        'match_result': False,
        'negative_match_data': {
            'BLACKLIST': [
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
            ]
        },
        'neutral_match_data': {},
        'positive_match_data': {
            'DEFAULT': [
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
            ]
        }
    }
}
```
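As a rough sketch (assuming the dict layout printed above, and not a Melusine API), the `start`/`stop` offsets can be used to recover the matched substrings from the effective text:

```python
# Hypothetical helper that pulls matched substrings out of a debug entry
# shaped like the dict above (illustration only).
def extract_matches(debug_entry: dict) -> list[str]:
    text = debug_entry["text"]
    regex_info = debug_entry["EmergencyRegex"]
    matches = []
    for key in ("positive_match_data", "negative_match_data"):
        for group_matches in regex_info.get(key, {}).values():
            for match in group_matches:
                matches.append(text[match["start"]:match["stop"]])
    return matches

# extract_matches(df.iloc[2]["debug_emergency"]) -> ['Urgent', 'Mr. Annoying']
```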