Getting started with Melusine

Let's run emergency detection with Melusine:

Load a fake email dataset
Load a demonstration pipeline
Run the pipeline
- Apply email cleaning transformations
- Apply emergency detection

Input data

Email datasets typically contain information about:

Email sender
Email recipients
Email subject/header
Email body
Attachments data

The present tutorial only makes use of the body and header data.

	body	header
0	This is an ëmèrgénçy	Help
1	How is life ?	Hey !
2	Urgent update about Mr. Annoying	Latest news
3	Please call me now	URGENT

Code

A typical code for a melusine-based application looks like this:

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")  # (1)!

# Run the pipeline
df = pipeline.transform(df)

This tutorial uses one of the default pipeline configuration demo_pipeline. Melusine users will typically define their own pipeline configuration. See more in the Configurations tutorial

Output Data

The pipeline created extra columns in the dataset. Some columns are temporary variables required by detectors (ex: normalized_body) and some are detection results with direct business value (ex: emergency_result).

	body	header	normalized_body	emergency_result
0	This is an ëmèrgénçy	Help	This is an emergency	True
1	How is life ?	Hey !	How is life ?	False
2	Urgent update about Mr. Annoying	Latest news	Urgent update about Mr. Annoying	False
3	Please call me now	URGENT	Please call me now	True

Pipeline Steps

Illustration of the pipeline used in the present tutorial:

---
title: Demonstration pipeline
---
flowchart LR
    Input[[Email]] --> A(Cleaner)
    A(Cleaner) --> C(Normalizer)
    C --> F(Emergency\nDetector)
    F --> Output[[Qualified Email]]

Cleaner: Cleaning transformations such as uniformization of line breaks (\r\n -> \n).
Normalizer: Text normalisation to delete/replace non utf8 characters (éöà -> eoa).
EmergencyDetector: Detection of urgent emails.

Info

This demonstration pipeline is kept minimal but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:

Email Segmentation : Split email conversation into unitary messages
ContentTagging : Associate tags (SIGNATURE, FOOTER, BODY) to parts of messages
Appointment detection : For exemple, detect "construction work will take place on 01/01/2024" as an appointment email.
More on preprocessing in the MelusineTransformers tutorial
More on detectors in the MelusineDetector tutorial

Debug Mode

End users typically want to know what lead melusine to a specific detection result. The debug mode generates additional explainability info.

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)

A new column debug_emergency is created.

	...	emergency_result	debug_emergency
0	...	True	[details_below]
1	...	False	[details_below]
2	...	False	[details_below]
3	...	True	[details_below]

Inspecting the debug data gives a lot of info:

text: Effective text considered for detection.
EmergencyRegex: melusine used an EmergencyRegex object to run detection.
match_result: The EmergencyRegex did not match the text.
positive_match_data: The EmergencyRegex matched positively the text pattern "Urgent" (Required condition).
negative_match_data: The EmergencyRegex matched negatively the text pattern "Mr. Annoying" (Forbidden condition).
BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.

# print(df.iloc[2]["debug_emergency"])
{
  'text': 'Latest news\nUrgent update about Mr. Annoying'},
  'EmergencyRegex': {
    'match_result': False,
    'negative_match_data': {
      'BLACKLIST': [
        {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
      ]},
    'neutral_match_data': {},
    'positive_match_data': {
      'DEFAULT': [
        {'match_text': 'Urgent', 'start': 12, 'stop': 18}
      ]
    }
  }