
Getting started with Melusine

Let's run emergency detection with Melusine:

  • Load a fake email dataset
  • Load a demonstration pipeline
  • Run the pipeline
    • Apply email cleaning transformations
    • Apply emergency detection

Input data

Email datasets typically contain information about:

  • Email sender
  • Email recipients
  • Email subject/header
  • Email body
  • Attachments data

The present tutorial only makes use of the body and header data.

   body                              header
0  This is an ëmèrgénçy              Help
1  How is life ?                     Hey !
2  Urgent update about Mr. Annoying  Latest news
3  Please call me now                URGENT
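
The dataset is essentially a small table with these two columns. As a purely illustrative sketch (assuming, as in the rest of this tutorial, that the data is held in a pandas DataFrame), an equivalent input could be built by hand like this:

import pandas as pd

# Hand-built equivalent of the tutorial dataset (illustration only).
df = pd.DataFrame(
    {
        "body": [
            "This is an ëmèrgénçy",
            "How is life ?",
            "Urgent update about Mr. Annoying",
            "Please call me now",
        ],
        "header": ["Help", "Hey !", "Latest news", "URGENT"],
    }
)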

Code

Typical code for a Melusine-based application looks like this:

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")  

# Run the pipeline
df = pipeline.transform(df)

Output Data

The pipeline adds extra columns to the dataset. Some are temporary variables required by detectors (e.g. normalized_body) and some are detection results with direct business value (e.g. emergency_result).

   body                              header       normalized_body                   emergency_result
0  This is an ëmèrgénçy              Help         This is an emergency              True
1  How is life ?                     Hey !        How is life ?                     False
2  Urgent update about Mr. Annoying  Latest news  Urgent update about Mr. Annoying  False
3  Please call me now                URGENT       Please call me now                True
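
Because the detection result is a plain boolean column, downstream business logic boils down to standard DataFrame operations. For instance, assuming df is the DataFrame returned by the pipeline above, the urgent emails can be selected like this:

# Keep only the emails flagged as urgent by the emergency detection.
urgent_emails = df[df["emergency_result"]]
print(urgent_emails[["header", "body"]])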

Pipeline Steps

The pipeline used in this tutorial contains the following steps:

  • Cleaner: Cleaning transformations such as standardizing line breaks (\r\n -> \n).
  • Normalizer: Text normalization that removes or replaces accented and other non-ASCII characters (éöà -> eoa); a plain-Python sketch of this kind of normalization is given after this list.
  • EmergencyDetector: Detection of urgent emails.
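
As a rough illustration of what the Normalizer step does conceptually (this is not Melusine's actual Normalizer code), accented characters can be stripped with the standard library alone:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    # Illustration only; not Melusine's actual implementation.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("This is an ëmèrgénçy"))  # -> This is an emergency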

Info

This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:

  • Email Segmentation: Split an email conversation into unitary messages.
  • ContentTagging: Associate tags (SIGNATURE, FOOTER, BODY) with parts of messages.
  • Appointment detection: For example, detect "construction work will take place on 01/01/2024" as an appointment email.
  • More on preprocessing in the MelusineTransformers tutorial.
  • More on detectors in the MelusineDetector tutorial.

Debug Mode

End users typically want to know what led Melusine to a specific detection result. The debug mode generates additional explainability information.

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the demo pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)

A new column debug_emergency is created.

   ...  emergency_result  debug_emergency
0  ...  True              [details_below]
1  ...  False             [details_below]
2  ...  False             [details_below]
3  ...  True              [details_below]

Inspecting the debug data gives a lot of info:

  • text: The effective text considered for detection.
  • EmergencyRegex: Melusine used an EmergencyRegex object to run the detection.
  • match_result: The EmergencyRegex did not match the text.
  • positive_match_data: The EmergencyRegex positively matched the text pattern "Urgent" (required condition).
  • negative_match_data: The EmergencyRegex negatively matched the text pattern "Mr. Annoying" (forbidden condition). A minimal sketch of how positive and negative matches combine is given after the example output below.
  • BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.
# print(df.iloc[2]["debug_emergency"])
{
  'text': 'Latest news\nUrgent update about Mr. Annoying',
  'EmergencyRegex': {
    'match_result': False,
    'negative_match_data': {
      'BLACKLIST': [
        {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
      ]
    },
    'neutral_match_data': {},
    'positive_match_data': {
      'DEFAULT': [
        {'match_text': 'Urgent', 'start': 12, 'stop': 18}
      ]
    }
  }
}
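
To make the positive/negative matching logic concrete, here is a minimal, self-contained sketch of how a required pattern and a forbidden pattern can combine into a single match result. It only mimics the behaviour described above; it is not Melusine's actual EmergencyRegex implementation, and the patterns are made up for this example.

import re

def emergency_match(text: str) -> dict:
    # Illustrative only: a positive (required) pattern and a negative
    # (forbidden) pattern, combined into a single boolean result.
    positive = re.compile(r"urgent|emergency", re.IGNORECASE)
    negative = re.compile(r"Mr\. Annoying")  # hypothetical blacklist pattern

    positive_matches = [m.group(0) for m in positive.finditer(text)]
    negative_matches = [m.group(0) for m in negative.finditer(text)]

    # The detection is positive only if a required pattern matches
    # and no forbidden pattern matches.
    return {
        "match_result": bool(positive_matches) and not negative_matches,
        "positive_match_data": positive_matches,
        "negative_match_data": negative_matches,
    }

print(emergency_match("Latest news\nUrgent update about Mr. Annoying"))
# {'match_result': False, 'positive_match_data': ['Urgent'], 'negative_match_data': ['Mr. Annoying']}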