
Getting started with Melusine

Let's run emergency detection with Melusine:

  • Load a fake email dataset
  • Load a demonstration pipeline
  • Run the pipeline
    • Apply email cleaning transformations
    • Apply emergency detection

Input data

Email datasets typically contain information about:

  • Email sender
  • Email recipients
  • Email subject/header
  • Email body
  • Attachments data

The present tutorial only makes use of the body and header data.

     body                              header
0    This is an ëmèrgénçy              Help
1    How is life ?                     Hey !
2    Urgent update about Mr. Annoying  Latest news
3    Please call me now                URGENT
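
The dataset above ships with Melusine and can be loaded and inspected directly. A minimal sketch, assuming the demo dataset is a pandas DataFrame, as the indexing used later in this tutorial suggests:

from melusine.data import load_email_data

# Load the demo email dataset and display the columns used in this tutorial
df = load_email_data()
print(df[["body", "header"]])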

Code

Typical code for a Melusine-based application looks like this:

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")  # (1)!

# Run the pipeline
df = pipeline.transform(df)
  1. This tutorial uses demo_pipeline, one of the default pipeline configurations. Melusine users will typically define their own pipeline configuration. See the Configurations tutorial for more details.

Output Data

The pipeline created extra columns in the dataset. Some columns are temporary variables required by detectors (e.g. normalized_body) and some are detection results with direct business value (e.g. emergency_result).

     body                              header       normalized_body                   emergency_result
0    This is an ëmèrgénçy              Help         This is an emergency              True
1    How is life ?                     Hey !        How is life ?                     False
2    Urgent update about Mr. Annoying  Latest news  Urgent update about Mr. Annoying  False
3    Please call me now                URGENT       Please call me now                True
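
Since the detection result is a regular boolean column, downstream business logic can filter on it with standard pandas operations (a short sketch; the column names are the ones produced by the demo pipeline above):

# Keep only the emails flagged as urgent by the demo pipeline
urgent_emails = df[df["emergency_result"]]
print(urgent_emails[["header", "body"]])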

Pipeline Steps

Illustration of the pipeline used in the present tutorial:

---
title: Demonstration pipeline
---
flowchart LR
    Input[[Email]] --> A(Cleaner)
    A(Cleaner) --> C(Normalizer)
    C --> F(Emergency\nDetector)
    F --> Output[[Qualified Email]]
  • Cleaner: Cleaning transformations such as the uniformization of line breaks (\r\n -> \n).
  • Normalizer: Text normalisation to delete/replace non-ASCII characters (éöà -> eoa). Both transformations are sketched conceptually after this list.
  • EmergencyDetector: Detection of urgent emails.
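
For intuition only, the two text transformations can be approximated with the Python standard library. This is a conceptual sketch of what the Cleaner and Normalizer steps achieve, not Melusine's actual implementation:

import unicodedata

def clean(text: str) -> str:
    # Uniformize line breaks, as the Cleaner step does
    return text.replace("\r\n", "\n")

def normalize(text: str) -> str:
    # Decompose accented characters and drop combining marks (éöà -> eoa)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))

print(normalize(clean("This is an ëmèrgénçy\r\n")))  # 'This is an emergency\n'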

Info

This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:

  • Email Segmentation: Split an email conversation into unitary messages
  • ContentTagging: Associate tags (SIGNATURE, FOOTER, BODY) with parts of messages
  • Appointment detection: For instance, detect "construction work will take place on 01/01/2024" as an appointment email.
  • More on preprocessing in the MelusineTransformers tutorial
  • More on detectors in the MelusineDetector tutorial

Debug Mode

End users typically want to know what led Melusine to a specific detection result. The debug mode generates additional explainability information.

from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)

A new column debug_emergency is created.

     ...  emergency_result  debug_emergency
0    ...  True              [details_below]
1    ...  False             [details_below]
2    ...  False             [details_below]
3    ...  True              [details_below]

Inspecting the debug data gives a lot of information:

  • text: Effective text considered for detection.
  • EmergencyRegex: Melusine used an EmergencyRegex object to run the detection.
  • match_result: The EmergencyRegex did not match the text.
  • positive_match_data: The EmergencyRegex positively matched the text pattern "Urgent" (required condition).
  • negative_match_data: The EmergencyRegex negatively matched the text pattern "Mr. Annoying" (forbidden condition).
  • BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used when no detection group is specified.
# print(df.iloc[2]["debug_emergency"])
{
  'text': 'Latest news\nUrgent update about Mr. Annoying',
  'EmergencyRegex': {
    'match_result': False,
    'negative_match_data': {
      'BLACKLIST': [
        {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
      ]
    },
    'neutral_match_data': {},
    'positive_match_data': {
      'DEFAULT': [
        {'match_text': 'Urgent', 'start': 12, 'stop': 18}
      ]
    }
  }
}
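
Individual pieces of the debug output can be retrieved like any nested Python dictionary, for example to list the blacklisted patterns that blocked the detection (a sketch based on the structure shown above):

# Explain why email 2 was not flagged as urgent
debug_info = df.iloc[2]["debug_emergency"]
blacklist_matches = debug_info["EmergencyRegex"]["negative_match_data"]["BLACKLIST"]
print([match["match_text"] for match in blacklist_matches])  # ['Mr. Annoying']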