Data Management

Overview

The mail-mcp project works with mailing list archives stored in mbox format. Data files are not tracked in Git as they can be regenerated at any time.

Data Directory Structure

data/
├── dev/                        # Maven dev@ mailing list
│   ├── 2002-07.mbox
│   ├── ...
│   └── 2025-12.mbox
└── users/                      # Maven users@ mailing list
    ├── 2002-11.mbox
    ├── ...
    └── 2025-12.mbox

mbox Files

Format

Standard Unix mbox format

Content

Plain text (ASCII/SGML) containing concatenated email messages

Size

~1.5 GB total for the dev@ and users@ lists combined (~750 MB for dev@, ~800 MB for users@)

Naming

YYYY-MM.mbox (e.g., 2024-10.mbox)

Period

dev@ from July 2002, users@ from November 2002

Data files are excluded from Git via .gitignore, which keeps the repository size manageable. Every mbox file can be regenerated with the retrieve-mbox command.
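
To illustrate the format, Python's standard-library mailbox module can read these files directly. The following is a minimal sketch, not the project's own parser; the function name summarize_mbox is hypothetical.

```python
import mailbox
from pathlib import Path


def summarize_mbox(path: Path) -> list[tuple[str, str]]:
    """Return (Message-ID, Subject) pairs for every message in an mbox file.

    Hypothetical helper for illustration; it is not part of mail-mcp.
    """
    # create=False makes a missing file an error instead of creating an empty one.
    box = mailbox.mbox(str(path), create=False)
    try:
        return [
            (str(msg.get("Message-ID", "")), str(msg.get("Subject", "")))
            for msg in box
        ]
    finally:
        box.close()
```

For example, summarize_mbox(Path("data/dev/2024-10.mbox")) would list the messages archived for October 2024, assuming that file has been retrieved.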

Retrieving Data

Single Month

poetry run retrieve-mbox --date 2024-10

This downloads the mbox file from Apache’s mail archive API and saves it as 2024-10.mbox in the current directory.

Specific Mailing List

poetry run retrieve-mbox --date 2024-10 --list users@maven.apache.org

If --list is omitted, the default list is dev@maven.apache.org.

Bulk Retrieval

To download multiple months:

# Download all months for 2024 (zero-padded {01..12} requires Bash 4+ or zsh)
for month in {01..12}; do
  poetry run retrieve-mbox --date "2024-$month"
done

# Download a specific range of years
for year in {2023..2024}; do
  for month in {01..12}; do
    poetry run retrieve-mbox --date "$year-$month"
  done
done

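Move the downloaded files into the matching data/ subdirectory (data/dev for dev@, data/users for users@). Files from different lists share the same YYYY-MM.mbox names, so move one list's files before downloading another's: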

mkdir -p data/dev
mv *.mbox data/dev/
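
For ranges that do not align with whole years, a small Python driver avoids relying on the shell's brace-expansion behaviour. This is a sketch only: month_range and retrieve_range are hypothetical names, and the command line uses just the --date and --list flags shown above.

```python
import subprocess
from collections.abc import Iterator


def month_range(start: str, end: str) -> Iterator[str]:
    """Yield YYYY-MM strings from start to end, inclusive."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    while (year, month) <= (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def retrieve_range(start: str, end: str,
                   mailing_list: str = "dev@maven.apache.org") -> None:
    """Run retrieve-mbox once per month; stops at the first failing download."""
    for date in month_range(start, end):
        subprocess.run(
            ["poetry", "run", "retrieve-mbox", "--date", date, "--list", mailing_list],
            check=True,
        )
```

Calling retrieve_range("2023-11", "2024-02") would request four months, from November 2023 through February 2024.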

Data Source

API

Apache Ponymail API

Endpoint

https://lists.apache.org/api/mbox.lua

Parameters

list (mailing list address), date (YYYY-MM)

Format

Returns mbox file content

Example:

curl "https://lists.apache.org/api/mbox.lua?list=dev@maven.apache.org&date=2024-10" \
  -o 2024-10.mbox
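
The same request can be made from Python with only the standard library. This is a minimal sketch built on the endpoint and parameters listed above; mbox_url and download_mbox are hypothetical helpers, and the sketch assumes the server accepts a percent-encoded list address (urlencode turns @ into %40).

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlencode

MBOX_ENDPOINT = "https://lists.apache.org/api/mbox.lua"


def mbox_url(mailing_list: str, date: str) -> str:
    """Build the download URL for one list and one YYYY-MM month."""
    return f"{MBOX_ENDPOINT}?{urlencode({'list': mailing_list, 'date': date})}"


def download_mbox(mailing_list: str, date: str, dest_dir: Path = Path(".")) -> Path:
    """Download one month and save it as YYYY-MM.mbox in dest_dir."""
    dest = dest_dir / f"{date}.mbox"
    with urllib.request.urlopen(mbox_url(mailing_list, date), timeout=60) as response:
        dest.write_bytes(response.read())
    return dest
```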

Data Processing Pipeline

  1. Retrieve: Download mbox files from Apache API

  2. Parse: Extract individual email messages

  3. Extract: Extract metadata (JIRA refs, decisions, etc.)

  4. Index: Store in Elasticsearch for searching

  5. Query: Search and retrieve via MCP tools
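
Steps 2 and 3 can be sketched with the standard library. The regular expression, the extract_metadata name, and the default project key set are illustrative assumptions, not the project's actual extractor. MNG is used here only as an example Jira key.

```python
import re
from email import message_from_string
from email.message import Message

# Generic issue-key pattern; on its own it also matches strings such as "UTF-8",
# so matches are filtered against a set of known project keys.
JIRA_REF = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_metadata(msg: Message,
                     project_keys: frozenset[str] = frozenset({"MNG"})) -> dict:
    """Collect the Message-ID, subject, and Jira references of one message."""
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart bodies are skipped
    subject = str(msg.get("Subject", ""))
    candidates = JIRA_REF.findall(f"{subject}\n{body}")
    refs = sorted({ref for ref in candidates if ref.split("-")[0] in project_keys})
    return {
        "message_id": str(msg.get("Message-ID", "")),
        "subject": subject,
        "jira_refs": refs,
    }
```

Filtering by project key trades recall for precision: references to projects outside the set are dropped, which may or may not suit a given list.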

Storage in Elasticsearch

Once indexed, email data is stored in Elasticsearch.

Index naming

`{prefix}-{list}` (e.g., maven-dev)

Document ID

Email Message-ID header

Data model

See src/mail_mcp/storage/schema.py
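
The naming rule and document ID can be expressed as a short sketch. The helper names are hypothetical and the real field layout lives in schema.py; the sketch assumes the index suffix is the local part of the list address and that the Message-ID is stored unchanged, angle brackets included.

```python
def index_name(prefix: str, list_address: str) -> str:
    """Derive the index name, e.g. ("maven", "dev@maven.apache.org") -> "maven-dev"."""
    local_part = list_address.split("@", 1)[0]
    # Elasticsearch index names must be lowercase.
    return f"{prefix}-{local_part}".lower()


def to_index_action(prefix: str, list_address: str,
                    message_id: str, fields: dict) -> dict:
    """Build one bulk-style action that uses the Message-ID as the document ID."""
    return {
        "_index": index_name(prefix, list_address),
        "_id": message_id,
        "_source": dict(fields),
    }
```

Because the document ID is derived from the message itself, indexing the same mbox file twice overwrites existing documents rather than duplicating them.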

Viewing Indexed Data

# List indices
curl "http://localhost:59200/_cat/indices?v"

# Count documents
curl http://localhost:59200/maven-dev/_count

# Get a specific message (replace <message-id> with the URL-encoded Message-ID)
curl "http://localhost:59200/maven-dev/_doc/<message-id>"
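
Message-IDs usually contain characters such as <, >, and @ that need percent-encoding in a URL path. The sketch below shows one way to build the request. doc_url and get_message are hypothetical helpers, and whether the stored ID includes the angle brackets is an assumption to check against the indexed data.

```python
import json
import urllib.request
from urllib.parse import quote

ES_URL = "http://localhost:59200"


def doc_url(index: str, message_id: str, base_url: str = ES_URL) -> str:
    """Build the _doc URL, percent-encoding every reserved character in the ID."""
    return f"{base_url}/{index}/_doc/{quote(message_id, safe='')}"


def get_message(index: str, message_id: str) -> dict:
    """Fetch one document; raises urllib.error.HTTPError if it does not exist."""
    with urllib.request.urlopen(doc_url(index, message_id), timeout=10) as response:
        return json.load(response)
```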

Data Lifecycle

Initial Setup

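# 1. Retrieve data (saved as 2024-10.mbox in the current directory)
poetry run retrieve-mbox --date 2024-10

# 2. Move it into the data directory
mkdir -p data/dev
mv 2024-10.mbox data/dev/

# 3. Start Elasticsearch
docker compose up -d elasticsearch

# 4. Index data (when implemented)
poetry run index-mbox data/dev/2024-10.mbox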

Updates

Apache’s mail archives are organized by month, so the file for the current month keeps growing as new messages arrive. Re-retrieve the latest month periodically:

# Get current month
poetry run retrieve-mbox --date $(date +%Y-%m)
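
To catch up after a longer gap, it can help to list which months are absent locally. This is a minimal sketch assuming the data/<list>/YYYY-MM.mbox layout described earlier; missing_months is a hypothetical helper.

```python
from datetime import date
from pathlib import Path


def missing_months(list_dir: Path, first: str, today: date) -> list[str]:
    """Return YYYY-MM months between first and today that have no local mbox file."""
    have = {path.stem for path in list_dir.glob("*.mbox")}
    start_year, start_month = map(int, first.split("-"))
    # Represent each month as a single integer so the range is easy to iterate.
    start = start_year * 12 + (start_month - 1)
    end = today.year * 12 + (today.month - 1)
    wanted = (f"{i // 12:04d}-{i % 12 + 1:02d}" for i in range(start, end + 1))
    return [month for month in wanted if month not in have]
```

For the dev@ list this could be called as missing_months(Path("data/dev"), "2002-07", date.today()), with each result passed to retrieve-mbox --date.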

Cleanup

# Remove local mbox files
rm -rf data/

# Clear Elasticsearch data
docker compose down -v

Data Size Considerations

Per-month size

Typically 1-5 MB per mbox file

Total archive

~1.5 GB for dev@ and users@ lists (22+ years each)

Elasticsearch index

Approximately 2-3x the mbox size (depends on analysis settings)

Recommended disk space

~2 GB for mbox files, ~5 GB including Elasticsearch indices

The data/ directory is gitignored to keep the repository size manageable. For continuous integration or new development environments, retrieve only the data needed for testing.
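
The figures above can be checked against a local checkout with a short script. This is a sketch: estimate_disk_usage is a hypothetical helper, and the 2x to 3x factor is simply the rule of thumb quoted above, not a measured value.

```python
from pathlib import Path


def estimate_disk_usage(data_dir: Path,
                        index_factor: tuple[float, float] = (2.0, 3.0)) -> dict[str, float]:
    """Sum the mbox sizes under data_dir and project the Elasticsearch index size in MB."""
    mbox_bytes = sum(path.stat().st_size for path in data_dir.rglob("*.mbox"))
    mbox_mb = mbox_bytes / 1_000_000
    low, high = index_factor
    return {
        "mbox_mb": mbox_mb,
        "index_mb_low": mbox_mb * low,
        "index_mb_high": mbox_mb * high,
    }
```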

Multiple Mailing Lists

The system indexes the following Apache Maven mailing lists:

dev@maven.apache.org

Main development discussions (~750 MB, from July 2002)

users@maven.apache.org

User questions and support (~800 MB, from November 2002)

Additional lists (for example issues@, commits@, or announce@) can be added by configuring MAIL_MCP_MAILING_LISTS.

Each list is stored in its own subdirectory and indexed separately:

data/
├── dev/      # → Elasticsearch index: maven-dev
└── users/    # → Elasticsearch index: maven-users

Backup and Restore

Backup mbox Files

# Create tar archive
tar -czf maven-mbox-backup-$(date +%Y%m%d).tar.gz data/

# Or sync to remote storage
rsync -avz data/ backup-server:/backups/maven-mcp/

Restore from Backup

# Extract tar archive
tar -xzf maven-mbox-backup-YYYYMMDD.tar.gz

# Or sync from remote storage
rsync -avz backup-server:/backups/maven-mcp/ data/
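
Where tar is not available, the archive step can be reproduced with Python's tarfile module. This is a minimal sketch; backup_data is a hypothetical helper that mirrors the file naming used in the tar command above.

```python
import tarfile
from datetime import date
from pathlib import Path
from typing import Optional


def backup_data(data_dir: Path, dest_dir: Path, today: Optional[date] = None) -> Path:
    """Write data_dir into dest_dir/maven-mbox-backup-YYYYMMDD.tar.gz and return its path."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    archive = dest_dir / f"maven-mbox-backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative (data/dev/...), matching the tar command.
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

Keep dest_dir outside data_dir so the archive does not end up inside its own contents.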

Elasticsearch Backup

See Elasticsearch documentation for snapshot/restore procedures: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html