Data Management

Overview

The mail-mcp project works with mailing list archives stored in mbox format. Data files are not tracked in Git as they can be regenerated at any time.

Data Directory Structure

data/
├── dev/                        # Maven dev@ mailing list
│   ├── 2002-07.mbox
│   ├── ...
│   └── 2025-12.mbox
└── users/                      # Maven users@ mailing list
    ├── 2002-11.mbox
    ├── ...
    └── 2025-12.mbox

mbox Files

Format: Standard Unix mbox format
Content: Email messages stored as ASCII/SGML text, concatenated, with each message introduced by a "From " separator line
Size: ~1.5 GB total for the dev@ and users@ lists combined (roughly 750-800 MB each)
Naming: YYYY-MM.mbox (e.g., 2024-10.mbox)
Period: dev@ from July 2002, users@ from November 2002

Data files are excluded from Git (.gitignore). All mbox files can be regenerated using the retrieve-mbox command. This keeps the repository size manageable.
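
As a small sketch of the naming convention above, the following Python helper maps a list name and a month to the expected file location. The function name `mbox_path` and the `root` default are illustrative assumptions, not part of the project's API; only the directory layout and the YYYY-MM.mbox naming come from this document.

```python
import re
from pathlib import Path

# Four-digit year, a dash, and a month between 01 and 12.
_MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def mbox_path(list_name: str, month: str, root: str = "data") -> Path:
    """Return the expected path of one monthly archive.

    Hypothetical helper: it only encodes the data/<list>/YYYY-MM.mbox
    layout described in this document.
    """
    if not _MONTH_RE.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return Path(root) / list_name / f"{month}.mbox"


print(mbox_path("dev", "2024-10").as_posix())
```

Validating the month up front avoids creating oddly named files when a date is mistyped.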

Retrieving Data

Single Month

poetry run retrieve-mbox --date 2024-10

This downloads the mbox file from Apache’s mail archive API and saves it as 2024-10.mbox in the current directory.

Specific Mailing List

poetry run retrieve-mbox --date 2024-10 --list users@maven.apache.org

The default list is dev@maven.apache.org.

Bulk Retrieval

To download multiple months:

# Download all months for 2024
# (zero-padded brace expansion such as {01..12} needs bash 4+ or zsh)
for month in {01..12}; do
  poetry run retrieve-mbox --date "2024-$month"
done

# Download every month of 2023 and 2024
for year in {2023..2024}; do
  for month in {01..12}; do
    poetry run retrieve-mbox --date "$year-$month"
  done
done

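Move the downloaded files to the appropriate data/ subdirectory. Files are named by month only, so move one list's files before retrieving another list; otherwise same-named files would overwrite each other: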

mkdir -p data/dev
mv *.mbox data/dev/
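
The bulk-retrieval loops above cover whole calendar years. For a range that starts or ends mid-year (for example from July 2002), a minimal Python sketch can generate the month values instead. The helper name `month_range` is hypothetical; the only project command it refers to is retrieve-mbox with its --date option.

```python
from typing import Iterator


def month_range(start: str, end: str) -> Iterator[str]:
    """Yield YYYY-MM strings from `start` to `end`, both inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    # Count months from year 0 so the loop is plain integer arithmetic.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    for index in range(first, last + 1):
        year, month0 = divmod(index, 12)
        yield f"{year:04d}-{month0 + 1:02d}"


if __name__ == "__main__":
    # Print one retrieve-mbox command per month of the second half of 2002.
    for month in month_range("2002-07", "2002-12"):
        print(f"poetry run retrieve-mbox --date {month}")
```

The script only prints the commands, so the list can be reviewed before anything is downloaded.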

Data Source

API: Apache Pony Mail API
Endpoint: https://lists.apache.org/api/mbox.lua
Parameters: list (mailing list address), date (YYYY-MM)
Format: Returns mbox file content

Example:

curl "https://lists.apache.org/api/mbox.lua?list=dev@maven.apache.org&date=2024-10" \
  -o 2024-10.mbox
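
The same request can be sketched in Python using only the standard library. The endpoint and the two parameters are the ones listed above; the function names `build_mbox_url` and `download_mbox` are hypothetical, and the sketch assumes the server accepts a percent-encoded `@` in the list address, which is standard URL behavior but has not been verified here.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen

MBOX_ENDPOINT = "https://lists.apache.org/api/mbox.lua"


def build_mbox_url(list_address: str, month: str) -> str:
    """Build the download URL; urlencode percent-escapes the '@'."""
    query = urlencode({"list": list_address, "date": month})
    return f"{MBOX_ENDPOINT}?{query}"


def download_mbox(list_address: str, month: str, dest: str) -> None:
    """Stream one month's archive to `dest` (performs a network request)."""
    url = build_mbox_url(list_address, month)
    with urlopen(url, timeout=60) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


if __name__ == "__main__":
    print(build_mbox_url("dev@maven.apache.org", "2024-10"))
```

Streaming with `shutil.copyfileobj` avoids holding a multi-megabyte archive in memory.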

Data Processing Pipeline

Retrieve: Download mbox files from Apache API
Parse: Extract individual email messages
Extract: Extract metadata (JIRA refs, decisions, etc.)
Index: Store in Elasticsearch for searching
Query: Search and retrieve via MCP tools
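
The Parse and Extract steps can be illustrated with Python's standard `mailbox` module. This is a sketch under stated assumptions, not the project's implementation: the function name `parse_mbox`, the record fields, and the JIRA key pattern (an uppercase project key, a dash, and a number, such as MNG-1234) are all hypothetical.

```python
import mailbox
import re
from typing import Dict, Iterator

# Assumed shape of a JIRA reference: uppercase project key, dash, number.
JIRA_REF = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def parse_mbox(path: str) -> Iterator[Dict[str, object]]:
    """Yield one metadata record per message in the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            # str() guards against header objects for non-ASCII values.
            subject = str(message.get("Subject", ""))
            yield {
                "message_id": str(message.get("Message-ID", "")).strip(),
                "subject": subject,
                "jira_refs": sorted(set(JIRA_REF.findall(subject))),
            }
    finally:
        box.close()
```

A caller would iterate over `parse_mbox("data/dev/2024-10.mbox")` and hand each record to the indexing step. Only the subject line is scanned here; scanning message bodies would follow the same pattern.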

Storage in Elasticsearch

Once indexed, email data is stored in Elasticsearch.

Index naming: `{prefix}-{list}` (e.g., maven-dev)
Document ID: Email Message-ID header
Data model: See src/mail_mcp/storage/schema.py

Viewing Indexed Data

# List indices
curl "http://localhost:59200/_cat/indices?v"

# Count documents
curl http://localhost:59200/maven-dev/_count

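# Get a specific message (URL-encode the Message-ID and quote the URL,
# otherwise the shell treats < and > as redirections)
curl "http://localhost:59200/maven-dev/_doc/<message-id>"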
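
Message-ID values usually contain characters such as `<`, `>`, and `@`, which must be percent-encoded before they are placed in a URL path. The following sketch builds the document URL in Python. The helper name `doc_url` is hypothetical, and it assumes the document ID is the Message-ID header exactly as it appears in the email, including the angle brackets; check the indexing code if lookups return "not found".

```python
from urllib.parse import quote


def doc_url(message_id: str, index: str = "maven-dev",
            base: str = "http://localhost:59200") -> str:
    """Return the _doc URL for one message, percent-encoding the ID."""
    # safe="" also encodes "/" so the ID stays a single path segment.
    return f"{base}/{index}/_doc/{quote(message_id, safe='')}"


print(doc_url("<abc@example.org>"))
# prints http://localhost:59200/maven-dev/_doc/%3Cabc%40example.org%3E
```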

Data Lifecycle

Initial Setup

# 1. Retrieve data and move it into the data directory
poetry run retrieve-mbox --date 2024-10
mkdir -p data/dev
mv 2024-10.mbox data/dev/

# 2. Start Elasticsearch
docker compose up -d elasticsearch

# 3. Index data (when implemented)
poetry run index-mbox data/dev/2024-10.mbox

Updates

Apache's mail archives are organized by month, and the file for the current month grows as new messages arrive. Retrieve the latest month periodically:

# Get current month
poetry run retrieve-mbox --date $(date +%Y-%m)
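
A periodic job may also want to re-fetch the previous month, in case the last retrieval ran before that month had ended. The sketch below computes both month values; the helper name `months_to_refresh` is hypothetical and the refresh policy is a suggestion, not something the project prescribes.

```python
from datetime import date
from typing import List, Optional


def months_to_refresh(today: Optional[date] = None) -> List[str]:
    """Return the previous and the current month as YYYY-MM strings."""
    today = today or date.today()
    if today.month == 1:
        # January rolls back to December of the previous year.
        prev_year, prev_month = today.year - 1, 12
    else:
        prev_year, prev_month = today.year, today.month - 1
    return [
        f"{prev_year:04d}-{prev_month:02d}",
        f"{today.year:04d}-{today.month:02d}",
    ]


if __name__ == "__main__":
    for month in months_to_refresh():
        print(f"poetry run retrieve-mbox --date {month}")
```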

Cleanup

Remove local mbox files

rm -rf data/

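Clear Elasticsearch data (this stops the containers and removes the volumes defined in the Compose file, so all indexed data is deleted)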

docker compose down -v

Data Size Considerations

Per-month size: Typically 1-5 MB per mbox file
Total archive: ~1.5 GB for dev@ and users@ lists (22+ years each)
Elasticsearch index: Approximately 2-3x the mbox size (depends on analysis settings)
Recommended disk space: ~2 GB for mbox files, ~5 GB including Elasticsearch indices

Because the data/ directory is not tracked in Git, continuous integration jobs and new development environments start without any data. Retrieve only the months needed for testing.
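
The size figures above can be sanity-checked with a little arithmetic. The sketch below uses only numbers stated in this document (the start months, the last month shown in the directory listing, the per-list sizes, and the 2-3x index factor); the helper name `months_inclusive` is hypothetical.

```python
def months_inclusive(start: str, end: str) -> int:
    """Number of monthly files between two YYYY-MM values, both inclusive."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# 2025-12 is the last file shown in the directory listing.
dev_files = months_inclusive("2002-07", "2025-12")
users_files = months_inclusive("2002-11", "2025-12")

avg_dev_mb = 750 / dev_files        # average MB per dev@ file
avg_users_mb = 800 / users_files    # average MB per users@ file

mbox_gb = 1.5
index_low_gb, index_high_gb = 2 * mbox_gb, 3 * mbox_gb  # 2-3x the mbox size

print(f"dev@: {dev_files} files, about {avg_dev_mb:.1f} MB each")
print(f"users@: {users_files} files, about {avg_users_mb:.1f} MB each")
print(f"mbox + index: {mbox_gb + index_low_gb:.1f} "
      f"to {mbox_gb + index_high_gb:.1f} GB")
```

Both averages fall inside the stated 1-5 MB range. The combined estimate works out to 4.5-6 GB, so the ~5 GB recommendation sits toward the lower end; leaving some headroom is reasonable.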

Multiple Mailing Lists

The system indexes the following Apache Maven mailing lists:

dev@maven.apache.org: Main development discussions (~750 MB, from July 2002)
users@maven.apache.org: User questions and support (~800 MB, from November 2002)

Additional lists (for example issues@, commits@, or announce@) can be added by configuring MAIL_MCP_MAILING_LISTS.

Each list is stored in its own subdirectory and indexed separately:

data/
├── dev/      # → Elasticsearch index: maven-dev
└── users/    # → Elasticsearch index: maven-users
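
The mapping from a list address to its index name can be sketched as follows. This assumes that the `{list}` part of the index name is the local part of the address (the text before `@`), which matches the maven-dev and maven-users examples above; the helper name `index_name` and the default prefix are hypothetical.

```python
def index_name(list_address: str, prefix: str = "maven") -> str:
    """Derive an index name such as maven-dev from dev@maven.apache.org."""
    local_part = list_address.split("@", 1)[0]
    # Elasticsearch index names must be lowercase.
    return f"{prefix}-{local_part}".lower()


for address in ("dev@maven.apache.org", "users@maven.apache.org"):
    print(address, "->", index_name(address))
```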

Backup and Restore

Backup mbox Files

# Create tar archive
tar -czf maven-mbox-backup-$(date +%Y%m%d).tar.gz data/

# Or sync to remote storage
rsync -avz data/ backup-server:/backups/maven-mcp/

Restore from Backup

# Extract tar archive
tar -xzf maven-mbox-backup-YYYYMMDD.tar.gz

# Or sync from remote storage
rsync -avz backup-server:/backups/maven-mcp/ data/
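
Where tar or rsync is not available, the same archive can be produced with Python's standard `tarfile` module. This is a minimal sketch: the function names `create_backup` and `list_backup` are hypothetical, and only the archive naming pattern is taken from the tar command above.

```python
import tarfile
from datetime import date
from pathlib import Path
from typing import List, Optional


def create_backup(data_dir: str = "data", dest_dir: str = ".",
                  today: Optional[date] = None) -> Path:
    """Write maven-mbox-backup-YYYYMMDD.tar.gz containing `data_dir`."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    archive = Path(dest_dir) / f"maven-mbox-backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative, e.g. data/dev/2024-10.mbox.
        tar.add(data_dir, arcname=Path(data_dir).name)
    return archive


def list_backup(archive: Path) -> List[str]:
    """Return the sorted member names of a backup archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `list_backup` on a fresh archive is a cheap way to confirm that the expected mbox files are present before removing local copies.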

Elasticsearch Backup

See Elasticsearch documentation for snapshot/restore procedures: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html