Data Management
Overview
The mail-mcp project works with mailing list archives stored in mbox format. Data files are not tracked in Git as they can be regenerated at any time.
Data Directory Structure
data/
├── dev/ # Maven dev@ mailing list
│ ├── 2002-07.mbox
│ ├── ...
│ └── 2025-12.mbox
└── users/ # Maven users@ mailing list
    ├── 2002-11.mbox
    ├── ...
    └── 2025-12.mbox
mbox Files
- Format: Standard Unix mbox format
- Content: ASCII/SGML text with email messages
- Size: ~1.5 GB total for the dev@ and users@ lists combined (~750 MB for dev@, ~800 MB for users@)
- Naming: YYYY-MM.mbox (e.g., 2024-10.mbox)
- Period: dev@ from July 2002, users@ from November 2002
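The naming convention and file format above can be checked with the Python standard library. The following is a minimal sketch, not part of the project: the helper names `parse_mbox_name` and `count_messages` are hypothetical, and the regular expression is an assumed reading of the YYYY-MM.mbox convention.

```python
import mailbox
import re
from pathlib import Path

# Assumed pattern for the YYYY-MM.mbox naming convention described above.
MBOX_NAME = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])\.mbox$")


def parse_mbox_name(filename: str) -> tuple:
    """Return (year, month) for a file name such as '2024-10.mbox'."""
    match = MBOX_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a YYYY-MM.mbox file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def count_messages(path: Path) -> int:
    """Count the messages in one mbox file using the stdlib mailbox module."""
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(str(path), create=False)
    try:
        return len(box)
    finally:
        box.close()
```

For example, `count_messages(Path("data/dev/2024-10.mbox"))` would report how many messages that month's archive holds, assuming the file has already been retrieved.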
|
Data files are excluded from Git ( |
Retrieving Data
Single Month
poetry run retrieve-mbox --date 2024-10
This downloads the mbox file from Apache’s mail archive API and saves it as 2024-10.mbox in the current directory.
Specific Mailing List
poetry run retrieve-mbox --date 2024-10 --list users@maven.apache.org
The default list is dev@maven.apache.org.
Bulk Retrieval
To download multiple months:
# Download all months for 2024
for month in {01..12}; do
  poetry run retrieve-mbox --date "2024-$month"
done
# Download a specific range of years
for year in {2023..2024}; do
  for month in {01..12}; do
    poetry run retrieve-mbox --date "$year-$month"
  done
done
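The shell loops above always cover whole years. When a range starts or ends mid-year (for example, the dev@ archive begins in July 2002), generating the YYYY-MM values explicitly may be easier. The sketch below is illustrative only; `month_range` is a hypothetical helper, and the printed command simply reuses the `retrieve-mbox --date` invocation shown earlier.

```python
from datetime import date


def month_range(start: str, end: str) -> list:
    """Return every YYYY-MM value from start to end, inclusive."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    # Constructing real dates rejects invalid months such as 2024-13.
    first = date(start_year, start_month, 1)
    last = date(end_year, end_month, 1)
    if first > last:
        raise ValueError("start must not be after end")
    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(f"{year:04d}-{month:02d}")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


if __name__ == "__main__":
    # Print one retrieval command per month instead of running it.
    for value in month_range("2023-11", "2024-02"):
        print(f"poetry run retrieve-mbox --date {value}")
```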
|
Move downloaded files to the appropriate
|
Data Source
- API: Apache Ponymail API
- Endpoint
- Parameters: list (mailing list address), date (YYYY-MM)
- Format: Returns mbox file content
Example:
curl "https://lists.apache.org/api/mbox.lua?list=dev@maven.apache.org&date=2024-10" \
-o 2024-10.mbox
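The same request URL can be assembled in Python. This is a small sketch that only builds the URL; the endpoint and parameter names are copied from the curl example above, while the function name `build_mbox_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
MBOX_ENDPOINT = "https://lists.apache.org/api/mbox.lua"


def build_mbox_url(date: str, mailing_list: str = "dev@maven.apache.org") -> str:
    """Build the download URL for one month of one mailing list."""
    # safe="@" keeps the address readable, matching the curl example.
    query = urlencode({"list": mailing_list, "date": date}, safe="@")
    return f"{MBOX_ENDPOINT}?{query}"
```

Calling `build_mbox_url("2024-10")` yields the URL used in the curl command; passing `mailing_list="users@maven.apache.org"` targets the users@ list instead.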
Data Processing Pipeline
1. Retrieve: Download mbox files from Apache API
2. Parse: Extract individual email messages
3. Extract: Extract metadata (JIRA refs, decisions, etc.)
4. Index: Store in Elasticsearch for searching
5. Query: Search and retrieve via MCP tools
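To make the Parse and Extract steps more concrete, here is a rough sketch using only the standard library. It is not the project's implementation: `extract_metadata` is a hypothetical function, and the JIRA key pattern is an assumption that will also match look-alikes such as "UTF-8", so a real extractor would likely restrict it to known project keys.

```python
import email
import re
from email.message import Message

# Assumed shape of a JIRA issue key, e.g. MNG-1234.
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_metadata(message: Message) -> dict:
    """Pull a few fields out of one parsed email message."""
    # decode=True returns bytes, or None for multipart messages.
    payload = message.get_payload(decode=True) or b""
    body = payload.decode("utf-8", errors="replace")
    text = f"{message.get('Subject', '')}\n{body}"
    return {
        "message_id": message.get("Message-ID"),
        "subject": message.get("Subject"),
        # De-duplicate and sort so the result is stable.
        "jira_refs": sorted(set(JIRA_KEY.findall(text))),
    }


if __name__ == "__main__":
    raw = (
        "Message-ID: <1@example.org>\n"
        "Subject: [MNG-1234] resolver change\n"
        "\n"
        "Related to MRESOLVER-42.\n"
    )
    print(extract_metadata(email.message_from_string(raw)))
```

Multipart messages are not walked in this sketch; only the subject line contributes references for them.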
Storage in Elasticsearch
Once indexed, email data is stored in Elasticsearch.
- Index naming: `{prefix}-{list}` (e.g., maven-dev)
- Document ID: Email Message-ID header
- Data model: See src/mail_mcp/storage/schema.py
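As an illustration of the naming rules above, the sketch below turns one parsed message into a dictionary with `_index`, `_id`, and `_source` keys, the shape commonly used for Elasticsearch bulk actions. The function names are hypothetical, deriving the list name from the local part of the address is an assumption, and the `_source` fields are placeholders; the actual data model is the one in src/mail_mcp/storage/schema.py.

```python
from email.message import Message


def index_name(prefix: str, list_address: str) -> str:
    """Apply the {prefix}-{list} rule, e.g. ('maven', 'dev@...') -> 'maven-dev'."""
    # Assumption: the list name is the part of the address before '@'.
    local_part = list_address.split("@", 1)[0]
    return f"{prefix}-{local_part}"


def to_bulk_action(message: Message, prefix: str, list_address: str) -> dict:
    """Describe one message as an index action keyed by its Message-ID."""
    return {
        "_index": index_name(prefix, list_address),
        "_id": message["Message-ID"],
        # Placeholder fields only; see schema.py for the real model.
        "_source": {"subject": message["Subject"]},
    }
```

Using the Message-ID as the document ID means that indexing the same mbox file twice overwrites documents rather than duplicating them.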
Data Lifecycle
Initial Setup
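# 1. Retrieve data (saved to the current directory)
poetry run retrieve-mbox --date 2024-10
# 2. Move the file into the list's data directory
mkdir -p data/dev
mv 2024-10.mbox data/dev/
# 3. Start Elasticsearch
docker compose up -d elasticsearch
# 4. Index data (when implemented)
poetry run index-mbox data/dev/2024-10.mbox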
Data Size Considerations
- Per-month size: Typically 1-5 MB per mbox file
- Total archive: ~1.5 GB for the dev@ and users@ lists (22+ years each)
- Elasticsearch index: Approximately 2-3x the mbox size (depends on analysis settings)
- Recommended disk space: ~2 GB for mbox files, ~5 GB including Elasticsearch indices
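The figures above can be combined into a quick estimate. This is only back-of-the-envelope arithmetic using the numbers in this section (~1.5 GB of mbox files, an index 2-3x that size); `estimate_disk_gb` is a hypothetical helper.

```python
def estimate_disk_gb(mbox_gb: float, index_factor: float) -> float:
    """Total space: the mbox files plus an index of index_factor times their size."""
    return mbox_gb * (1 + index_factor)


low = estimate_disk_gb(1.5, 2.0)   # 1.5 + 3.0 = 4.5 GB
high = estimate_disk_gb(1.5, 3.0)  # 1.5 + 4.5 = 6.0 GB

if __name__ == "__main__":
    print(f"Estimated total: {low:.1f} to {high:.1f} GB")
```

By this estimate the ~5 GB recommendation sits toward the lower end of the range, so leaving some headroom is sensible if the index factor turns out closer to 3x.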
|
The |
Multiple Mailing Lists
The system indexes the following Apache Maven mailing lists:
- dev@maven.apache.org: Main development discussions (~750 MB, from July 2002)
- users@maven.apache.org: User questions and support (~800 MB, from November 2002)
Additional lists (such as issues@, commits@, or announce@) can be added by configuring MAIL_MCP_MAILING_LISTS.
Each list is stored in its own subdirectory and indexed separately:
data/
├── dev/ # → Elasticsearch index: maven-dev
└── users/ # → Elasticsearch index: maven-users
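The mapping from list address to data directory and index name can be sketched as follows. Note the assumptions: this document does not specify the format of MAIL_MCP_MAILING_LISTS, so a comma-separated list of addresses is assumed here, and the function names, the default value, and the `maven` prefix default are illustrative rather than taken from the project.

```python
import os
from pathlib import Path

# Assumed default: the two lists this document says are indexed.
DEFAULT_LISTS = "dev@maven.apache.org,users@maven.apache.org"


def configured_lists() -> list:
    """Read list addresses from MAIL_MCP_MAILING_LISTS (assumed comma-separated)."""
    raw = os.environ.get("MAIL_MCP_MAILING_LISTS", DEFAULT_LISTS)
    return [item.strip() for item in raw.split(",") if item.strip()]


def layout(list_address: str, data_root: Path = Path("data"), prefix: str = "maven") -> tuple:
    """Return (data directory, index name) for one list address."""
    name = list_address.split("@", 1)[0]
    return data_root / name, f"{prefix}-{name}"


if __name__ == "__main__":
    for address in configured_lists():
        directory, index = layout(address)
        print(f"{address}: {directory} -> {index}")
```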
Backup and Restore
Backup mbox Files
# Create tar archive
tar -czf maven-mbox-backup-$(date +%Y%m%d).tar.gz data/
# Or sync to remote storage
rsync -avz data/ backup-server:/backups/maven-mcp/
Restore from Backup
# Extract tar archive
tar -xzf maven-mbox-backup-YYYYMMDD.tar.gz
# Or sync from remote storage
rsync -avz backup-server:/backups/maven-mcp/ data/
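Where a shell is not available, the tar step can be approximated with Python's standard `tarfile` module. This is a sketch under the same naming scheme as the commands above; `backup_data` is a hypothetical helper, not a project command.

```python
import tarfile
from datetime import date
from pathlib import Path
from typing import Optional


def backup_data(data_dir: Path, dest_dir: Path, today: Optional[date] = None) -> Path:
    """Write data_dir into a date-stamped .tar.gz archive inside dest_dir."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    archive = dest_dir / f"maven-mbox-backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative, so the archive unpacks into data/.
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```

For example, `backup_data(Path("data"), Path("."))` writes an archive equivalent in layout to the one produced by the `tar -czf` command.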
Elasticsearch Backup
See Elasticsearch documentation for snapshot/restore procedures: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html