ADR-0001: Storage and Access Strategy for Mailing List Data
Status
Accepted
Decision Date: 2025-01-16
Key Decision: Elasticsearch selected as primary storage backend for initial implementation.
Context
The mail-mcp project needs to make Apache Maven mailing list archives accessible to LLMs via MCP (Model Context Protocol). The data consists of 20+ years of email discussions (~560 monthly mbox files, ~1.5GB) from Apache Maven mailing lists (dev@ and users@).
Data Characteristics
- Volume: ~1.5GB currently (dev@ ~750MB, users@ ~800MB), growing monthly
- Format: mbox (standard Unix mailbox format)
- Structure: Email messages with threads (In-Reply-To, References headers)
- Source: Apache Ponymail Foal REST API (https://lists.apache.org/api/mbox.lua)
- Update frequency: New data arrives continuously (daily activity)
Primary Use Case
Enable LLMs to research Maven development discussion history and track decisions across time.
Core scenarios:
- Decision research
  - "Has topic X been discussed?"
  - "Was a decision made about feature Y?"
  - "What alternatives were considered for approach Z?"
- Decision lifecycle tracking
  - Active discussions (no decision yet, alternatives being explored)
  - Decisions reached (consensus documented, alternatives evaluated)
  - Implementation announced (decision executed, release made)
  - Decisions deprecated (no longer relevant, superseded by newer decisions)
- Temporal tracking
  - "When was feature X decided/implemented/released?"
  - "What was discussed in Q4 2023 about topic Y?"
  - "Show decisions made but not yet implemented"
- Cross-reference integration
  - Mail discussion → Jira issue (MAVEN-1234 references)
  - Mail announcement → GitHub release/commit
  - Mail decision → Confluence documentation
  - Mail thread → Code implementation
  - Future: Combine with other MCPs (Jira, GitHub, Confluence, codebase MCPs)
Requirements
- Self-contained: MCP should be deployable as a standalone unit
- Regenerable: Complete system rebuild from source must be possible
- Containerized: Must run in Docker/Kubernetes environments
- Query capabilities:
  - Full-text search across email content
  - Metadata filtering (date, sender, subject, list)
  - Thread reconstruction and navigation
  - Decision indicator extraction (VOTE, consensus, agreed, RESOLVED)
  - External reference detection (JIRA-NNNN, GitHub PR #NNN, release versions)
  - Temporal queries (date ranges, decision timelines)
  - Potentially semantic/vector search for conceptual queries
- Performance: Query responses suitable for LLM interaction (< 5s for typical queries)
- Multiple lists: Support dev@, users@, and other Apache Maven lists
- Cross-MCP compatibility: Design for integration with Jira, GitHub, Confluence MCPs
Decision
We will implement a database-backed architecture using Elasticsearch as the primary storage backend.
Primary Backend: Elasticsearch
Chosen for initial implementation based on use case alignment:
- Full-text search: Core requirement for decision research queries
- Metadata extraction: Excellent support for indexed fields (decision indicators, external references, temporal markers)
- Temporal queries: Native date range filtering and aggregations
- Future extensibility: Built-in vector search support for semantic queries
- Mature ecosystem: Well-documented, widely deployed, strong tooling
- Performance: Handles 1.5GB+ efficiently with room to scale
- Decision tracking: Can model decision lifecycle via status fields and date ranges
- Cross-reference support: Easy to index and query external references (JIRA-NNNN, PR #NNN)
Use case fit:
- ✅ "Has topic X been discussed?" → Full-text search
- ✅ "Was a decision made about Y?" → Metadata filtering (decision_status field)
- ✅ "When was feature Z implemented?" → Temporal queries + external references
- ✅ Thread reconstruction → Application logic using References/In-Reply-To
- ✅ Cross-MCP integration → Store and query external IDs
Alternative Backends (Future Consideration)
Neo4j and SQLite remain documented as alternatives for specific scenarios:
- Neo4j: graph-focused scenarios, if thread navigation becomes the primary use case
- SQLite: truly embedded deployments without container infrastructure
The architecture maintains a storage interface abstraction to allow backend switching if requirements change.
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ MCP Server (mail-mcp) │
│ ┌──────────────────────────────────────────────┐ │
│ │ MCP Tools (search, retrieve, thread, sync) │ │
│ └──────────────────┬───────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Storage Interface (abstract) │ │
│ └──────────────────┬───────────────────────────┘ │
│ ↓ ↓ ↓ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Neo4j │ │ ES │ │ SQLite │ │
│ │ Storage │ │ Storage │ │ Storage │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────┘
↓ (rebuild/sync)
┌─────────────────────────────────────────────────────┐
│ Apache Ponymail API │
│ https://lists.apache.org/api/mbox.lua │
└─────────────────────────────────────────────────────┘
Storage Backend Selection
Primary Option 1: Elasticsearch
Best for: Full-text search, aggregations, general-purpose queries
Rationale:
- Excellent full-text search capabilities (inverted indices, relevance scoring)
- Mature, well-documented, widely deployed
- Supports vector search (for future semantic capabilities)
- Handles 1.5GB+ datasets efficiently
- Good aggregation support (e.g., "top contributors", "activity over time")
Implementation considerations:
- Thread reconstruction via application logic (query by References/In-Reply-To)
- Index structure: one index per mailing list; documents = email messages
- Metadata fields: from, to, subject, date, list, thread_id, message_id, in_reply_to
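As a sketch, the index structure above might translate into a mapping like the following (field names and types are illustrative, not a finalized schema):

```python
# Sketch of a possible Elasticsearch index mapping for one mailing list.
# Field names follow the metadata list above; all names are illustrative.
EMAIL_MAPPING = {
    "mappings": {
        "properties": {
            "message_id":  {"type": "keyword"},   # unique email Message-ID
            "in_reply_to": {"type": "keyword"},   # parent message for threading
            "thread_id":   {"type": "keyword"},   # root of the thread
            "list":        {"type": "keyword"},   # e.g. dev@maven.apache.org
            "from":        {"type": "keyword"},
            "to":          {"type": "keyword"},
            "subject":     {"type": "text"},
            "body":        {"type": "text"},      # full-text searchable content
            "date":        {"type": "date"},      # enables range queries/aggregations
            "decision_status":     {"type": "keyword"},  # e.g. VOTE, RESOLVED
            "external_references": {"type": "keyword"},  # e.g. MAVEN-1234
        }
    }
}

# A temporal query ("what was discussed about X in Q4 2023?") then becomes a
# bool query combining a full-text match with a date range filter:
def quarterly_query(topic: str, start: str, end: str) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": topic}}],
                "filter": [{"range": {"date": {"gte": start, "lte": end}}}],
            }
        }
    }
```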
Primary Option 2: Neo4j
Best for: Thread navigation, relationship queries, graph analysis
Rationale:
- Native graph model perfectly represents email threads
- Cypher queries excel at "show me this discussion tree"
- Can analyze social graphs (who replies to whom)
- Full-text search via built-in indices
- Good for visualizing conversation flows
Implementation considerations:
- Node types: Message, Person, Thread
- Relationships: REPLIES_TO, SENT_BY, PART_OF_THREAD
- Properties on nodes: subject, body, date, list
- Hybrid approach: Neo4j for relationships + ES for full-text (if both needed)
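The node types and relationships above could be expressed in Cypher roughly as follows (a sketch only; parameter names and MERGE usage are illustrative, not a tested schema):

```python
# Illustrative Cypher statements for the proposed Neo4j model, kept as
# strings so they can be passed to a driver session later.

# Upsert a message node and link it to its sender.
CREATE_MESSAGE = """
MERGE (m:Message {message_id: $message_id})
SET m.subject = $subject, m.body = $body, m.date = $date, m.list = $list
MERGE (p:Person {email: $sender})
MERGE (m)-[:SENT_BY]->(p)
"""

# Connect a reply to its parent using the In-Reply-To header value.
LINK_REPLY = """
MATCH (child:Message {message_id: $message_id})
MATCH (parent:Message {message_id: $in_reply_to})
MERGE (child)-[:REPLIES_TO]->(parent)
"""

# "Show me this discussion tree": walk REPLIES_TO edges up to the root.
GET_THREAD = """
MATCH (root:Message {message_id: $root_id})
MATCH (reply:Message)-[:REPLIES_TO*1..]->(root)
RETURN reply ORDER BY reply.date
"""
```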
Fallback Option: SQLite
Best for: True self-contained deployment, development/testing
Rationale:
- Zero external dependencies (embedded in application)
- No separate container/service required
- Sufficient for moderate query loads
- FTS5 provides acceptable full-text search
Limitations:
- Less powerful full-text search than ES
- Thread queries require recursive CTEs (more complex)
- Not ideal for concurrent access at scale
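A minimal sketch of the FTS5 approach, assuming the bundled SQLite was compiled with FTS5 (true for standard CPython builds): metadata lives in a regular table, body text in a virtual FTS table.

```python
import sqlite3

# In-memory sketch of the SQLite fallback schema (table and column names
# are illustrative, not a finalized design).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
    message_id  TEXT PRIMARY KEY,
    in_reply_to TEXT,
    list        TEXT,
    sender      TEXT,
    subject     TEXT,
    date        TEXT
);
CREATE VIRTUAL TABLE message_fts USING fts5(message_id, subject, body);
""")

conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?, ?)",
    ("<abc@maven>", None, "dev@maven.apache.org", "someone@example.org",
     "[VOTE] Release Maven 3.9.0", "2023-01-15"),
)
conn.execute(
    "INSERT INTO message_fts VALUES (?, ?, ?)",
    ("<abc@maven>", "[VOTE] Release Maven 3.9.0", "Please vote on releasing 3.9.0"),
)

# Full-text search joined back to metadata:
rows = conn.execute("""
    SELECT m.message_id, m.date
    FROM message_fts f JOIN messages m ON m.message_id = f.message_id
    WHERE message_fts MATCH 'vote'
""").fetchall()
print(rows)  # the [VOTE] message matches
```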
Deployment Configuration
Docker Compose (Recommended for Development)
services:
  mail-mcp:
    build: .
    environment:
      STORAGE_BACKEND: elasticsearch  # or: neo4j, sqlite
      ES_URL: http://elasticsearch:9200
      NEO4J_URI: bolt://neo4j:7687
    volumes:
      - ./data:/app/data  # mbox cache
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      discovery.type: single-node
      xpack.security.enabled: false
    volumes:
      - es-data:/usr/share/elasticsearch/data
  neo4j:
    image: neo4j:5
    environment:
      NEO4J_AUTH: none
    volumes:
      - neo4j-data:/data
volumes:
  es-data:
  neo4j-data:
Data Lifecycle Operations
Initial Population
# On first deployment or after reset
mcp-mail sync --initial \
  --list dev@maven.apache.org \
  --from 2002-11 \
  --to 2025-01
- Download mbox files via retrieve-mbox.groovy (reuse existing script)
- Parse mbox files and extract messages
- Extract metadata and indicators (see Metadata Extraction below)
- Build thread relationships (References/In-Reply-To headers)
- Index into selected storage backend
- Track sync status (last indexed date per list)
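The parse step can be sketched with Python's standard-library mailbox module (the sample mbox content below is fabricated for illustration):

```python
import mailbox
import os
import tempfile

# A fabricated two-message mbox: a [VOTE] and a reply linked via In-Reply-To.
MBOX_SAMPLE = """\
From dev@maven.apache.org Thu Jan  1 00:00:00 2004
From: alice@example.org
Subject: [VOTE] Release Maven 3.9.0
Message-ID: <one@maven>
Date: Thu, 01 Jan 2004 00:00:00 +0000

Please vote.

From dev@maven.apache.org Thu Jan  1 01:00:00 2004
From: bob@example.org
Subject: Re: [VOTE] Release Maven 3.9.0
Message-ID: <two@maven>
In-Reply-To: <one@maven>
Date: Thu, 01 Jan 2004 01:00:00 +0000

+1

"""

def parse_mbox(path):
    """Yield (message_id, in_reply_to, subject, body) from an mbox file."""
    for msg in mailbox.mbox(path):
        yield (
            msg.get("Message-ID"),
            msg.get("In-Reply-To"),  # threading link; None for thread roots
            msg.get("Subject"),
            msg.get_payload(),
        )

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "2004-01.mbox")
    with open(path, "w") as f:
        f.write(MBOX_SAMPLE)
    messages = list(parse_mbox(path))

print(len(messages))  # 2
```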
Metadata Extraction
To support decision tracking and cross-reference queries, the indexing pipeline must extract:
Decision indicators:
- Vote markers: [VOTE], [RESULT], +1, -1, +0
- Decision keywords: "decided", "consensus", "agreed", "RESOLVED", "WONTFIX"
- Action items: "TODO", "ACTION:", "implemented in"
- Status markers: "CLOSED", "REOPENED", "deprecated"
External references:
- Jira issues: MAVEN-1234, MNG-5678
- GitHub: PR references #123, commit SHAs
- Confluence: Wiki page URLs
- Release versions: 4.0.0, maven-3.9.0
- CVE references: CVE-2023-1234
Temporal markers:
- Decision dates (extract from "decided on YYYY-MM-DD")
- Release dates (from announcement subjects)
- Milestone references ("for 4.0 release")
Implementation approach:
- Regex patterns for structured extraction
- Store as indexed fields for filtering
- Link references to enable cross-MCP queries
- Future: NER (Named Entity Recognition) for more sophisticated extraction
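A minimal sketch of the regex-based extraction; the patterns are illustrative starting points and would need tuning against real archive data:

```python
import re

# One pattern per indicator category from the lists above (illustrative).
PATTERNS = {
    "vote_markers": re.compile(r"\[(?:VOTE|RESULT)\]|(?<![\w+-])[+-]1\b"),
    "jira_issues":  re.compile(r"\b(?:MAVEN|MNG)-\d+\b"),
    "github_prs":   re.compile(r"PR\s*#\d+\b"),
    "cve_refs":     re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "release_versions": re.compile(r"\bmaven-\d+\.\d+(?:\.\d+)?\b|\b\d+\.\d+\.\d+\b"),
}

def extract_indicators(text: str) -> dict:
    """Return matches per category, deduplicated, first-seen order preserved."""
    return {
        name: list(dict.fromkeys(pattern.findall(text)))
        for name, pattern in PATTERNS.items()
    }

sample = "[VOTE] Release maven-3.9.0: fixes MAVEN-1234 and CVE-2023-1234. +1 from me."
print(extract_indicators(sample))
```

These results would be stored in the indexed fields (decision_status, external_references) so they can be filtered without re-scanning message bodies.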
Query Interface (MCP Tools)
The MCP will expose tools such as:
- search_emails: Full-text search with filters (date, sender, list)
- get_thread: Retrieve complete email thread by message ID
- get_message: Retrieve single email by ID
- list_threads: Browse recent/active threads
- sync_list: Trigger manual sync of a mailing list
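As an illustration, a tool such as search_emails might be advertised to MCP clients with a JSON Schema along these lines (the parameter names are assumptions, not a finalized interface):

```python
# Hypothetical tool declaration in the shape MCP servers use to advertise
# tools: a name, a description, and a JSON Schema for the input.
SEARCH_EMAILS_TOOL = {
    "name": "search_emails",
    "description": "Full-text search across mailing list archives with optional filters.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query":     {"type": "string", "description": "Full-text query"},
            "list":      {"type": "string", "description": "e.g. dev@maven.apache.org"},
            "sender":    {"type": "string"},
            "date_from": {"type": "string", "format": "date"},
            "date_to":   {"type": "string", "format": "date"},
            "limit":     {"type": "integer", "default": 20},
        },
        "required": ["query"],
    },
}
```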
Consequences
Positive
- Flexible deployment: ES for search-heavy, Neo4j for thread-heavy, SQLite for embedded
- Regenerable: Can rebuild entire system from Apache API
- Performant: Database acceleration enables fast queries on 20+ years of data
- Scalable: ES/Neo4j handle growth beyond current 1.5GB
- Container-native: Docker Compose for dev, K8s for production
- Thread-aware: Proper representation of email discussions as graphs/relationships
- Multiple backends: Can choose optimal storage per deployment scenario
Negative
- Infrastructure complexity: Requires running an ES or Neo4j container
- Resource overhead: ES/Neo4j consume significant RAM (1GB+ each)
- Sync latency: Initial population takes time (parse 1.5GB, index)
- Storage backend abstraction: More code to maintain multiple implementations
- Database management: Backup/restore, monitoring, tuning
Open Questions
- Vector search: Should we plan for semantic search from the start?
  - Context: Primary use case involves conceptual queries ("How do they handle X?", "Find discussions about approach Y") which benefit from semantic search
  - Arguments for early implementation:
    - Conceptual/semantic queries are core to the use case
    - Elasticsearch supports vector fields natively
    - Enables "similar discussions" queries
    - Better matches LLM query patterns
  - Arguments for deferring:
    - Adds complexity (embedding generation, vector storage, similarity tuning)
    - Keyword + metadata search may be sufficient initially
    - Can be added incrementally (re-index with embeddings later)
    - Focus on core functionality first
  - Recommendation: Defer but design for it
    - Start with keyword/metadata search
    - Design schema with an embedding field placeholder
    - Re-evaluate after testing with real queries
    - Consider when: core search working, query patterns observed, LLM integration established
- Multi-list strategy: One database for all lists, or separate instances?
  - Recommendation: Single database, separate indices/namespaces per list
- Thread algorithm: How to handle malformed/missing References headers?
  - Recommendation: Fall back to subject-line matching with fuzzy logic
- Update strategy: Push (webhook) or pull (periodic sync)?
  - Recommendation: Pull initially (cron/periodic), push as enhancement
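The subject-line fallback for threading can be sketched as a normalization step that strips reply/forward prefixes so replies hash to the same thread key as the original message (the prefix list is illustrative):

```python
import re

# Common reply/forward prefixes, including counted variants like "Re[2]:"
# (the set of prefixes here is an assumption, not an exhaustive list).
REPLY_PREFIX = re.compile(r"^\s*(?:re|fwd?|aw)\s*(?:\[\d+\])?\s*:\s*", re.IGNORECASE)

def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes repeatedly, collapse whitespace, lowercase."""
    prev = None
    while prev != subject:
        prev = subject
        subject = REPLY_PREFIX.sub("", subject)
    return re.sub(r"\s+", " ", subject).strip().lower()

print(normalize_subject("Re: Re[2]: [VOTE]  Release Maven 3.9.0"))
# "[vote] release maven 3.9.0"
```

Messages whose normalized subjects match (within some date window) could then be grouped into one thread when header-based linking fails.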
- Citation/Quote Handling: How should we handle quoted content in email replies?
  - Problem: Email threads often contain quote pyramids where previous messages are repeatedly quoted, leading to:
    - Storage bloat (same text repeated across multiple emails)
    - Context pollution (LLMs see duplicate content, wasting tokens)
    - Search relevance issues (same content appears in multiple results)
    - Attribution confusion (difficulty determining who said what originally)
  - Potential approaches:
    - Store both full text and "effective content" (new content only)
    - Detect quoted sections using standard markers (>, |, attribution lines)
    - Maintain a citation graph showing which messages quote which
    - Filter quoted content by default in LLM retrieval
    - Optionally expose full content when needed for verification
  - Storage format considerations:
    - body_full: complete email including all quotes
    - body_effective: only new content contributed by this message
    - quotes: array of references to quoted messages with snippets
    - Index both separately for different search scenarios
  - Detection challenges:
    - Standard quote markers (>, |, "On [date] wrote:")
    - Code blocks that may contain > characters
    - Nested quotes from long threads
    - Top-posting vs. inline reply styles
    - Non-standard email clients
  - Benefits of filtering:
    - 30-50% reduction in indexed content (estimated)
    - Better token efficiency for LLM queries
    - Improved search result relevance
    - Clearer attribution and thread reconstruction
  - Implementation options:
    - Regex-based quote detection (simpler, faster)
    - ML-based detection (more accurate, more complex)
    - Library-based parsing (e.g., email-reply-parser)
  - Recommendation: Implement dual storage (body_full + body_effective) with quote detection during indexing. Default LLM queries use body_effective; full text remains accessible for verification. Defer the choice of detection method (regex vs. ML) until the implementation phase; it deserves thorough testing with real Maven mailing list data to tune detection accuracy.
  - Future consideration: may warrant a separate ADR-0002 once implementation details are resolved
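The regex-based detection option can be sketched as follows; this handles only '>'-prefixed lines and one common attribution pattern, and would need the tuning discussed above:

```python
import re

# One common attribution line shape; real clients vary widely (assumption).
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")

def effective_body(body_full: str) -> str:
    """Derive body_effective from body_full by dropping quoted lines
    ('>'-prefixed) and 'On ... wrote:' attribution lines."""
    kept = []
    for line in body_full.splitlines():
        if line.lstrip().startswith(">"):
            continue
        if ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

body = """I agree, let's target 4.0.

On Mon, Jan 15, 2024 at 10:00 AM Alice wrote:
> Should we defer this to the 4.0 release?
> It touches the resolver internals.
"""
print(effective_body(body))  # "I agree, let's target 4.0."
```

Note the code-block caveat from the challenges list: a '>' inside a fenced code sample would be wrongly stripped by this sketch, which is one reason library- or ML-based detection may be worth the extra complexity.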
- Cross-posting and multi-list deduplication: How should we handle messages sent to multiple lists simultaneously?
  - Context: Starting with dev@maven.apache.org only, but eventually will expand to users@, announce@, etc.
  - Problem: Messages are often cross-posted to multiple Apache mailing lists:
    - Announcements sent to dev@, users@, announce@
    - Important discussions cross-posted between dev@ and users@
    - Same Message-ID delivered to multiple lists
    - Currently would result in duplicate storage and indexing
  - Storage implications:
    - Without deduplication: same message stored N times (once per list)
    - With deduplication: store once, track which lists received it
    - Estimated 10-20% storage reduction for cross-posted messages
  - Proposed approach (for future consideration):
    - Store one message per unique Message-ID
    - Track lists as an array: lists: ["dev@", "users@", "announce@"]
    - Preserve list-specific metadata (List-Id, archive URLs, received dates)
    - Filter by list during queries: lists CONTAINS "dev@maven.apache.org"
  - Data model example:
    - Elasticsearch: {message_id: "…", lists: […], list_metadata: {…}}
    - Neo4j: (msg:Message)-[:SENT_TO]->(list:MailingList)
  - Benefits:
    - Storage and indexing efficiency
    - Consistency across list views
    - Cross-list analytics capability ("messages on both dev@ and users@")
    - Thread integrity across lists
  - Challenges:
    - List-specific headers differ per delivery
    - Threading context may differ per list
    - Detection requires matching Message-ID across mbox files
  - Implementation notes:
    - During indexing, check whether the Message-ID already exists before creating a new entry
    - If it exists, append the list to the existing message's lists array
    - Store list-specific metadata separately per list
  - Recommendation: Defer until expanding beyond the dev@ list. The initial implementation focuses on dev@maven.apache.org only. Design the storage schema to accommodate a lists array for future expansion; revisit when adding a second mailing list (users@ or announce@).
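The Message-ID deduplication check from the implementation notes can be sketched with an in-memory dict standing in for the storage backend (the record layout mirrors the Elasticsearch data model example above; field names are illustrative):

```python
# Store each unique Message-ID once; cross-posts only extend the lists array
# and add per-list metadata, never a duplicate record.
def index_message(store: dict, message_id: str, list_name: str, body: str,
                  list_metadata: dict) -> None:
    existing = store.get(message_id)
    if existing is None:
        # First delivery seen for this Message-ID: create the record.
        store[message_id] = {
            "body": body,
            "lists": [list_name],
            "list_metadata": {list_name: list_metadata},
        }
    else:
        # Cross-post: record the additional list, keep the body once.
        if list_name not in existing["lists"]:
            existing["lists"].append(list_name)
        existing["list_metadata"][list_name] = list_metadata

store = {}
index_message(store, "<one@maven>", "dev@maven.apache.org",
              "Release plan", {"list_id": "dev.maven.apache.org"})
index_message(store, "<one@maven>", "users@maven.apache.org",
              "Release plan", {"list_id": "users.maven.apache.org"})
print(store["<one@maven>"]["lists"])
# ['dev@maven.apache.org', 'users@maven.apache.org']
```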