ADR-0005: Smart Update Strategy for Mailing List Archives
Context
The scheduler runs hourly to keep the mailing list archive index up to date. The original implementation always fetched and re-indexed the current month’s mbox file, regardless of whether new emails had arrived.
This approach had two problems:
- Wasted bandwidth: Fetching ~1-5 MB mbox files hourly even when no new emails arrived
- Month transition race condition: When the month changes (e.g., December → January), late-arriving emails from the previous month could be missed because the scheduler would only fetch the new current month
Decision
Implement a smart update strategy that:
- Uses the Apache Lists stats.lua API to get expected message counts per month
- Compares expected counts with indexed document counts in Elasticsearch
- Only fetches mbox files when counts differ
- Marks past months as "complete" when counts match, skipping them in future runs
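As a rough illustration of the count comparison, the sketch below builds a request body for an Elasticsearch count query restricted to one list and one calendar month. The field names `list_address` and `date`, and both function names, are assumptions made for this example; the real index mapping may differ.

```python
from datetime import date


def month_bounds(year: int, month: int) -> tuple[str, str]:
    """Return ISO dates for the first day of the month and of the next month."""
    start = date(year, month, 1)
    # December rolls over to January of the following year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def build_count_query(list_address: str, year: int, month: int) -> dict:
    """Build a count-query body for one list and one month.

    "list_address" and "date" are hypothetical field names.
    """
    start, end = month_bounds(year, month)
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"list_address": list_address}},
                    # Half-open range [start, end) covers exactly one month.
                    {"range": {"date": {"gte": start, "lt": end}}},
                ]
            }
        }
    }
```

The half-open date range avoids having to compute the last day of each month, and it handles the December → January rollover in one place.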
Algorithm
For each configured mailing list:
1. Fetch stats from https://lists.apache.org/api/stats.lua
→ Get expected message counts for current + previous month
2. For PREVIOUS month:
├─ Is marked complete in metadata? → SKIP
├─ Query Elasticsearch for indexed count
├─ Counts match? → Mark complete, SKIP fetch
└─ Counts differ? → FETCH mbox, index, then mark complete
3. For CURRENT month:
├─ Query Elasticsearch for indexed count
├─ Counts match? → Log "up to date" (never mark complete)
└─ Counts differ? → FETCH mbox and index
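The decision logic above can be sketched as a pure function that takes the expected counts, a callable returning the indexed count, and the set of months already marked complete, and returns which months to fetch. All names here (`plan_update`, `UpdatePlan`, `previous_month`) are hypothetical and not taken from the actual implementation. Treating a month missing from the stats response as zero expected messages is also an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

Month = tuple[int, int]  # (year, month)


def previous_month(today: date) -> Month:
    """Return the month before today's month, handling the January rollover."""
    if today.month == 1:
        return (today.year - 1, 12)
    return (today.year, today.month - 1)


@dataclass
class UpdatePlan:
    fetch: list[Month] = field(default_factory=list)
    mark_complete: list[Month] = field(default_factory=list)


def plan_update(
    today: date,
    expected: dict[Month, int],
    indexed_count: Callable[[Month], int],
    completed: set[Month],
) -> UpdatePlan:
    plan = UpdatePlan()
    prev = previous_month(today)
    cur = (today.year, today.month)

    # Step 2: previous month. Skip entirely (no count query) if already complete.
    if prev not in completed:
        # Assumption: a month absent from the stats means zero messages.
        if indexed_count(prev) != expected.get(prev, 0):
            plan.fetch.append(prev)
        # Marked complete in both branches; when a fetch is planned, the
        # caller should mark the month only after fetching and indexing.
        plan.mark_complete.append(prev)

    # Step 3: current month. Never marked complete, since mail can still arrive.
    if indexed_count(cur) != expected.get(cur, 0):
        plan.fetch.append(cur)
    return plan
```

Keeping the planning step free of network and Elasticsearch calls makes the month-transition case easy to test, for example by passing `date(2025, 1, 3)` and checking that December 2024 is still examined.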
Consequences
Positive
- Bandwidth efficient: Only fetches when new emails exist
- No missed emails: Previous month is always checked until marked complete
- Fast execution: Completed months are skipped entirely (no Elasticsearch count query needed)
- Self-healing: If counts mismatch due to indexing errors, re-fetching occurs automatically
Alternatives Considered
A: Always Fetch Current + Previous Month
Simple brute-force approach: always fetch both months regardless of changes.
- Rejected because:
  - Wastes bandwidth for 29+ days per month (when previous month hasn’t changed)
  - Still requires full mbox download even when no changes
Implementation
New Components
- src/mail_mcp/api/apache_lists.py - Client for Apache Lists stats API (stats.lua endpoint)
References
- Apache Pony Mail - Powers lists.apache.org
- Pony Mail Foal - Current implementation