ADR-0005: Smart Update Strategy for Mailing List Archives

Status

Accepted

Context

The scheduler runs hourly to keep the mailing list archive index up-to-date. The original implementation always fetched and re-indexed the current month’s mbox file, regardless of whether new emails had arrived.

This approach had two problems:

  1. Wasted bandwidth: The scheduler fetched a ~1-5 MB mbox file every hour, even when no new emails had arrived

  2. Month transition race condition: When the month changed (e.g., December → January), late-arriving emails from the previous month could be missed because the scheduler fetched only the new current month

The Race Condition

Consider this scenario:

  1. Scheduler runs at 23:59 on December 31st, fetches December 2025

  2. Month changes to January 2026

  3. Late emails arrive for December 2025 (mail servers have delays)

  4. Next scheduler run fetches January 2026 only

  5. December 2025 emails are permanently missed
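
To make the scenario concrete, the following is a minimal sketch of the month arithmetic involved. The function name `months_to_check` is hypothetical and does not come from the codebase; it only illustrates that a run on January 1st has to consider 2025-12 as well as 2026-01.

```python
from datetime import date


def months_to_check(today: date) -> tuple[str, str]:
    """Return (previous, current) month keys in 'YYYY-MM' form.

    Hypothetical helper, used here only to illustrate the transition.
    """
    current = f"{today.year:04d}-{today.month:02d}"
    if today.month == 1:
        # January rolls back to December of the previous year.
        prev_year, prev_month = today.year - 1, 12
    else:
        prev_year, prev_month = today.year, today.month - 1
    previous = f"{prev_year:04d}-{prev_month:02d}"
    return previous, current


previous, current = months_to_check(date(2026, 1, 1))
print(previous, current)
```

A scheduler that looks only at `current` never revisits `previous`, which is how the emails in step 5 are lost.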

Decision

Implement a smart update strategy that:

  1. Uses the Apache Lists stats.lua API to get expected message counts per month

  2. Compares expected counts with indexed document counts in Elasticsearch

  3. Only fetches mbox files when counts differ

  4. Marks past months as "complete" when counts match, skipping them in future runs

Algorithm

For each configured mailing list:
  1. Fetch stats from https://lists.apache.org/api/stats.lua
     → Get expected message counts for current + previous month

  2. For PREVIOUS month:
     ├─ Is marked complete in metadata? → SKIP
     ├─ Query Elasticsearch for indexed count
     ├─ Counts match? → Mark complete, SKIP fetch
     └─ Counts differ? → FETCH mbox, index, then mark complete

  3. For CURRENT month:
     ├─ Query Elasticsearch for indexed count
     ├─ Counts match? → Log "up to date" (never mark complete)
     └─ Counts differ? → FETCH mbox and index
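
The branching above can be sketched as a single pure function. This is an illustration under assumptions, not the project's code: the name `decide`, its parameters, and the returned action strings are all hypothetical. The lookups are passed as callables so that a month already marked complete is skipped without an Elasticsearch count query.

```python
from typing import Callable


def decide(
    expected: int,
    is_current: bool,
    is_complete: Callable[[], bool],
    count_indexed: Callable[[], int],
) -> str:
    """Return the action for one month (hypothetical action names)."""
    # Only past months can carry a completion marker.
    if not is_current and is_complete():
        return "skip"
    # The count query runs only if the month was not skipped above.
    if count_indexed() == expected:
        # The current month is never marked complete.
        return "up_to_date" if is_current else "mark_complete"
    return "fetch" if is_current else "fetch_and_mark_complete"


# Previous month, no marker yet, 240 of 247 messages indexed.
print(decide(247, False, lambda: False, lambda: 240))
```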

Metadata Storage

Completion markers are stored in a dedicated Elasticsearch index (maven-meta):

{
  "_id": "dev@maven.apache.org:2025-11",
  "list_name": "dev@maven.apache.org",
  "year_month": "2025-11",
  "complete": true,
  "expected_count": 247,
  "checked_at": "2025-12-21T10:00:00Z"
}
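
A marker like the one above could be assembled as follows. This is a sketch: the function name `completion_marker` is hypothetical. It also assumes that the document ID is passed to Elasticsearch separately from the body, since `_id` is document metadata and not a source field.

```python
from datetime import datetime, timezone
from typing import Optional


def completion_marker(
    list_name: str,
    year_month: str,
    expected_count: int,
    now: Optional[datetime] = None,
) -> tuple[str, dict]:
    """Build (document ID, document body) for a completed month."""
    # Normalise to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    doc_id = f"{list_name}:{year_month}"
    body = {
        "list_name": list_name,
        "year_month": year_month,
        "complete": True,
        "expected_count": expected_count,
        "checked_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return doc_id, body


doc_id, body = completion_marker("dev@maven.apache.org", "2025-11", 247)
print(doc_id)
```

Because the ID is derived from the list name and month, writing the marker a second time overwrites the same document and does not create a duplicate.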

Consequences

Positive

  • Bandwidth efficient: Fetches an mbox file only when the expected and indexed counts differ

  • No missed emails: Previous month is always checked until marked complete

  • Fast execution: Completed months are skipped entirely (no Elasticsearch count query needed)

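  • Self-healing: If counts mismatch due to indexing errors, the month is re-fetched automatically on the next run (this applies to the current month and to any past month not yet marked complete)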

Negative

  • Additional API dependency: Relies on the availability of the stats.lua endpoint

  • Slight complexity: More code paths than the simple "always fetch" approach

  • Metadata index: Requires an additional Elasticsearch index for tracking

Neutral

  • The stats API call is lightweight (~10 KB response for the full history)

  • One stats call is made per list per scheduler run (not per month)

Alternatives Considered

A: Always Fetch Current + Previous Month

Simple brute-force approach: always fetch both months regardless of changes.

Rejected because:
  • Wastes bandwidth for 29+ days per month (when the previous month hasn’t changed)

  • Still requires a full mbox download even when nothing has changed

B: Compare Stats Only (No Completion Markers)

Check stats API and compare counts, but don’t persist completion status.

Rejected because:
  • Still requires an Elasticsearch count query for every past month

  • Becomes less efficient as the archive grows (more months to check)

C: HTTP ETag/Last-Modified Headers

Use HTTP conditional requests to check if mbox content changed.

Rejected because:
  • The Apache Lists API doesn’t provide reliable ETag/Last-Modified headers for mbox files

  • Without reliable headers, detecting a change would still require fetching the full file to compare

Implementation

New Components

src/mail_mcp/api/apache_lists.py

Client for Apache Lists stats API (stats.lua endpoint)

Modified Components

src/mail_mcp/storage/elasticsearch.py

Added methods: count_by_month(), mark_month_complete(), is_month_complete()

src/mail_mcp/cli/update_current_month.py

Refactored to use smart update logic with stats comparison
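
As an illustration of what a per-month count might look like, the sketch below builds a request body for the Elasticsearch count API. It is not the body that count_by_month() actually uses. The helper name `month_count_query` and the field names `list_name` and `date` in the message index are assumptions.

```python
def month_count_query(list_name: str, year_month: str) -> dict:
    """Build a count query for one list and one 'YYYY-MM' month.

    Field names "list_name" and "date" are assumed, not confirmed.
    """
    year, month = (int(part) for part in year_month.split("-"))
    # Exclusive upper bound: the first day of the following month.
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"list_name": list_name}},
                    {
                        "range": {
                            "date": {
                                "gte": f"{year:04d}-{month:02d}-01",
                                "lt": f"{next_year:04d}-{next_month:02d}-01",
                            }
                        }
                    },
                ]
            }
        }
    }


print(month_count_query("dev@maven.apache.org", "2025-12"))
```

Using a half-open range (`gte` the first of the month, `lt` the first of the next) avoids having to know how many days the month has.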

API Endpoint Used

GET https://lists.apache.org/api/stats.lua?list=dev&domain=maven.apache.org

Response:
{
  "firstYear": 2002,
  "lastYear": 2025,
  "firstMonth": 7,
  "lastMonth": 12,
  "active_months": {
    "2025-11": 247,
    "2025-12": 84,
    ...
  }
}
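
A minimal sketch of how the request URL could be built and the response read. The helper names `stats_url` and `expected_counts` are hypothetical. Treating a month that is absent from `active_months` as having zero messages is also an assumption. The HTTP call itself is omitted.

```python
from urllib.parse import urlencode

STATS_URL = "https://lists.apache.org/api/stats.lua"


def stats_url(list_address: str) -> str:
    """Build the stats URL from an address like 'dev@maven.apache.org'."""
    local, _, domain = list_address.partition("@")
    return f"{STATS_URL}?{urlencode({'list': local, 'domain': domain})}"


def expected_counts(stats: dict, months: list[str]) -> dict[str, int]:
    """Pick the expected message count for each requested month."""
    active = stats.get("active_months", {})
    # Assumption: a month missing from the response has no messages.
    return {month: int(active.get(month, 0)) for month in months}


sample = {"active_months": {"2025-11": 247, "2025-12": 84}}
print(stats_url("dev@maven.apache.org"))
print(expected_counts(sample, ["2025-11", "2025-12"]))
```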

References