Analysis Scripts
This document describes the analysis scripts available in the scripts/ directory.
These scripts help analyze patterns, references, and behavior in Maven mailing list archives.
Overview
The scripts directory contains tools for:
- Analyzing GitHub repository and PR/issue references in emails
- Querying the MCP server programmatically for testing and debugging
- Extracting statistics about mailing list discussions
Prerequisites
Before running any analysis scripts, ensure:
- Elasticsearch is running:
docker compose up -d elasticsearch
- Email data is indexed:
# Index specific months
poetry run index-mbox data/dev/2024-10.mbox
# Or index multiple files
poetry run index-mbox data/dev/*.mbox
- Poetry dependencies are installed:
poetry install
Available Scripts
analyze_github_refs.py
Analyzes GitHub repository and PR/issue references in Maven mailing list emails. This script queries Elasticsearch directly to extract and categorize references.
Purpose
- Identify which Apache Maven GitHub repositories are most frequently discussed
- Extract PR/issue numbers and correlate them with specific repositories
- Find unmatched PR references that may need manual investigation
Usage
# Analyze last 90 days (default)
poetry run python scripts/analyze_github_refs.py
# Analyze last 30 days
poetry run python scripts/analyze_github_refs.py --days 30
# Use custom Elasticsearch URL
poetry run python scripts/analyze_github_refs.py --es-url http://elasticsearch:9200
# Query a different index
poetry run python scripts/analyze_github_refs.py --index maven-users
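Because the script queries Elasticsearch directly, the --days option presumably becomes a date-range filter on the indexed emails. The following is a minimal sketch of such a request body, not the script's actual query: the field names `date`, `subject`, and `body`, the helper name `build_query`, and the `size` default are all assumptions made for illustration.

```python
def build_query(days: int = 90, size: int = 1000) -> dict:
    """Build a search body limited to the last `days` days.

    The field names ("date", "subject", "body") are hypothetical; check the
    index mapping for the real ones.
    """
    return {
        "size": size,
        # Elasticsearch date math: "now-90d/d" means 90 days ago, rounded to the day.
        "query": {"range": {"date": {"gte": f"now-{days}d/d"}}},
        # Fetch only the fields needed for reference extraction.
        "_source": ["subject", "body"],
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_query(days=30), indent=2))
```

A body like this would be sent to the `_search` endpoint of the index selected with --index, at the server given by --es-url.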
Output Example
The script produces output like:
GITHUB REPOSITORIES REFERENCED (Last 90 days)
================================================================================
apache/maven 45 issues/PRs
Examples: #1234, #1235, #1300, #1350, #1400
... and 37 more
apache/maven-compiler-plugin 12 issues/PRs
Examples: #100, #105, #110
... and 9 more
apache/maven-resolver 8 issues/PRs
Examples: #200, #205
Total matched: 65 PR/issue references to specific repositories
Total unmatched: 23 PR/issue numbers (no explicit repo in text)
How It Works
The script uses two patterns to extract GitHub references:
- GitHub URL pattern
  Matches URLs like https://github.com/apache/maven/issues/1234 or https://github.com/apache/maven-compiler-plugin/pull/567.
- Bracket notation
  Matches references like [maven-compiler-plugin#42], which is a common shorthand in Maven discussions.
References that only contain a PR number (like #123) without repository context are counted as "unmatched" since they could refer to any Maven repository.
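The two patterns and the "unmatched" bucket can be sketched with regular expressions. This is an illustration under stated assumptions, not the script's actual code: the expressions, the function name `extract_refs`, and the choice to map bracket shorthand onto the `apache/` organization are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed patterns; the real script's expressions may differ.
URL_RE = re.compile(r"https?://github\.com/(apache/[\w.-]+)/(?:issues|pull)/(\d+)")
BRACKET_RE = re.compile(r"\[([\w.-]+)#(\d+)\]")
BARE_RE = re.compile(r"(?<!\w)#(\d+)\b")


def extract_refs(text):
    """Return (matched, unmatched): repo -> set of numbers, and bare numbers."""
    matched = defaultdict(set)
    for repo, number in URL_RE.findall(text):
        matched[repo].add(int(number))
    for name, number in BRACKET_RE.findall(text):
        # Assumption: bracket shorthand refers to a repository under apache/.
        matched[f"apache/{name}"].add(int(number))
    # Blank out references that already have a repository, so that only
    # context-free "#123" style numbers are left to count as unmatched.
    remainder = BRACKET_RE.sub(" ", URL_RE.sub(" ", text))
    unmatched = {int(number) for number in BARE_RE.findall(remainder)}
    return dict(matched), unmatched


if __name__ == "__main__":
    sample = (
        "See https://github.com/apache/maven/issues/1234 and "
        "[maven-compiler-plugin#42]; also #123."
    )
    print(extract_refs(sample))
```

Collecting numbers into sets means a PR mentioned several times in one thread is counted once per repository.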
query_via_mcp.py
Queries the Maven mailing list archives via the MCP server. This script also demonstrates how to use the MCP client programmatically.
Purpose
- Test MCP server tools without a full LLM integration
- Debug search results and verify tool behavior
- Provide examples for MCP client usage
Usage
# Search emails by keyword
poetry run python scripts/query_via_mcp.py search "release 4.0"
# Get a specific message
poetry run python scripts/query_via_mcp.py message "<message-id@example.com>"
# Get thread containing a message
poetry run python scripts/query_via_mcp.py thread "<message-id@example.com>"
# Find emails from a contributor
poetry run python scripts/query_via_mcp.py contributor "john@example.com"
# Find emails mentioning a JIRA issue
poetry run python scripts/query_via_mcp.py jira "MNG-7891"
# Find emails mentioning a GitHub PR
poetry run python scripts/query_via_mcp.py github "1234"
Available Commands
- search <query>
  Full-text search across email subjects and bodies. Supports the --size option to limit results (default: 5).
- message <id>
  Retrieve a specific email message by its Message-ID. Angle brackets are optional.
- thread <id>
  Retrieve the entire email thread containing the specified message. Supports the --max-messages option (default: 20).
- contributor <name>
  Find emails from a specific contributor. Supports partial matching of email addresses and names.
- jira <issue>
  Find emails that reference a JIRA issue (e.g., MNG-7891).
- github <pr>
  Find emails that reference a GitHub PR number.
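Since the message and thread commands accept a Message-ID with or without angle brackets, the input is presumably normalized before lookup. A minimal sketch of that step follows; the helper name `normalize_message_id` is hypothetical, and treating the bracketed form as canonical is an assumption, since the index may store either form.

```python
def normalize_message_id(raw: str) -> str:
    """Return the Message-ID wrapped in exactly one pair of angle brackets.

    Assumption: the bracketed form is the canonical one for lookups.
    """
    core = raw.strip()
    # Drop one surrounding pair of brackets if the caller supplied them.
    if core.startswith("<") and core.endswith(">"):
        core = core[1:-1]
    return f"<{core}>"


if __name__ == "__main__":
    print(normalize_message_id("message-id@example.com"))
    print(normalize_message_id("<message-id@example.com>"))
```

Both calls yield the same string, so either spelling on the command line would resolve to the same message.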
Use Cases
Investigating a JIRA Issue
To understand the discussion history around a JIRA issue:
# Find all emails mentioning the issue
poetry run python scripts/query_via_mcp.py jira "MNG-7891"
# Get the full thread for a specific discussion
poetry run python scripts/query_via_mcp.py thread "<message-id-from-results>"
Understanding Contributor Activity
To see what a specific contributor has been discussing:
poetry run python scripts/query_via_mcp.py contributor "developer@apache.org" --size 20
Analyzing GitHub Activity
To see which Maven repositories are being actively discussed:
# Get overview of repository discussions in last 30 days
poetry run python scripts/analyze_github_refs.py --days 30
# Then find specific discussions about a PR
poetry run python scripts/query_via_mcp.py github "1234"
Debugging MCP Server Behavior
When developing or debugging the MCP server, use the query script to verify tool responses:
# Test search functionality
poetry run python scripts/query_via_mcp.py search "test query"
# Test message retrieval
poetry run python scripts/query_via_mcp.py message "<known-message-id>"
Extending the Scripts
Troubleshooting
Elasticsearch Connection Errors
If scripts fail to connect to Elasticsearch:
# Check if Elasticsearch is running
docker compose ps
# Start Elasticsearch if needed
docker compose up -d elasticsearch
# Verify it's accessible
curl http://localhost:59200
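The same reachability check can be done from Python, which may be handy inside a script before any queries are issued. This is a sketch that uses only the standard library; the function name `es_is_reachable` is hypothetical, and the default URL is taken from the curl command above.

```python
import json
import urllib.request


def es_is_reachable(url: str = "http://localhost:59200", timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an Elasticsearch-style info document."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP error statuses;
        # ValueError covers a response body that is not valid JSON.
        return False
    # The Elasticsearch root endpoint returns a JSON object with a "version" key.
    return isinstance(info, dict) and "version" in info


if __name__ == "__main__":
    if es_is_reachable():
        print("Elasticsearch is reachable")
    else:
        print("Elasticsearch is not reachable; try: docker compose up -d elasticsearch")
```

If the cluster has authentication enabled, the unauthenticated request is rejected with an HTTP error and this check reports False even though the server is up.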