[Epic] Billing and Metering Plugin with Guaranteed Message Delivery
🧭 Type of Feature
Please select the most appropriate category:
- [ ] Enhancement to existing functionality
- [x] New feature or capability
- [ ] New MCP-compliant server
- [x] New component or integration
- [ ] Developer tooling or test improvement
- [ ] Packaging, automation and deployment (ex: pypi, docker, quay.io, kubernetes, terraform)
- [ ] Other (please describe below)
🧭 Epic
Title: Billing and Metering Plugin with Guaranteed Message Delivery
Goal: Provide a plugin framework and reference implementations for delivering transaction metadata to metering and billing systems with guaranteed delivery semantics. This enables independent management of customer wallets, usage tracking, and billing without relying on application logs or observability systems.
Why now:
- Enterprise customers require accurate, auditable usage tracking for billing purposes
- Guaranteed delivery is critical for financial accuracy and compliance
- Current logging and observability systems are not designed for financial record-keeping
- Need to integrate with existing enterprise message queue infrastructure (Kafka, RabbitMQ, etc.)
- Production deployments require separation of concerns between observability and billing
- Supports multi-tenant billing models and pay-per-use scenarios
🙋‍♂️ User Story 1: Transaction Delivery to Billing System
As a: Gateway administrator
I want: Every tool invocation, resource access, and prompt execution to be reliably delivered to our billing system
So that: We can accurately track usage, charge customers, and maintain financial audit trails
✅ Acceptance Criteria
Scenario: Tool invocation generates billing event
Given the billing plugin is enabled and configured
When an authenticated user invokes a tool via the gateway
Then a billing event is created with transaction metadata
And the event is published to the configured message queue
And the gateway confirms message delivery before completing the request
And the billing event includes: user identity, tool name, parameters, timestamp, resource consumption
Scenario: Failed delivery triggers retry with backoff
Given a billing event fails to deliver to the message queue
When the delivery fails due to network issues or queue unavailability
Then the event is queued in a persistent local buffer
And retry attempts occur with exponential backoff
And the gateway continues processing requests (degraded mode)
And administrators receive alerts about delivery failures
And buffered events are delivered when the queue becomes available
Scenario: Idempotent delivery prevents duplicate billing
Given a billing event is successfully delivered to the queue
When a network timeout causes the gateway to retry delivery
Then the message queue deduplicates the event using transaction ID
And the customer is billed only once for the transaction
And the billing system logs the deduplicated attempt for audit
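A consumer-side sketch of this deduplication; the class and field names here are illustrative, not part of any gateway or billing-system API, and a production consumer would back the seen-set with a persistent store (e.g. a database unique key on `transaction_id`):

```python
class IdempotentConsumer:
    """Deduplicates billing events by transaction_id (illustrative sketch)."""

    def __init__(self):
        self._seen = set()           # production: persistent store, not memory
        self.duplicates_logged = []  # audit trail of deduplicated attempts

    def consume(self, event: dict) -> bool:
        """Return True if the event was billed, False if it was a duplicate."""
        txn_id = event["transaction_id"]
        if txn_id in self._seen:
            self.duplicates_logged.append(txn_id)  # log for audit, do not re-bill
            return False
        self._seen.add(txn_id)
        return True

consumer = IdempotentConsumer()
event = {"transaction_id": "txn_abc123xyz789", "cost": {"amount": 0.05}}
assert consumer.consume(event) is True    # first delivery: billed
assert consumer.consume(event) is False   # retry after timeout: deduplicated
```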
🙋‍♂️ User Story 2: Multi-Tenant Usage Tracking
As a: Billing administrator
I want: Detailed usage metadata for each tenant and workspace
So that: I can implement per-tenant quotas, billing tiers, and usage analytics
✅ Acceptance Criteria
Scenario: Billing event includes tenant context
Given a user belongs to tenant "acme-corp" and workspace "engineering"
When the user invokes a tool or accesses a resource
Then the billing event includes tenant_id and workspace_id
And the event includes cost allocation tags
And the billing system can aggregate usage by tenant, workspace, or user
Scenario: Resource consumption metrics included
Given a tool invocation consumes tokens, compute time, or external API calls
When the billing event is generated
Then it includes quantifiable resource metrics
And metrics are categorized by type (tokens, seconds, API calls, data transfer)
And the billing system can apply variable pricing based on resource type
Scenario: Configurable event detail levels
Given different billing use cases require different detail levels
When administrators configure the billing plugin
Then they can select detail levels: minimal, standard, or comprehensive
And minimal includes only user, tool, timestamp, and cost
And comprehensive includes full request/response payloads and all metadata
And detail level can be overridden per tool or resource
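The three detail levels could be implemented as a simple field projection over the full event; the field grouping below is an assumption for illustration, not a finalized mapping:

```python
# Which top-level event fields each detail level includes (hypothetical grouping).
DETAIL_FIELDS = {
    "minimal": {"transaction_id", "tenant", "resource", "timestamp", "cost"},
    "standard": {"transaction_id", "tenant", "resource", "timestamp", "cost",
                 "consumption", "metadata"},
    "comprehensive": None,  # None: include everything, payloads included
}

def filter_event(event: dict, detail_level: str = "standard") -> dict:
    """Project a full billing event down to the configured detail level."""
    fields = DETAIL_FIELDS[detail_level]
    if fields is None:
        return dict(event)
    return {k: v for k, v in event.items() if k in fields}
```

A per-tool override would then just select a different `detail_level` key before calling `filter_event`.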
🙋‍♂️ User Story 3: Message Queue Integration
As a: DevOps engineer
I want: To integrate the gateway with our existing message queue infrastructure
So that: Billing events flow into our established financial systems without custom development
✅ Acceptance Criteria
Scenario: Kafka integration with guaranteed delivery
Given our billing system consumes events from Kafka
When I configure the billing plugin with Kafka connection details
Then billing events are published to the specified Kafka topic
And the plugin uses Kafka's acknowledgment mechanism for delivery confirmation
And the plugin supports SASL/SSL authentication
And the plugin handles Kafka broker failures with automatic failover
Scenario: RabbitMQ integration with persistent queues
Given our billing system uses RabbitMQ for message processing
When I configure the billing plugin with RabbitMQ connection details
Then billing events are published to a durable RabbitMQ queue
And messages are marked as persistent for disk-based durability
And the plugin confirms message persistence before acknowledging
And dead-letter queues capture failed message processing
Scenario: Multiple queue backends supported
Given different deployment environments use different message queues
When I deploy the gateway in different environments
Then I can configure Kafka, RabbitMQ, ZeroMQ, or IBM MQ Series
And the plugin provides a consistent interface across all queue types
And I can switch queue backends via configuration without code changes
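The "consistent interface across all queue types" could be sketched as a small abstract base class with one registered implementation per backend; `QueuePublisher` and the in-memory stand-in below are hypothetical names used only for illustration:

```python
from abc import ABC, abstractmethod

class QueuePublisher(ABC):
    """Consistent publisher interface across queue backends (sketch)."""

    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> bool:
        """Publish and return True only after the broker confirms delivery."""

class InMemoryPublisher(QueuePublisher):
    """Stand-in backend used here for illustration and testing."""
    def __init__(self):
        self.messages = []
    def publish(self, topic: str, payload: bytes) -> bool:
        self.messages.append((topic, payload))
        return True

# Real registry would map: kafka, rabbitmq, zeromq, ibm_mq, webhook
BACKENDS = {"memory": InMemoryPublisher}

def make_publisher(backend: str) -> QueuePublisher:
    """Resolve the configured backend name to a publisher instance."""
    return BACKENDS[backend]()
```

Switching backends then means changing the `backend:` value in configuration, with no code changes in the plugin itself.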
🙋‍♂️ User Story 4: Reliability and Audit Compliance
As a: Compliance officer
I want: Guaranteed, auditable delivery of all billable transactions
So that: We meet financial regulations and can prove accurate billing in audits
✅ Acceptance Criteria
Scenario: Local persistence before external delivery
Given reliability is critical for billing accuracy
When a billing event is generated
Then it is first written to a local persistent store (SQLite, disk log)
And the event is marked as "pending delivery"
And delivery to the external queue is attempted
And only after confirmed delivery is the event marked "delivered"
And the local store is periodically cleaned of delivered events
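The persist-then-deliver flow above can be sketched against SQLite; the table layout and method names are illustrative only:

```python
import sqlite3

class BillingBuffer:
    """Local write-ahead buffer: persist first, deliver later (sketch)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " txn_id TEXT PRIMARY KEY, payload TEXT, status TEXT)"
        )

    def persist(self, txn_id: str, payload: str) -> None:
        """Write the event as 'pending delivery' before any external attempt."""
        self.db.execute(
            "INSERT INTO events VALUES (?, ?, 'pending')", (txn_id, payload))
        self.db.commit()

    def mark_delivered(self, txn_id: str) -> None:
        """Flip status only after the queue broker confirms delivery."""
        self.db.execute(
            "UPDATE events SET status='delivered' WHERE txn_id=?", (txn_id,))
        self.db.commit()

    def pending(self) -> list:
        """Events still awaiting delivery (retried when the queue recovers)."""
        return [r[0] for r in self.db.execute(
            "SELECT txn_id FROM events WHERE status='pending'")]
```

Periodic cleanup would then delete rows with `status='delivered'` older than a retention window.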
Scenario: Transaction IDs enable end-to-end tracking
Given billing events must be traceable across systems
When a tool invocation occurs
Then a unique transaction_id is generated
And the transaction_id is included in the billing event
And the transaction_id is logged in application logs
And the transaction_id is returned to the client in response metadata
And the billing system can correlate events with application activity
Scenario: Delivery metrics and alerting
Given operations teams need visibility into billing delivery health
When the billing plugin is active
Then it exposes Prometheus metrics for delivery rates, failures, and latency
And administrators can set alerts on delivery failure thresholds
And the admin UI shows real-time billing event delivery status
And the system generates daily delivery reports for compliance review
📐 Design Sketch
```mermaid
sequenceDiagram
    participant Client as AI Client
    participant GW as MCP Gateway
    participant Plugin as Billing Plugin
    participant Buffer as Local Buffer<br/>(SQLite/Disk)
    participant Queue as Message Queue<br/>(Kafka/RabbitMQ)
    participant Billing as Billing System

    Note over Client,GW: Request Phase
    Client->>GW: POST /tools/call (invoke_tool)
    GW->>GW: Process tool invocation

    Note over GW,Plugin: Billing Event Generation
    GW->>Plugin: post_request_hook(transaction_metadata)
    Plugin->>Plugin: Generate billing event<br/>(user, tool, cost, metrics)

    Note over Plugin,Buffer: Reliability Layer
    Plugin->>Buffer: Persist event (txn_id, event_data)
    Buffer-->>Plugin: Confirmed persisted

    Note over Plugin,Queue: Guaranteed Delivery
    Plugin->>Queue: Publish event (with txn_id)
    alt Queue Available
        Queue-->>Plugin: Delivery confirmed
        Plugin->>Buffer: Mark event as delivered
    else Queue Unavailable
        Queue--xPlugin: Delivery failed
        Plugin->>Plugin: Schedule retry<br/>(exponential backoff)
        Plugin-->>GW: Degraded mode warning
    end
    Plugin-->>GW: Billing event handled
    GW-->>Client: Tool result (with txn_id)

    Note over Queue,Billing: Billing Processing
    Queue->>Billing: Consume billing event
    Billing->>Billing: Process usage & update wallet
    Billing-->>Queue: Ack message
```
📐 Billing Event Schema
```json
{
  "transaction_id": "txn_abc123xyz789",
  "event_type": "tool_invocation",
  "timestamp": "2025-10-20T14:30:45.123Z",
  "tenant": {
    "tenant_id": "tenant_acme_corp",
    "workspace_id": "workspace_engineering",
    "user_id": "user_john_doe",
    "user_email": "[email protected]"
  },
  "resource": {
    "type": "tool",
    "name": "pdf_converter",
    "server_id": "server_document_processing",
    "gateway_id": "gateway_us_east_1"
  },
  "consumption": {
    "execution_time_ms": 1250,
    "tokens_consumed": 0,
    "api_calls": 1,
    "data_transfer_bytes": 245678
  },
  "cost": {
    "amount": 0.05,
    "currency": "USD",
    "pricing_tier": "standard"
  },
  "metadata": {
    "request_id": "req_456def",
    "client_ip": "203.0.113.42",
    "user_agent": "Claude-Desktop/1.2.3",
    "success": true,
    "error_code": null
  },
  "detail_level": "standard",
  "schema_version": "1.0"
}
```
📐 Plugin Configuration
```yaml
# plugins/config.yaml
plugins:
  - name: billing_delivery
    enabled: true
    priority: 100
    config:
      # Queue Backend Configuration
      backend: kafka  # kafka | rabbitmq | zeromq | ibm_mq | webhook

      # Kafka Configuration
      kafka:
        bootstrap_servers: "kafka-1.example.com:9092,kafka-2.example.com:9092"
        topic: "mcpgateway.billing.events"
        security_protocol: SASL_SSL
        sasl_mechanism: PLAIN
        sasl_username: "${KAFKA_USERNAME}"
        sasl_password: "${KAFKA_PASSWORD}"
        acks: all  # Wait for all in-sync replicas
        compression_type: gzip

      # RabbitMQ Configuration (alternative)
      rabbitmq:
        host: "rabbitmq.example.com"
        port: 5672
        virtual_host: "/billing"
        queue: "mcpgateway.billing.events"
        durable: true
        username: "${RABBITMQ_USERNAME}"
        password: "${RABBITMQ_PASSWORD}"

      # Reliability Settings
      reliability:
        local_buffer_enabled: true
        local_buffer_path: "/var/lib/mcpgateway/billing_buffer.db"
        retry_max_attempts: 10
        retry_initial_delay_ms: 1000
        retry_max_delay_ms: 300000
        retry_exponential_base: 2

      # Event Configuration
      events:
        detail_level: standard  # minimal | standard | comprehensive
        include_request_body: false
        include_response_body: false
        include_headers: false

      # Cost Calculation (optional - can be done in billing system)
      pricing:
        enabled: true
        default_tool_cost: 0.01
        default_resource_cost: 0.005
        default_prompt_cost: 0.001

      # Monitoring
      metrics:
        enabled: true
        prometheus_namespace: mcpgateway_billing
```
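The `${KAFKA_USERNAME}`-style placeholders above imply environment-variable substitution at config load time. A minimal sketch of such expansion, assuming the real plugin loader may handle this differently:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values (sketch;
    unset variables expand to an empty string here)."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), ""), value)

os.environ["KAFKA_USERNAME"] = "billing-svc"
assert expand_env("${KAFKA_USERNAME}") == "billing-svc"
```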
🔗 MCP Standards Check
- [x] Change adheres to current MCP specifications
- [x] No breaking changes to existing MCP-compliant integrations
- [ ] If deviations exist, please describe them below:
Standards Compliance:
- Uses existing plugin framework (ADR-016) for implementation
- Leverages existing plugin hooks (pre/post request/response)
- Does not modify MCP protocol or wire format
- Transparent to MCP clients - billing happens in gateway middleware
- Compatible with all MCP transports (HTTP, SSE, WebSocket, stdio)
Implementation Notes:
- Plugin operates independently of MCP protocol flow
- Billing events are generated from transaction metadata, not MCP messages directly
- Plugin does not interfere with MCP request/response semantics
- Delivery failures do not block or fail client requests (degraded mode)
🔄 Alternatives Considered
- **Use Existing Observability Logs for Billing**
  - Rejected: Logs are not designed for guaranteed delivery
  - Logs may be rotated, truncated, or lost
  - No delivery confirmation or retry mechanisms
  - Observability and financial tracking have different reliability requirements
  - Epic explicitly states: "independent of support/management log delivery"
- **Database-Based Billing Event Storage**
  - Considered: Store billing events in PostgreSQL and poll for export
  - Limitation: Adds latency and complexity to transaction processing
  - Limitation: Database becomes a single point of failure for billing accuracy
  - Decision: Use DB as local buffer only, not primary delivery mechanism
- **Synchronous Billing API Calls**
  - Rejected: Adds latency to every request
  - Billing system unavailability would block gateway operations
  - No natural retry or buffering mechanism
  - Violates separation of concerns
- **Custom Message Protocol (Non-Standard Queues)**
  - Rejected: Enterprises already have message queue infrastructure
  - Custom protocol requires additional integration work
  - Standard queues (Kafka, RabbitMQ) provide proven guarantees
  - Epic explicitly mentions: "kafka, rabbitmq, zeromq, MQSeries"
- **Webhook-Based Event Delivery**
  - Considered: HTTP POST to billing service endpoint
  - Limitation: Requires custom retry logic
  - Limitation: No standard queuing or backpressure handling
  - Decision: Support as an option, but queues are preferred
📓 Additional Context
Independence from Logging and Observability:
The Epic emphasizes: "Because of the required delivery component implicit in billing, this implementation should be independent of support/management log delivery and performance analytics. Reliability is the key component."
This means:
- Separate Infrastructure: Billing events use dedicated message queues, not log aggregators
- Separate Configuration: Billing plugin config is independent of logging config
- Separate Reliability Guarantees: Billing requires stronger guarantees than logs
- Separate Monitoring: Billing delivery metrics separate from application metrics
- Separate Access Controls: Billing data may have different security requirements
Message Queue Selection Rationale:
The Epic lists several queue systems. Selection criteria:
| Queue System | Use Case | Strengths |
|---|---|---|
| Kafka | High-throughput, cloud-native | Distributed, scalable, event streaming |
| RabbitMQ | Traditional enterprise messaging | AMQP standard, flexible routing, durable |
| ZeroMQ | Low-latency, embedded scenarios | Lightweight, no broker, TCP sockets |
| IBM MQ Series | Legacy enterprise systems | Proven reliability, mainframe integration |
Guaranteed Delivery Mechanics:
- Local Persistence: Write-ahead log to SQLite before external delivery
- Acknowledgment: Wait for queue broker confirmation
- Retry with Backoff: Exponential backoff on failure (1s → 2s → 4s → ... → 5min)
- Idempotency: Use transaction IDs to prevent duplicate billing
- Dead Letter Queues: Failed messages after max retries go to DLQ for investigation
- Monitoring: Prometheus metrics expose delivery health
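The retry schedule implied by the reliability settings (1 s initial delay, exponential base 2, 5 min cap, 10 attempts) can be computed as:

```python
def backoff_delays(initial_ms: int = 1000, base: int = 2,
                   max_delay_ms: int = 300_000, max_attempts: int = 10) -> list:
    """Delay (ms) before each retry attempt, capped at max_delay_ms."""
    return [min(initial_ms * base ** n, max_delay_ms) for n in range(max_attempts)]

# 1s -> 2s -> 4s -> 8s -> ... capped at 5 min (300,000 ms)
assert backoff_delays()[:4] == [1000, 2000, 4000, 8000]
assert backoff_delays()[-1] == 300_000
```

In practice a small random jitter is usually added to each delay to avoid retry storms when many events fail at once.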
Cost Calculation Options:
Option 1: Gateway-Side Pricing (Simpler)
- Plugin includes pricing configuration
- Cost calculated at event generation time
- Simpler billing system integration
- Limitation: Pricing changes require gateway config update
Option 2: Billing-System-Side Pricing (Flexible)
- Gateway sends raw usage metrics only
- Billing system applies pricing rules
- Enables dynamic pricing, A/B testing, discounts
- Recommended approach for complex pricing models
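On the billing side, Option 2 reduces to applying per-unit rates to the raw consumption metrics the gateway emits; the rates below are hypothetical placeholders, not proposed pricing:

```python
# Hypothetical per-unit rates applied billing-system-side (Option 2).
RATES = {
    "execution_time_ms":   0.000001,  # per millisecond of compute
    "tokens_consumed":     0.00002,   # per token
    "api_calls":           0.01,      # per external API call
    "data_transfer_bytes": 0.0,       # free in this sketch
}

def price(consumption: dict, rates: dict = RATES) -> float:
    """Apply variable per-resource-type pricing to raw usage metrics."""
    return round(sum(qty * rates.get(metric, 0.0)
                     for metric, qty in consumption.items()), 6)

usage = {"execution_time_ms": 1250, "api_calls": 1}
assert price(usage) == 0.01125  # 1250 * 1e-6 + 1 * 0.01
```

Because rates live entirely in the billing system, dynamic pricing, discounts, and A/B tests require no gateway config change.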
Related Existing Features:
- Plugin framework (ADR-016) - foundation for billing plugin
- Metrics system - separate from billing, used for observability
- RBAC system - user/tenant identity for billing attribution
- Transaction metadata tracking - basis for billing events
Configuration Variables (new):
```bash
# Billing Plugin
MCPGATEWAY_BILLING_PLUGIN_ENABLED=true
MCPGATEWAY_BILLING_BACKEND=kafka
MCPGATEWAY_BILLING_KAFKA_BOOTSTRAP_SERVERS=kafka-1:9092,kafka-2:9092
MCPGATEWAY_BILLING_KAFKA_TOPIC=mcpgateway.billing.events
MCPGATEWAY_BILLING_DETAIL_LEVEL=standard
MCPGATEWAY_BILLING_LOCAL_BUFFER_ENABLED=true
MCPGATEWAY_BILLING_LOCAL_BUFFER_PATH=/var/lib/mcpgateway/billing.db
MCPGATEWAY_BILLING_RETRY_MAX_ATTEMPTS=10
MCPGATEWAY_BILLING_METRICS_ENABLED=true
```
Security Considerations:
- Credential Management: Queue credentials stored in environment variables or vault
- Encryption: TLS/SSL for queue connections
- Access Control: RBAC for billing plugin configuration
- Data Privacy: Configurable event detail levels (exclude sensitive payloads)
- Audit Trail: Billing events themselves serve as audit log
Testing Strategy:
- Unit Tests: Event generation, serialization, local persistence
- Integration Tests: Kafka/RabbitMQ delivery with testcontainers
- Reliability Tests: Simulate queue failures, verify retry and buffering
- Performance Tests: High-volume event generation (10K events/sec)
- Idempotency Tests: Verify deduplication with duplicate transaction IDs
- Compliance Tests: Verify all billable transactions generate events
Implementation Phases:
Phase 1: Core Plugin Framework
- Billing event schema and generation
- Post-request hook integration
- Local SQLite buffer for persistence
- Basic Kafka integration
Phase 2: Reliability Features
- Retry mechanism with exponential backoff
- Delivery confirmation tracking
- Prometheus metrics for monitoring
- Admin UI for delivery status
Phase 3: Multi-Backend Support
- RabbitMQ integration
- ZeroMQ integration
- IBM MQ Series integration
- Webhook fallback option
Phase 4: Advanced Features
- Configurable event detail levels
- Gateway-side pricing (optional)
- Dead letter queue handling
- Multi-tenant billing isolation
Documentation Requirements:
- Plugin installation and configuration guide
- Message queue setup tutorials (Kafka, RabbitMQ)
- Billing event schema reference
- Reliability guarantees and SLAs
- Integration guide for billing systems
- Troubleshooting guide for delivery failures
- Compliance and audit documentation
Compliance Standards:
- SOX (Sarbanes-Oxley): Accurate financial record-keeping
- PCI DSS: If processing payment card data
- GDPR: If billing includes personal data (ensure data minimization)
- ISO 27001: Information security for financial data