[Epic] Billing and Metering Plugin with Guaranteed Message Delivery

Open jonpspri opened this issue 1 month ago • 0 comments

🧭 Type of Feature

Please select the most appropriate category:

  • [ ] Enhancement to existing functionality
  • [x] New feature or capability
  • [ ] New MCP-compliant server
  • [x] New component or integration
  • [ ] Developer tooling or test improvement
  • [ ] Packaging, automation and deployment (ex: pypi, docker, quay.io, kubernetes, terraform)
  • [ ] Other (please describe below)

🧭 Epic

Title: Billing and Metering Plugin with Guaranteed Message Delivery

Goal: Provide a plugin framework and reference implementations for delivering transaction metadata to metering and billing systems with guaranteed delivery semantics. This enables independent management of customer wallets, usage tracking, and billing without relying on application logs or observability systems.

Why now:

  • Enterprise customers require accurate, auditable usage tracking for billing purposes
  • Guaranteed delivery is critical for financial accuracy and compliance
  • Current logging and observability systems are not designed for financial record-keeping
  • Need to integrate with existing enterprise message queue infrastructure (Kafka, RabbitMQ, etc.)
  • Production deployments require separation of concerns between observability and billing
  • Supports multi-tenant billing models and pay-per-use scenarios

🙋♂️ User Story 1: Transaction Delivery to Billing System

As a: Gateway administrator
I want: Every tool invocation, resource access, and prompt execution to be reliably delivered to our billing system
So that: We can accurately track usage, charge customers, and maintain financial audit trails

✅ Acceptance Criteria

Scenario: Tool invocation generates billing event
  Given the billing plugin is enabled and configured
  When an authenticated user invokes a tool via the gateway
  Then a billing event is created with transaction metadata
  And the event is published to the configured message queue
  And the gateway confirms message delivery, or durable local buffering if the queue is unavailable, before completing the request
  And the billing event includes: user identity, tool name, parameters, timestamp, resource consumption

Scenario: Failed delivery triggers retry with backoff
  Given a billing event fails to deliver to the message queue
  When the delivery fails due to network issues or queue unavailability
  Then the event is queued in a persistent local buffer
  And retry attempts occur with exponential backoff
  And the gateway continues processing requests (degraded mode)
  And administrators receive alerts about delivery failures
  And buffered events are delivered when the queue becomes available

Scenario: Idempotent delivery prevents duplicate billing
  Given a billing event is successfully delivered to the queue
  When a network timeout causes the gateway to retry delivery
  Then the message queue or the billing consumer deduplicates the event using its transaction_id
  And the customer is billed only once for the transaction
  And the billing system logs the deduplicated attempt for audit
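
The idempotency scenario above depends on some component recognizing a repeated transaction_id. The following is a minimal sketch, assuming deduplication happens in the billing consumer and that a uniqueness constraint on transaction_id is acceptable there. The `charges` table and the `open_ledger` and `apply_charge` names are hypothetical and not part of the gateway.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Create a ledger table keyed by transaction_id (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS charges ("
        " transaction_id TEXT PRIMARY KEY,"
        " amount TEXT NOT NULL)"
    )
    return conn


def apply_charge(conn: sqlite3.Connection, transaction_id: str, amount: str) -> bool:
    """Record a charge once; return False if this transaction_id was already billed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO charges (transaction_id, amount) VALUES (?, ?)",
            (transaction_id, amount),
        )
    # rowcount is 1 for a new row and 0 when the insert was ignored as a duplicate
    return cur.rowcount == 1
```

A consumer could log the `False` case as the "deduplicated attempt" that the scenario asks to keep for audit.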

🙋♂️ User Story 2: Multi-Tenant Usage Tracking

As a: Billing administrator
I want: Detailed usage metadata for each tenant and workspace
So that: I can implement per-tenant quotas, billing tiers, and usage analytics

✅ Acceptance Criteria

Scenario: Billing event includes tenant context
  Given a user belongs to tenant "acme-corp" and workspace "engineering"
  When the user invokes a tool or accesses a resource
  Then the billing event includes tenant_id and workspace_id
  And the event includes cost allocation tags
  And the billing system can aggregate usage by tenant, workspace, or user

Scenario: Resource consumption metrics included
  Given a tool invocation consumes tokens, compute time, or external API calls
  When the billing event is generated
  Then it includes quantifiable resource metrics
  And metrics are categorized by type (tokens, seconds, API calls, data transfer)
  And the billing system can apply variable pricing based on resource type

Scenario: Configurable event detail levels
  Given different billing use cases require different detail levels
  When administrators configure the billing plugin
  Then they can select detail levels: minimal, standard, or comprehensive
  And minimal includes only user, tool, timestamp, and cost
  And comprehensive includes full request/response payloads and all metadata
  And detail level can be overridden per tool or resource
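
One way to read the detail-level scenario above is as a filter applied to each event before it is published. This is a sketch under assumptions: the `request_body` and `response_body` keys, the flattened `user_id` and `tool` keys in the minimal form, and keeping transaction_id at the minimal level (so deduplication still works) are illustrative choices, not settled design.

```python
from copy import deepcopy
from typing import Dict, Optional

LEVELS = ("minimal", "standard", "comprehensive")
PAYLOAD_KEYS = ("request_body", "response_body")  # hypothetical field names


def apply_detail_level(
    event: dict,
    default_level: str = "standard",
    overrides: Optional[Dict[str, str]] = None,
) -> dict:
    """Return a copy of the event trimmed to the configured detail level."""
    tool = event["resource"]["name"]
    # A per-tool override wins over the plugin-wide default.
    level = (overrides or {}).get(tool, default_level)
    if level not in LEVELS:
        raise ValueError(f"unknown detail level: {level}")
    if level == "minimal":
        trimmed = {
            "transaction_id": event["transaction_id"],
            "timestamp": event["timestamp"],
            "user_id": event["tenant"]["user_id"],
            "tool": tool,
            "cost": deepcopy(event["cost"]),
        }
    else:
        trimmed = deepcopy(event)
        if level == "standard":
            # Standard keeps metadata but drops full payloads.
            for key in PAYLOAD_KEYS:
                trimmed.pop(key, None)
    trimmed["detail_level"] = level
    return trimmed
```

Working on a copy keeps the original event intact, so the same event could be buffered at full detail and published at a reduced level if that is ever needed.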

🙋♂️ User Story 3: Message Queue Integration

As a: DevOps engineer
I want: To integrate the gateway with our existing message queue infrastructure
So that: Billing events flow into our established financial systems without custom development

✅ Acceptance Criteria

Scenario: Kafka integration with guaranteed delivery
  Given our billing system consumes events from Kafka
  When I configure the billing plugin with Kafka connection details
  Then billing events are published to the specified Kafka topic
  And the plugin uses Kafka's acknowledgment mechanism for delivery confirmation
  And the plugin supports SASL/SSL authentication
  And the plugin handles Kafka broker failures with automatic failover

Scenario: RabbitMQ integration with persistent queues
  Given our billing system uses RabbitMQ for message processing
  When I configure the billing plugin with RabbitMQ connection details
  Then billing events are published to a durable RabbitMQ queue
  And messages are marked as persistent for disk-based durability
  And the plugin confirms message persistence before acknowledging
  And dead-letter queues capture failed message processing

Scenario: Multiple queue backends supported
  Given different deployment environments use different message queues
  When I deploy the gateway in different environments
  Then I can configure Kafka, RabbitMQ, ZeroMQ, or IBM MQ Series
  And the plugin provides a consistent interface across all queue types
  And I can switch queue backends via configuration without code changes
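
The "consistent interface" and "switch via configuration" criteria above could be met with a small backend protocol plus a registry keyed by the `backend` setting. The sketch below is illustrative only: `QueueBackend`, `register_backend`, `create_backend` and the `memory` backend are hypothetical names, and a real Kafka or RabbitMQ backend would wrap the corresponding client library behind the same `publish` method.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class QueueBackend(Protocol):
    """Interface every backend would implement (hypothetical)."""

    def publish(self, transaction_id: str, payload: bytes) -> bool:
        """Return True only once the broker has confirmed the message."""
        ...


BackendFactory = Callable[[dict], QueueBackend]
_BACKENDS: Dict[str, BackendFactory] = {}


def register_backend(name: str) -> Callable[[BackendFactory], BackendFactory]:
    """Register a backend factory under the name used in the plugin config."""
    def decorator(factory: BackendFactory) -> BackendFactory:
        _BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("memory")
class InMemoryBackend:
    """Stand-in backend for this sketch; it only records what was published."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings
        self.sent: List[Tuple[str, bytes]] = []

    def publish(self, transaction_id: str, payload: bytes) -> bool:
        self.sent.append((transaction_id, payload))
        return True


def create_backend(config: dict) -> QueueBackend:
    """Pick the backend named by config["backend"] and pass it its own sub-section."""
    name = config["backend"]
    if name not in _BACKENDS:
        raise ValueError(f"unsupported billing backend: {name}")
    return _BACKENDS[name](config.get(name, {}))
```

With this shape, changing `backend: kafka` to `backend: rabbitmq` selects a different factory without touching the code that generates events.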

🙋♂️ User Story 4: Reliability and Audit Compliance

As a: Compliance officer
I want: Guaranteed, auditable delivery of all billable transactions
So that: We meet financial regulations and can prove accurate billing in audits

✅ Acceptance Criteria

Scenario: Local persistence before external delivery
  Given reliability is critical for billing accuracy
  When a billing event is generated
  Then it is first written to a local persistent store (SQLite, disk log)
  And the event is marked as "pending delivery"
  And delivery to the external queue is attempted
  And only after confirmed delivery is the event marked "delivered"
  And the local store is periodically cleaned of delivered events

Scenario: Transaction IDs enable end-to-end tracking
  Given billing events must be traceable across systems
  When a tool invocation occurs
  Then a unique transaction_id is generated
  And the transaction_id is included in the billing event
  And the transaction_id is logged in application logs
  And the transaction_id is returned to the client in response metadata
  And the billing system can correlate events with application activity

Scenario: Delivery metrics and alerting
  Given operations teams need visibility into billing delivery health
  When the billing plugin is active
  Then it exposes Prometheus metrics for delivery rates, failures, and latency
  And administrators can set alerts on delivery failure thresholds
  And the admin UI shows real-time billing event delivery status
  And the system generates daily delivery reports for compliance review
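
The "Local persistence before external delivery" scenario above is essentially an outbox: write the event locally, attempt delivery, and flip the status only after confirmation. The following is a rough sketch using SQLite from the Python standard library. The `outbox` table layout and the `BillingBuffer` and `record_and_deliver` names are assumptions made for illustration.

```python
import json
import sqlite3
import time
from typing import Callable, List


class BillingBuffer:
    """Local persistent store for billing events (hypothetical layout)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS outbox ("
                " transaction_id TEXT PRIMARY KEY,"
                " payload TEXT NOT NULL,"
                " status TEXT NOT NULL DEFAULT 'pending',"
                " created_at REAL NOT NULL)"
            )

    def persist(self, event: dict) -> None:
        """Write the event as pending; re-persisting the same ID is a no-op."""
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO outbox (transaction_id, payload, created_at)"
                " VALUES (?, ?, ?)",
                (event["transaction_id"], json.dumps(event), time.time()),
            )

    def pending(self) -> List[dict]:
        rows = self.conn.execute(
            "SELECT payload FROM outbox WHERE status = 'pending' ORDER BY created_at"
        ).fetchall()
        return [json.loads(row[0]) for row in rows]

    def mark_delivered(self, transaction_id: str) -> None:
        with self.conn:
            self.conn.execute(
                "UPDATE outbox SET status = 'delivered' WHERE transaction_id = ?",
                (transaction_id,),
            )

    def purge_delivered(self) -> int:
        """Periodic cleanup; returns the number of rows removed."""
        with self.conn:
            cur = self.conn.execute("DELETE FROM outbox WHERE status = 'delivered'")
        return cur.rowcount


def record_and_deliver(buffer: BillingBuffer, event: dict,
                       publish: Callable[[dict], bool]) -> bool:
    """Persist first, then attempt delivery; mark delivered only on confirmation."""
    buffer.persist(event)
    try:
        confirmed = bool(publish(event))
    except Exception:
        confirmed = False  # event stays pending for a later retry
    if confirmed:
        buffer.mark_delivered(event["transaction_id"])
    return confirmed
```

Anything still returned by `pending()` after a restart is a candidate for redelivery, which is why the consumer side needs to tolerate duplicates.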

📐 Design Sketch

sequenceDiagram
    participant Client as AI Client
    participant GW as MCP Gateway
    participant Plugin as Billing Plugin
    participant Buffer as Local Buffer<br/>(SQLite/Disk)
    participant Queue as Message Queue<br/>(Kafka/RabbitMQ)
    participant Billing as Billing System

    Note over Client,GW: Request Phase
    Client->>GW: POST /tools/call (invoke_tool)
    GW->>GW: Process tool invocation

    Note over GW,Plugin: Billing Event Generation
    GW->>Plugin: post_request_hook(transaction_metadata)
    Plugin->>Plugin: Generate billing event<br/>(user, tool, cost, metrics)

    Note over Plugin,Buffer: Reliability Layer
    Plugin->>Buffer: Persist event (txn_id, event_data)
    Buffer-->>Plugin: Confirmed persisted

    Note over Plugin,Queue: Guaranteed Delivery
    Plugin->>Queue: Publish event (with txn_id)
    alt Queue Available
        Queue-->>Plugin: Delivery confirmed
        Plugin->>Buffer: Mark event as delivered
    else Queue Unavailable
        Queue--xPlugin: Delivery failed
        Plugin->>Plugin: Schedule retry<br/>(exponential backoff)
        Plugin-->>GW: Degraded mode warning
    end

    Plugin-->>GW: Billing event handled
    GW-->>Client: Tool result (with txn_id)

    Note over Queue,Billing: Billing Processing
    Queue->>Billing: Consume billing event
    Billing->>Billing: Process usage & update wallet
    Billing-->>Queue: Ack message

📐 Billing Event Schema

{
  "transaction_id": "txn_abc123xyz789",
  "event_type": "tool_invocation",
  "timestamp": "2025-10-20T14:30:45.123Z",
  "tenant": {
    "tenant_id": "tenant_acme_corp",
    "workspace_id": "workspace_engineering",
    "user_id": "user_john_doe",
    "user_email": "[email protected]"
  },
  "resource": {
    "type": "tool",
    "name": "pdf_converter",
    "server_id": "server_document_processing",
    "gateway_id": "gateway_us_east_1"
  },
  "consumption": {
    "execution_time_ms": 1250,
    "tokens_consumed": 0,
    "api_calls": 1,
    "data_transfer_bytes": 245678
  },
  "cost": {
    "amount": 0.05,
    "currency": "USD",
    "pricing_tier": "standard"
  },
  "metadata": {
    "request_id": "req_456def",
    "client_ip": "203.0.113.42",
    "user_agent": "Claude-Desktop/1.2.3",
    "success": true,
    "error_code": null
  },
  "detail_level": "standard",
  "schema_version": "1.0"
}
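
To make the schema above concrete, here is a hedged sketch of how an event of this shape might be assembled and serialized. It covers only the `tenant`, `resource` and `consumption` sections; `user_email`, `cost` and `metadata` are omitted for brevity. The class names, the UUID-based transaction ID and the helper functions are assumptions, not a prescribed implementation.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def new_transaction_id() -> str:
    """Generate a unique ID; the 'txn_' prefix mirrors the sample above."""
    return f"txn_{uuid.uuid4().hex}"


def utc_timestamp() -> str:
    """ISO 8601 UTC timestamp with millisecond precision and a 'Z' suffix."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


@dataclass
class Tenant:
    tenant_id: str
    workspace_id: str
    user_id: str


@dataclass
class Resource:
    type: str
    name: str
    server_id: str
    gateway_id: str


@dataclass
class Consumption:
    execution_time_ms: int = 0
    tokens_consumed: int = 0
    api_calls: int = 0
    data_transfer_bytes: int = 0


@dataclass
class BillingEvent:
    event_type: str
    tenant: Tenant
    resource: Resource
    consumption: Consumption
    transaction_id: str = field(default_factory=new_transaction_id)
    timestamp: str = field(default_factory=utc_timestamp)
    detail_level: str = "standard"
    schema_version: str = "1.0"

    def to_json(self) -> str:
        """Serialize nested dataclasses to a JSON document."""
        return json.dumps(asdict(self), sort_keys=True)
```

Carrying `schema_version` on every event lets consumers handle older and newer layouts side by side when the schema changes.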

📐 Plugin Configuration

# plugins/config.yaml
plugins:
  - name: billing_delivery
    enabled: true
    priority: 100
    config:
      # Queue Backend Configuration
      backend: kafka  # kafka | rabbitmq | zeromq | ibm_mq | webhook

      # Kafka Configuration
      kafka:
        bootstrap_servers: "kafka-1.example.com:9092,kafka-2.example.com:9092"
        topic: "mcpgateway.billing.events"
        security_protocol: SASL_SSL
        sasl_mechanism: PLAIN
        sasl_username: "${KAFKA_USERNAME}"
        sasl_password: "${KAFKA_PASSWORD}"
        acks: all  # Wait for all in-sync replicas
        compression_type: gzip

      # RabbitMQ Configuration (alternative)
      rabbitmq:
        host: "rabbitmq.example.com"
        port: 5672
        virtual_host: "/billing"
        queue: "mcpgateway.billing.events"
        durable: true
        username: "${RABBITMQ_USERNAME}"
        password: "${RABBITMQ_PASSWORD}"

      # Reliability Settings
      reliability:
        local_buffer_enabled: true
        local_buffer_path: "/var/lib/mcpgateway/billing_buffer.db"
        retry_max_attempts: 10
        retry_initial_delay_ms: 1000
        retry_max_delay_ms: 300000
        retry_exponential_base: 2

      # Event Configuration
      events:
        detail_level: standard  # minimal | standard | comprehensive
        include_request_body: false
        include_response_body: false
        include_headers: false

      # Cost Calculation (optional - can be done in billing system)
      pricing:
        enabled: true
        default_tool_cost: 0.01
        default_resource_cost: 0.005
        default_prompt_cost: 0.001

      # Monitoring
      metrics:
        enabled: true
        prometheus_namespace: mcpgateway_billing

🔗 MCP Standards Check

  • [x] Change adheres to current MCP specifications
  • [x] No breaking changes to existing MCP-compliant integrations
  • [ ] If deviations exist, please describe them below:

Standards Compliance:

  • Uses existing plugin framework (ADR-016) for implementation
  • Leverages existing plugin hooks (pre/post request/response)
  • Does not modify MCP protocol or wire format
  • Transparent to MCP clients - billing happens in gateway middleware
  • Compatible with all MCP transports (HTTP, SSE, WebSocket, stdio)

Implementation Notes:

  • Plugin operates independently of MCP protocol flow
  • Billing events are generated from transaction metadata, not MCP messages directly
  • Plugin does not interfere with MCP request/response semantics
  • Delivery failures do not block or fail client requests (degraded mode)

🔄 Alternatives Considered

  1. Use Existing Observability Logs for Billing

    • Rejected: Logs are not designed for guaranteed delivery
    • Logs may be rotated, truncated, or lost
    • No delivery confirmation or retry mechanisms
    • Observability and financial tracking have different reliability requirements
    • Epic explicitly states: "independent of support/management log delivery"
  2. Database-Based Billing Event Storage

    • Considered: Store billing events in PostgreSQL and poll for export
    • Limitation: Adds latency and complexity to transaction processing
    • Limitation: Database becomes a single point of failure (SPOF) for billing accuracy
    • Decision: Use DB as local buffer only, not primary delivery mechanism
  3. Synchronous Billing API Calls

    • Rejected: Adds latency to every request
    • Billing system unavailability would block gateway operations
    • No natural retry or buffering mechanism
    • Violates separation of concerns
  4. Custom Message Protocol (Non-Standard Queues)

    • Rejected: Enterprises already have message queue infrastructure
    • Custom protocol requires additional integration work
    • Standard queues (Kafka, RabbitMQ) provide proven guarantees
    • Epic explicitly mentions: "kafka, rabbitmq, zeromq, MQSeries"
  5. Webhook-Based Event Delivery

    • Considered: HTTP POST to billing service endpoint
    • Limitation: Requires custom retry logic
    • Limitation: No standard queuing or backpressure handling
    • Decision: Support as an option, but queues are preferred

📓 Additional Context

Independence from Logging and Observability:

The Epic emphasizes: "Because of the required delivery component implicit in billing, this implementation should be independent of support/management log delivery and performance analytics. Reliability is the key component."

This means:

  • Separate Infrastructure: Billing events use dedicated message queues, not log aggregators
  • Separate Configuration: Billing plugin config is independent of logging config
  • Separate Reliability Guarantees: Billing requires stronger guarantees than logs
  • Separate Monitoring: Billing delivery metrics separate from application metrics
  • Separate Access Controls: Billing data may have different security requirements

Message Queue Selection Rationale:

The Epic lists several queue systems. Selection criteria:

| Queue System | Use Case | Strengths |
|---|---|---|
| Kafka | High-throughput, cloud-native | Distributed, scalable, event streaming |
| RabbitMQ | Traditional enterprise messaging | AMQP standard, flexible routing, durable |
| ZeroMQ | Low-latency, embedded scenarios | Lightweight, no broker, TCP sockets |
| IBM MQ Series | Legacy enterprise systems | Proven reliability, mainframe integration |

Guaranteed Delivery Mechanics:

  1. Local Persistence: Write-ahead log to SQLite before external delivery
  2. Acknowledgment: Wait for queue broker confirmation
  3. Retry with Backoff: Exponential backoff on failure (1s → 2s → 4s → ... → 5min)
  4. Idempotency: Use transaction IDs to prevent duplicate billing
  5. Dead Letter Queues: Failed messages after max retries go to DLQ for investigation
  6. Monitoring: Prometheus metrics expose delivery health
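
Steps 3 and 5 above can be sketched as a capped exponential schedule followed by a hand-off to a dead letter sink. This is a sketch under assumptions: the parameter names mirror the `reliability` settings in the plugin configuration, `retry_max_attempts` is read as the number of retries after the first attempt, and `deliver_with_retry`, `publish` and `dead_letter` are hypothetical callables.

```python
import time
from typing import Callable, List


def backoff_delays_ms(max_attempts: int = 10, initial_ms: int = 1000,
                      max_ms: int = 300_000, base: int = 2) -> List[int]:
    """Delay before each retry: initial_ms * base**n, capped at max_ms."""
    return [min(initial_ms * base ** n, max_ms) for n in range(max_attempts)]


def _attempt(publish: Callable[[dict], bool], event: dict) -> bool:
    """Treat exceptions from the backend the same as an unconfirmed delivery."""
    try:
        return bool(publish(event))
    except Exception:
        return False


def deliver_with_retry(event: dict,
                       publish: Callable[[dict], bool],
                       dead_letter: Callable[[dict], None],
                       sleep: Callable[[float], None] = time.sleep,
                       max_attempts: int = 10, initial_ms: int = 1000,
                       max_ms: int = 300_000, base: int = 2) -> bool:
    """Try once, then retry on the backoff schedule; dead-letter if all fail."""
    if _attempt(publish, event):
        return True
    for delay_ms in backoff_delays_ms(max_attempts, initial_ms, max_ms, base):
        sleep(delay_ms / 1000)
        if _attempt(publish, event):
            return True
    dead_letter(event)  # retries exhausted: hand off for investigation
    return False
```

With the default settings the schedule is 1s, 2s, 4s and so on up to 256s, and the tenth retry is capped at 300s (5 minutes). In the gateway this loop would presumably run in a background worker rather than on the request path, so that delivery failures do not block client requests.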

Cost Calculation Options:

Option 1: Gateway-Side Pricing (Simpler)

  • Plugin includes pricing configuration
  • Cost calculated at event generation time
  • Simpler billing system integration
  • Limitation: Pricing changes require gateway config update

Option 2: Billing-System-Side Pricing (Flexible)

  • Gateway sends raw usage metrics only
  • Billing system applies pricing rules
  • Enables dynamic pricing, A/B testing, discounts
  • Recommended approach for complex pricing models
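
As an illustration of Option 2, a billing system could apply per-unit rates to the raw `consumption` metrics it receives. The rates below are invented for the example and `price_event` is a hypothetical function. The sketch uses `Decimal` rather than floats so that monetary arithmetic stays exact.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Mapping

# Hypothetical per-unit rates in USD; real rates would live in the billing system.
RATES = {
    "execution_time_ms": Decimal("0.00001"),
    "tokens_consumed": Decimal("0.000002"),
    "api_calls": Decimal("0.01"),
    "data_transfer_bytes": Decimal("0.00000001"),
}


def price_event(consumption: Mapping[str, int],
                rates: Mapping[str, Decimal] = RATES) -> Decimal:
    """Sum quantity * rate over the known metric types, rounded to 4 decimals."""
    total = Decimal("0")
    for metric, rate in rates.items():
        quantity = Decimal(str(consumption.get(metric, 0)))
        total += quantity * rate
    return total.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```

Because the gateway only ships quantities, a change to `RATES` takes effect in the billing system without a gateway configuration update, which is the main advantage this option claims.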

Related Existing Features:

  • Plugin framework (ADR-016) - foundation for billing plugin
  • Metrics system - separate from billing, used for observability
  • RBAC system - user/tenant identity for billing attribution
  • Transaction metadata tracking - basis for billing events

Configuration Variables (new):

# Billing Plugin
MCPGATEWAY_BILLING_PLUGIN_ENABLED=true
MCPGATEWAY_BILLING_BACKEND=kafka
MCPGATEWAY_BILLING_KAFKA_BOOTSTRAP_SERVERS=kafka-1:9092,kafka-2:9092
MCPGATEWAY_BILLING_KAFKA_TOPIC=mcpgateway.billing.events
MCPGATEWAY_BILLING_DETAIL_LEVEL=standard
MCPGATEWAY_BILLING_LOCAL_BUFFER_ENABLED=true
MCPGATEWAY_BILLING_LOCAL_BUFFER_PATH=/var/lib/mcpgateway/billing_buffer.db
MCPGATEWAY_BILLING_RETRY_MAX_ATTEMPTS=10
MCPGATEWAY_BILLING_METRICS_ENABLED=true
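
The variables above could be read into a typed settings object at startup. This is a sketch only: the `BillingSettings` class, the `load_billing_settings` function and every default value are assumptions, and the gateway's real settings mechanism may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

PREFIX = "MCPGATEWAY_BILLING_"


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class BillingSettings:
    enabled: bool
    backend: str
    kafka_bootstrap_servers: Tuple[str, ...]
    kafka_topic: str
    detail_level: str
    local_buffer_enabled: bool
    local_buffer_path: str  # empty string means "not configured"
    retry_max_attempts: int
    metrics_enabled: bool


def load_billing_settings(env: Optional[Mapping[str, str]] = None) -> BillingSettings:
    """Build settings from MCPGATEWAY_BILLING_* variables; defaults are assumed."""
    env = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return env.get(PREFIX + name, default)

    servers = get("KAFKA_BOOTSTRAP_SERVERS", "")
    return BillingSettings(
        enabled=_as_bool(get("PLUGIN_ENABLED", "false")),
        backend=get("BACKEND", "kafka"),
        kafka_bootstrap_servers=tuple(s.strip() for s in servers.split(",") if s.strip()),
        kafka_topic=get("KAFKA_TOPIC", "mcpgateway.billing.events"),
        detail_level=get("DETAIL_LEVEL", "standard"),
        local_buffer_enabled=_as_bool(get("LOCAL_BUFFER_ENABLED", "true")),
        local_buffer_path=get("LOCAL_BUFFER_PATH", ""),
        retry_max_attempts=int(get("RETRY_MAX_ATTEMPTS", "10")),
        metrics_enabled=_as_bool(get("METRICS_ENABLED", "true")),
    )
```

Passing an explicit mapping instead of `os.environ` makes the loader easy to exercise in unit tests.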

Security Considerations:

  • Credential Management: Queue credentials stored in environment variables or vault
  • Encryption: TLS/SSL for queue connections
  • Access Control: RBAC for billing plugin configuration
  • Data Privacy: Configurable event detail levels (exclude sensitive payloads)
  • Audit Trail: Billing events themselves serve as audit log

Testing Strategy:

  • Unit Tests: Event generation, serialization, local persistence
  • Integration Tests: Kafka/RabbitMQ delivery with testcontainers
  • Reliability Tests: Simulate queue failures, verify retry and buffering
  • Performance Tests: High-volume event generation (10K events/sec)
  • Idempotency Tests: Verify deduplication with duplicate transaction IDs
  • Compliance Tests: Verify all billable transactions generate events

Implementation Phases:

Phase 1: Core Plugin Framework

  • Billing event schema and generation
  • Post-request hook integration
  • Local SQLite buffer for persistence
  • Basic Kafka integration

Phase 2: Reliability Features

  • Retry mechanism with exponential backoff
  • Delivery confirmation tracking
  • Prometheus metrics for monitoring
  • Admin UI for delivery status

Phase 3: Multi-Backend Support

  • RabbitMQ integration
  • ZeroMQ integration
  • IBM MQ Series integration
  • Webhook fallback option

Phase 4: Advanced Features

  • Configurable event detail levels
  • Gateway-side pricing (optional)
  • Dead letter queue handling
  • Multi-tenant billing isolation

Documentation Requirements:

  • Plugin installation and configuration guide
  • Message queue setup tutorials (Kafka, RabbitMQ)
  • Billing event schema reference
  • Reliability guarantees and SLAs
  • Integration guide for billing systems
  • Troubleshooting guide for delivery failures
  • Compliance and audit documentation

Compliance Standards:

  • SOX (Sarbanes-Oxley): Accurate financial record-keeping
  • PCI DSS: If processing payment card data
  • GDPR: If billing includes personal data (ensure data minimization)
  • ISO 27001: Information security for financial data

Posted by jonpspri, Oct 20 '25 10:10