Comprehensive OpenTelemetry metrics collection

Open dahlia opened this issue 5 months ago • 3 comments

Summary

Add comprehensive OpenTelemetry metrics collection to complement the existing tracing capabilities. Fedify currently collects only tracing spans; it has no metrics for monitoring performance, throughput, error rates, or system health in production environments.

Problem

While Fedify provides excellent OpenTelemetry tracing support for debugging and understanding request flows, operators and developers lack crucial metrics for:

  1. Performance monitoring: Request latencies, throughput rates, and processing times
  2. Error tracking: Failure rates, retry attempts, and error types across different operations
  3. Resource utilization: Queue depths, cache hit rates, and concurrent request counts
  4. Federation health: Activity delivery success rates, peer connectivity, and signature verification metrics
  5. Capacity planning: Understanding bottlenecks and scaling requirements

This gap makes it difficult to:

  • Set up proper alerting for production issues
  • Identify performance bottlenecks proactively
  • Monitor the health of federation activities
  • Plan for capacity and scaling needs
  • Create comprehensive dashboards for operational visibility

Proposed Solution

Implement comprehensive OpenTelemetry metrics collection across key areas of Fedify operations:

1. HTTP & ActivityPub Request Metrics

  • fedify_http_requests_total (Counter): Total HTTP requests by method, route, status
  • fedify_http_request_duration_seconds (Histogram): Request processing latencies
  • fedify_activitypub_requests_total (Counter): ActivityPub operations by type and status
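
As a rough illustration of how the two HTTP instruments above could be fed from one place, here is a minimal sketch. `CounterLike` and `HistogramLike` are structural stand-ins that mirror the `add(value, attributes)` and `record(value, attributes)` calls of OpenTelemetry's `Counter` and `Histogram`; `HttpMetrics`, `recordHttpRequest`, and the in-memory classes are hypothetical names, not part of Fedify.

```typescript
// Attribute bag, mirroring the shape OpenTelemetry uses for metric attributes.
type Attributes = Record<string, string | number>;

// Minimal structural stand-ins for OpenTelemetry's Counter and Histogram.
interface CounterLike {
  add(value: number, attributes?: Attributes): void;
}
interface HistogramLike {
  record(value: number, attributes?: Attributes): void;
}

// Hypothetical bundle of the HTTP instruments proposed above.
interface HttpMetrics {
  requestsTotal: CounterLike; // fedify_http_requests_total
  requestDuration: HistogramLike; // fedify_http_request_duration_seconds
}

// Records one finished request on both instruments with the same labels.
function recordHttpRequest(
  metrics: HttpMetrics,
  req: { method: string; route: string; status: number },
  startedAtMs: number,
  finishedAtMs: number,
): void {
  const attributes: Attributes = {
    method: req.method,
    // Use the route template rather than the raw path to bound cardinality.
    route: req.route,
    status: req.status,
  };
  metrics.requestsTotal.add(1, attributes);
  // The metric name ends in _seconds, so convert from milliseconds.
  metrics.requestDuration.record((finishedAtMs - startedAtMs) / 1000, attributes);
}

// In-memory instruments, only for trying the sketch out without an SDK.
class MemoryCounter implements CounterLike {
  readonly points: { value: number; attributes: Attributes }[] = [];
  add(value: number, attributes: Attributes = {}): void {
    this.points.push({ value, attributes });
  }
}
class MemoryHistogram implements HistogramLike {
  readonly points: { value: number; attributes: Attributes }[] = [];
  record(value: number, attributes: Attributes = {}): void {
    this.points.push({ value, attributes });
  }
}
```

In a real integration the two instruments would presumably come from an OpenTelemetry `Meter` (`createCounter` and `createHistogram`) rather than from the in-memory classes.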

2. Federation Activity Metrics

  • fedify_inbox_activities_total (Counter): Received activities by type and processing status
  • fedify_outbox_activities_total (Counter): Sent activities by type and delivery status
  • fedify_activity_fanout_recipients (Histogram): Number of recipients per activity
  • fedify_activity_delivery_duration_seconds (Histogram): Activity delivery latencies
  • fedify_activity_delivery_retries_total (Counter): Retry attempts by activity type
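
To make the fan-out histogram more concrete, the sketch below shows how an explicit-bucket histogram would sort recipient counts into buckets. It follows the OpenTelemetry convention that each boundary is an inclusive upper bound. `BucketHistogram` and the example boundaries are assumptions chosen for illustration; the issue does not specify bucket values.

```typescript
// A tiny explicit-bucket histogram: boundaries are inclusive upper bounds,
// and one extra overflow bucket catches everything above the last boundary.
class BucketHistogram {
  readonly boundaries: readonly number[];
  readonly counts: number[];
  sum = 0;
  count = 0;

  constructor(boundaries: readonly number[]) {
    this.boundaries = boundaries;
    this.counts = new Array<number>(boundaries.length + 1).fill(0);
  }

  record(value: number): void {
    // First boundary that is >= value; -1 means the overflow bucket.
    const idx = this.boundaries.findIndex((b) => value <= b);
    const bucket = idx === -1 ? this.boundaries.length : idx;
    this.counts[bucket] = (this.counts[bucket] ?? 0) + 1;
    this.sum += value;
    this.count += 1;
  }
}

// Hypothetical boundaries for fedify_activity_fanout_recipients.
const fanout = new BucketHistogram([1, 10, 100, 1000]);
for (const recipients of [1, 7, 10, 250, 5000]) {
  fanout.record(recipients);
}
// fanout.counts is now [1, 2, 0, 1, 1]:
// (..1], (1..10], (10..100], (100..1000], (1000..)
```

Roughly logarithmic boundaries like these suit fan-out, where most activities reach a handful of recipients and a few reach thousands.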

3. Queue & Task Processing Metrics

  • fedify_queue_depth (Gauge): Current queue sizes for inbox/outbox processing
  • fedify_queue_processing_duration_seconds (Histogram): Task processing times
  • fedify_queue_tasks_total (Counter): Completed, failed, and retried tasks
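
Queue depth is a point-in-time value, so it fits a gauge that is read at collection time instead of being updated on every operation. The sketch below assumes a callback shape like the one OpenTelemetry's observable gauges use, where the callback receives a result object and calls `observe(value, attributes)`. `InstrumentedQueue` and the `queue` label are hypothetical and are not Fedify's `MessageQueue` interface.

```typescript
type Attributes = Record<string, string>;

// Stand-in for the result object handed to an observable gauge callback.
interface ObservableResultLike {
  observe(value: number, attributes?: Attributes): void;
}
type GaugeCallback = (result: ObservableResultLike) => void;

// A simple in-memory FIFO queue that can report its own depth.
class InstrumentedQueue<T> {
  private readonly items: T[] = [];
  private readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  enqueue(item: T): void {
    this.items.push(item);
  }

  dequeue(): T | undefined {
    return this.items.shift();
  }

  // Callback for fedify_queue_depth: invoked by the metrics reader on each
  // collection cycle, so enqueue/dequeue stay free of metric bookkeeping.
  readonly observeDepth: GaugeCallback = (result) => {
    result.observe(this.items.length, { queue: this.name });
  };
}
```

With the OpenTelemetry API, `observeDepth` could be registered through `meter.createObservableGauge("fedify_queue_depth").addCallback(...)`. For queues backed by an external store, reading the depth may be expensive or unavailable, so that case would need separate handling.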

4. Security & Authentication Metrics

  • fedify_http_signatures_total (Counter): HTTP signature operations and success rates
  • fedify_ld_signatures_total (Counter): Linked Data signature operations
  • fedify_object_integrity_proofs_total (Counter): Object integrity proof operations
  • fedify_signature_verification_duration_seconds (Histogram): Signature verification times

5. Caching & Key Management Metrics

  • fedify_key_lookups_total (Counter): Key lookups with cache hit/miss status
  • fedify_key_cache_size (Gauge): Number of cached keys
  • fedify_key_fetch_duration_seconds (Histogram): Remote key fetch latencies
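
The hit/miss status on `fedify_key_lookups_total` could be recorded as in the following sketch. Everything here is illustrative: `InstrumentedKeyCache`, the `result` label, and the synchronous `load` callback are assumptions (a real remote key fetch would be asynchronous), and `CounterLike` only mirrors the `add(value, attributes)` call of OpenTelemetry's `Counter`.

```typescript
// Structural stand-in for an OpenTelemetry Counter.
interface CounterLike {
  add(value: number, attributes?: Record<string, string>): void;
}

class InstrumentedKeyCache {
  private readonly keys = new Map<string, string>();
  private readonly lookups: CounterLike;

  constructor(lookups: CounterLike) {
    this.lookups = lookups;
  }

  // Returns the cached key, or loads and caches it on a miss.
  // Every call increments the lookup counter exactly once.
  get(keyId: string, load: (keyId: string) => string): string {
    const cached = this.keys.get(keyId);
    if (cached !== undefined) {
      this.lookups.add(1, { result: "hit" });
      return cached;
    }
    this.lookups.add(1, { result: "miss" });
    const key = load(keyId);
    this.keys.set(keyId, key);
    return key;
  }

  // Value a fedify_key_cache_size gauge could report.
  get size(): number {
    return this.keys.size;
  }
}
```

A cache hit rate can then be derived on the monitoring side as hits divided by total lookups, without adding a separate ratio metric.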

6. WebFinger & Discovery Metrics

  • fedify_webfinger_lookups_total (Counter): WebFinger lookup requests and outcomes
  • fedify_actor_handle_lookups_total (Counter): Actor handle resolution attempts

7. Collection & Pagination Metrics

  • fedify_collection_items_total (Histogram): Items per collection by type
  • fedify_collection_page_requests_total (Counter): Collection pagination requests

Configuration Options

  • Allow selective enabling/disabling of metric categories
  • Configurable histogram buckets for different metric types
  • Label cardinality controls to prevent metric explosion
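
One possible shape for these options is sketched below. None of these names exist in Fedify today: `MetricsOptions`, `MetricCategory`, `isCategoryEnabled`, and `LabelLimiter` are hypothetical, and the cap-then-bucket-as-`"other"` strategy is only one of several ways to control label cardinality.

```typescript
// Hypothetical category names, loosely following the sections of this proposal.
type MetricCategory =
  | "http"
  | "activity"
  | "queue"
  | "signature"
  | "key"
  | "webfinger"
  | "collection";

// Hypothetical options object; it might live under CreateFederationOptions.
interface MetricsOptions {
  // Categories are enabled unless explicitly set to false.
  categories?: Partial<Record<MetricCategory, boolean>>;
  // Explicit histogram bucket boundaries per category.
  histogramBuckets?: Partial<Record<MetricCategory, number[]>>;
  // Maximum number of distinct values kept per label.
  maxValuesPerLabel?: number;
}

function isCategoryEnabled(
  options: MetricsOptions,
  category: MetricCategory,
): boolean {
  return options.categories?.[category] ?? true;
}

// Caps the number of distinct values per label; values seen after the cap
// is reached are folded into "other" so time series stay bounded.
class LabelLimiter {
  private readonly seen = new Map<string, Set<string>>();
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  limit(label: string, value: string): string {
    let values = this.seen.get(label);
    if (values === undefined) {
      values = new Set<string>();
      this.seen.set(label, values);
    }
    if (values.has(value)) return value;
    if (values.size >= this.max) return "other";
    values.add(value);
    return value;
  }
}
```

A limiter like this matters most for labels derived from remote input, such as peer hostnames, where the set of possible values is unbounded.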

Alternatives Considered

  1. External monitoring only: Relying on external reverse proxies or APM tools, but this misses internal application-specific metrics like queue depths and federation-specific operations.

  2. Custom metrics implementation: Building a proprietary metrics system, but OpenTelemetry provides standardization and broad ecosystem support.

  3. Gradual implementation: Starting with just HTTP metrics, but comprehensive coverage from the start provides better operational visibility.

Scope / Dependencies

Affected Components

  • Core Federation class and request handling
  • Message queue implementations
  • Signature verification and key management
  • WebFinger and actor discovery
  • Collection dispatchers
  • Activity processing pipelines

Implementation Areas

  • Metric recording at the existing span instrumentation points
  • Configuration options in CreateFederationOptions
  • Documentation updates for metrics configuration
  • Example configurations for popular monitoring systems (Prometheus, etc.)

Dependencies

  • OpenTelemetry metrics API (already available in the JS SDK)
  • Minimal performance overhead in high-throughput scenarios
  • Backward compatibility with existing tracing configuration

Testing Requirements

  • Unit tests for metrics collection accuracy
  • Performance benchmarks to ensure minimal overhead
  • Integration tests with popular OpenTelemetry backends

This enhancement would significantly improve Fedify's production observability while maintaining the same ease-of-use that developers expect from the framework.

dahlia, Jul 20 '25 13:07

@dahlia I'll try it!

beberiche, Aug 15 '25 07:08

@beberiche This issue has been assigned for over two weeks without updates. Please provide a status update, or unassign yourself if you're unable to continue working on it.

github-actions[bot], Oct 09 '25 09:10

@beberiche This issue has been assigned for over two weeks without updates. Please provide a status update, or unassign yourself if you're unable to continue working on it.

github-actions[bot], Oct 24 '25 01:10