Comprehensive OpenTelemetry metrics collection
Summary
Add comprehensive OpenTelemetry metrics collection to complement the existing tracing capabilities. Fedify currently collects only spans for tracing and has no metrics for monitoring performance, throughput, error rates, or system health in production environments.
Problem
While Fedify provides excellent OpenTelemetry tracing support for debugging and understanding request flows, operators and developers lack crucial metrics for:
- Performance monitoring: Request latencies, throughput rates, and processing times
- Error tracking: Failure rates, retry attempts, and error types across different operations
- Resource utilization: Queue depths, cache hit rates, and concurrent request counts
- Federation health: Activity delivery success rates, peer connectivity, and signature verification metrics
- Capacity planning: Understanding bottlenecks and scaling requirements
This gap makes it difficult to:
- Set up proper alerting for production issues
- Identify performance bottlenecks proactively
- Monitor the health of federation activities
- Plan for capacity and scaling needs
- Create comprehensive dashboards for operational visibility
Proposed Solution
Implement comprehensive OpenTelemetry metrics collection across key areas of Fedify operations:
1. HTTP & ActivityPub Request Metrics
- fedify_http_requests_total (Counter): Total HTTP requests by method, route, status
- fedify_http_request_duration_seconds (Histogram): Request processing latencies
- fedify_activitypub_requests_total (Counter): ActivityPub operations by type and status
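A minimal sketch of how the HTTP metrics above might be recorded. The `Counter` and `Histogram` interfaces are local stand-ins that mirror the `add()` and `record()` methods of OpenTelemetry instruments, so the example runs without dependencies. `recordHttpRequest`, `HttpInstruments`, and the attribute names are hypothetical and not part of Fedify.

```typescript
// Local stand-ins mirroring the add()/record() shape of OpenTelemetry
// instruments, so the sketch runs without any dependency.
type Attributes = Record<string, string | number | boolean>;

interface Counter {
  add(value: number, attributes?: Attributes): void;
}

interface Histogram {
  record(value: number, attributes?: Attributes): void;
}

// Hypothetical grouping of the two proposed HTTP instruments.
interface HttpInstruments {
  requestsTotal: Counter; // fedify_http_requests_total
  requestDuration: Histogram; // fedify_http_request_duration_seconds
}

// Hypothetical helper: records one finished request on both instruments.
function recordHttpRequest(
  instruments: HttpInstruments,
  method: string,
  route: string,
  status: number,
  durationMs: number,
): void {
  const attributes: Attributes = { method, route, status };
  instruments.requestsTotal.add(1, attributes);
  // The proposed histogram is in seconds, so convert from milliseconds.
  instruments.requestDuration.record(durationMs / 1000, attributes);
}

// In-memory instruments, for demonstration only.
const counted: Array<{ value: number; attributes?: Attributes }> = [];
const durations: Array<{ value: number; attributes?: Attributes }> = [];
const demoInstruments: HttpInstruments = {
  requestsTotal: {
    add: (value, attributes) => {
      counted.push({ value, attributes });
    },
  },
  requestDuration: {
    record: (value, attributes) => {
      durations.push({ value, attributes });
    },
  },
};

// Pass the route template rather than the raw path to keep cardinality low.
recordHttpRequest(demoInstruments, "POST", "/users/{identifier}/inbox", 202, 250);
```

With the real `@opentelemetry/api` package, the instruments would presumably come from `metrics.getMeter(...)` followed by `createCounter(...)` and `createHistogram(...)` instead of the in-memory objects used here.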
2. Federation Activity Metrics
- fedify_inbox_activities_total (Counter): Received activities by type and processing status
- fedify_outbox_activities_total (Counter): Sent activities by type and delivery status
- fedify_activity_fanout_recipients (Histogram): Number of recipients per activity
- fedify_activity_delivery_duration_seconds (Histogram): Activity delivery latencies
- fedify_activity_delivery_retries_total (Counter): Retry attempts by activity type
3. Queue & Task Processing Metrics
- fedify_queue_depth (Gauge): Current queue sizes for inbox/outbox processing
- fedify_queue_processing_duration_seconds (Histogram): Task processing times
- fedify_queue_tasks_total (Counter): Completed, failed, and retried tasks
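One way the queue depth gauge might be fed is through an observable-gauge callback that reports the current depth of each queue at collection time. The sketch below assumes this approach. `QueueDepthTracker` is hypothetical, and `ObservableResult` is a local stand-in for the object such a callback receives.

```typescript
// Local stand-in for the result object passed to an observable-gauge callback.
interface ObservableResult {
  observe(value: number, attributes?: Record<string, string>): void;
}

// Hypothetical tracker: keeps the current depth of each named queue.
class QueueDepthTracker {
  private readonly depths = new Map<string, number>();

  enqueued(queue: string): void {
    this.depths.set(queue, (this.depths.get(queue) ?? 0) + 1);
  }

  dequeued(queue: string): void {
    // Clamp at zero so a stray dequeue never reports a negative depth.
    this.depths.set(queue, Math.max(0, (this.depths.get(queue) ?? 0) - 1));
  }

  // Callback body for fedify_queue_depth: reports one value per queue.
  observe(result: ObservableResult): void {
    this.depths.forEach((depth, queue) => {
      result.observe(depth, { queue });
    });
  }
}

const tracker = new QueueDepthTracker();
tracker.enqueued("inbox");
tracker.enqueued("inbox");
tracker.enqueued("outbox");
tracker.dequeued("inbox");

// Collect one observation pass into a plain object for inspection.
const snapshot: Record<string, number> = {};
tracker.observe({
  observe: (value, attributes) => {
    snapshot[attributes?.queue ?? "unknown"] = value;
  },
});
```

An in-process counter like this only sees tasks handled by one process. A queue backed by a shared store would need to be queried for its actual depth instead.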
4. Security & Authentication Metrics
- fedify_http_signatures_total (Counter): HTTP signature operations and success rates
- fedify_ld_signatures_total (Counter): Linked Data signature operations
- fedify_object_integrity_proofs_total (Counter): Object integrity proof operations
- fedify_signature_verification_duration_seconds (Histogram): Signature verification times
5. Caching & Key Management Metrics
- fedify_key_lookups_total (Counter): Key lookups with cache hit/miss status
- fedify_key_cache_size (Gauge): Number of cached keys
- fedify_key_fetch_duration_seconds (Histogram): Remote key fetch latencies
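The hit/miss split on key lookups could be recorded at the point where the cache is consulted. The following is a rough sketch under that assumption. `InstrumentedKeyCache` and the `cache` attribute name are hypothetical, and `Counter` is a local stand-in for an OpenTelemetry counter.

```typescript
// Local stand-in mirroring the add() shape of an OpenTelemetry counter.
interface Counter {
  add(value: number, attributes?: Record<string, string>): void;
}

// Hypothetical cache wrapper that counts every lookup as a hit or a miss.
class InstrumentedKeyCache<K> {
  private readonly cache = new Map<string, K>();
  private readonly lookups: Counter; // fedify_key_lookups_total

  constructor(lookups: Counter) {
    this.lookups = lookups;
  }

  // Returns the cached key, if any, and records the outcome.
  lookup(keyId: string): K | undefined {
    const cached = this.cache.get(keyId);
    this.lookups.add(1, { cache: cached === undefined ? "miss" : "hit" });
    return cached;
  }

  // Called by the caller after a successful remote fetch.
  store(keyId: string, key: K): void {
    this.cache.set(keyId, key);
  }

  // Value to report as fedify_key_cache_size.
  get size(): number {
    return this.cache.size;
  }
}

// In-memory counter, for demonstration only.
const tally: Record<string, number> = {};
const keyCache = new InstrumentedKeyCache<string>({
  add: (value, attributes) => {
    const outcome = attributes?.cache ?? "unknown";
    tally[outcome] = (tally[outcome] ?? 0) + value;
  },
});

const keyId = "https://example.com/users/alice#main-key";
keyCache.lookup(keyId); // first lookup: miss
keyCache.store(keyId, "public-key-placeholder");
keyCache.lookup(keyId); // second lookup: hit
```

Keeping the outcome as an attribute on a single counter, rather than two separate counters, lets a dashboard compute the hit rate from one metric.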
6. WebFinger & Discovery Metrics
- fedify_webfinger_lookups_total (Counter): WebFinger lookup requests and outcomes
- fedify_actor_handle_lookups_total (Counter): Actor handle resolution attempts
7. Collection & Pagination Metrics
- fedify_collection_items_total (Histogram): Items per collection by type
- fedify_collection_page_requests_total (Counter): Collection pagination requests
Configuration Options
- Allow selective enabling/disabling of metric categories
- Configurable histogram buckets for different metric types
- Label cardinality controls to prevent metric explosion
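To make these options concrete, here is one possible shape for them. Everything in this sketch is hypothetical, including `MetricsOptions`, the category names, and both helper functions. It only illustrates how category toggles, bucket overrides, and a label allowlist could fit together.

```typescript
// Hypothetical category names, one per metric group in this proposal.
type MetricCategory =
  | "http"
  | "federation"
  | "queue"
  | "security"
  | "keys"
  | "discovery"
  | "collections";

// Hypothetical options object; not an existing Fedify API.
interface MetricsOptions {
  // Categories to collect; when omitted, every category is collected.
  enabledCategories?: MetricCategory[];
  // Histogram bucket boundaries per category, in the unit of the metric.
  histogramBuckets?: Partial<Record<MetricCategory, number[]>>;
  // Attribute keys allowed on data points; when omitted, all keys pass.
  allowedLabels?: string[];
}

function isCategoryEnabled(
  options: MetricsOptions,
  category: MetricCategory,
): boolean {
  return options.enabledCategories === undefined ||
    options.enabledCategories.includes(category);
}

// Drops every attribute whose key is not on the allowlist, which bounds
// the number of distinct label combinations a metric can produce.
function limitLabels(
  options: MetricsOptions,
  attributes: Record<string, string>,
): Record<string, string> {
  if (options.allowedLabels === undefined) return attributes;
  const limited: Record<string, string> = {};
  for (const key of options.allowedLabels) {
    const value = attributes[key];
    if (value !== undefined) limited[key] = value;
  }
  return limited;
}

const demoOptions: MetricsOptions = {
  enabledCategories: ["http", "queue"],
  histogramBuckets: { http: [0.005, 0.05, 0.5, 5] },
  allowedLabels: ["method", "status"],
};

// The per-actor attribute is dropped because it is not on the allowlist.
const labels = limitLabels(demoOptions, {
  method: "POST",
  status: "202",
  actor: "https://example.com/users/alice",
});
```

An allowlist is only one way to bound cardinality. Capping the number of distinct values per label would be another.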
Alternatives Considered
- External monitoring only: Relying on external reverse proxies or APM tools, but this misses internal application-specific metrics like queue depths and federation-specific operations.
- Custom metrics implementation: Building a proprietary metrics system, but OpenTelemetry provides standardization and broad ecosystem support.
- Gradual implementation: Starting with just HTTP metrics, but comprehensive coverage from the start provides better operational visibility.
Scope / Dependencies
Affected Components
- Core Federation class and request handling
- Message queue implementations
- Signature verification and key management
- WebFinger and actor discovery
- Collection dispatchers
- Activity processing pipelines
Implementation Areas
- New metrics collection throughout existing span instrumentation points
- Configuration options in CreateFederationOptions
- Documentation updates for metrics configuration
- Example configurations for popular monitoring systems (Prometheus, etc.)
Dependencies
- OpenTelemetry metrics API (already available in the JS SDK)
- Minimal performance overhead in high-throughput scenarios
- Backward compatibility with existing tracing configuration
Testing Requirements
- Unit tests for metrics collection accuracy
- Performance benchmarks to ensure minimal overhead
- Integration tests with popular OpenTelemetry backends
This enhancement would significantly improve Fedify's production observability while maintaining the same ease-of-use that developers expect from the framework.
@dahlia I'll try it!
@beberiche This issue has been assigned for over two weeks without updates. Please provide a status update, or unassign yourself if you're unable to continue working on it.