[Enhancement]: Performance Counters with OpenTelemetry
What is it?
Use OpenTelemetry to add tracing events and top-level counters for exporting to monitors and the health endpoint.
Value prop
Besides aligning with industry trends, CorrelationId (TraceId) is crucial for cross-service traceability in OpenTelemetry. It links spans across multiple services, providing a full view of a request’s lifecycle in distributed systems.
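As a rough illustration of how a TraceId links spans across services, here is a minimal sketch of the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). It is written in plain Python rather than with the OpenTelemetry SDK, and the function names (`new_traceparent`, `extract_trace_id`, `inject_child`) are hypothetical; in practice the instrumentation libraries would do this for us.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context header: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def extract_trace_id(headers: dict) -> Optional[str]:
    """Return the trace id from request headers, or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None or match.group(1) == "0" * 32:
        return None
    return match.group(1)


def inject_child(headers: dict) -> dict:
    """Build outgoing headers that continue the incoming trace with a new span id."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        # No usable parent: this service becomes the root of a new trace.
        return {"traceparent": new_traceparent()}
    trace_id, _parent_span_id, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

Every hop keeps the same trace id and only swaps in its own span id, which is what lets a backend stitch the spans from several services into one trace.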
Versus logging
This does not mean Serilog is unnecessary. Serilog handles logging, while OpenTelemetry focuses on tracing and metrics. Both are useful and often work together.
OpenTelemetry
OpenTelemetry is a vendor-neutral standard for shipping metrics and traces, and it is supported in Azure (for example, by Azure Monitor).
Learn more
OpenTelemetry can work alongside ILogger for logging, but metrics and traces are handled separately. OpenTelemetry is designed for distributed tracing and metrics collection, while ILogger focuses on logging.
Concepts
| Concept | Description |
|---|---|
| Meter | A component that creates and manages metrics (e.g., counters, histograms) to track real-time performance data. |
| Metric | A value updated programmatically through an instrument; the common types are Counter and Histogram. |
| Counter | A single value that only increases (e.g., total requests). |
| Histogram | A distribution of recorded values (e.g., request durations). |
| Event | A point-in-time log or action recorded within a span. |
| Activity (Span) | A time-bound operation with a start and end, with zero or more events. |
| Trace | A collection of spans representing a full operation lifecycle across services. |
| TraceId | A correlation ID shared by every span in a trace; taken from the incoming request by middleware, or generated when none is present. |
| Propagator | Injects and extracts TraceId, typically via headers. |
| Exporter | Sends telemetry data to systems (e.g., Prometheus, Jaeger, Zipkin). |
| Sampler | Decides which traces to capture and whether to record/export them. |
| AlwaysOnSampler | Records all traces. |
| AlwaysOffSampler | Discards all traces. |
| ParentBasedSampler | Uses the sampling decision of the parent span. |
| TraceIdRatioBasedSampler | Samples a percentage of traces based on a ratio. |
| Resource | Metadata describing the entity producing telemetry data (e.g., service name). |
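To make the sampler rows more concrete, here is a small sketch of how ratio-based and parent-based sampling decisions can be made. It is plain Python, not the OpenTelemetry SDK. Comparing the low 64 bits of the trace id against a threshold is one common approach and an assumption here; the exact algorithm varies by SDK. The names `should_sample` and `parent_based` are hypothetical.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio sampling: the same trace id always gets the same decision."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace id as a uniform random number.
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < round(ratio * 2**64)


def parent_based(parent_sampled, trace_id_hex: str, ratio: float) -> bool:
    """Follow the parent's decision when there is one; otherwise decide at the root."""
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id_hex, ratio)
```

Because the decision depends only on the trace id, every service that sees the same trace reaches the same answer without coordination. A ratio of 1.0 behaves like AlwaysOnSampler and 0.0 like AlwaysOffSampler.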
Code Sample
Relevant NuGet packages
OpenTelemetry integrates with ASP.NET Core through the following NuGet packages:
| Package | Description |
|---|---|
| OpenTelemetry.Extensions.Hosting | Provides extensions for integrating OpenTelemetry into ASP.NET Core hosting services. |
| OpenTelemetry.Instrumentation.AspNetCore | Automatically instruments incoming HTTP requests in ASP.NET Core applications. |
| OpenTelemetry.Instrumentation.Runtime | Captures metrics about .NET runtime performance (e.g., GC, exceptions). |
| OpenTelemetry.Instrumentation.Http | Instruments outgoing HTTP requests to track their performance and errors. |
| OpenTelemetry.Exporter.Console | Exports telemetry data (metrics, traces, logs) to the console for development and debugging purposes. |
| OpenTelemetry.Exporter.Prometheus.AspNetCore | Exposes metrics in a format Prometheus can scrape, integrating with the Prometheus monitoring system. |
| OpenTelemetry.Instrumentation.SqlClient | Instruments database operations made via SQL client to track performance and errors in SQL queries. |
Metrics & Traces to Add to Data API builder
| Name | Type | Description | Partition |
|---|---|---|---|
| Request Count | Metric (Counter) | Tracks the number of API requests processed. | Per endpoint |
| Request Duration | Metric (Histogram) | Measures the time taken to process each API request. | Per endpoint |
| Error Rate | Metric (Counter) | Tracks the number of failed API requests; the rate is derived by dividing by Request Count. | Per endpoint |
| DB Query Span | Trace | Captures the duration of database queries per API request. | Per API request |
| Authorization Check | Trace | Tracks the time taken to validate user permissions. | Per API request |
| Cache Hit/Miss Event | Event | Logs when a cache hit or miss occurs during a request. | Per cache event |
| Startup Event | Event | Records the time and status when the API starts. | Global |
- Per API request: Captures metrics or traces for each individual API request, giving a detailed view of specific executions (e.g., database query times for each call).
- Per endpoint: Aggregates metrics or traces based on the API endpoint (e.g., /api1, /api2), providing overall performance stats for specific API routes.
- Per cache event: Tracks when cache hits or misses occur, logging each event as it happens.
- Global: Applies to the entire application or service (e.g., API startup events), capturing broad, system-wide metrics or events.
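To show what the "Per endpoint" partition could look like for the three metrics above, here is a minimal in-memory sketch. It is plain Python, not the OpenTelemetry metrics API, and the class name `EndpointMetrics` and the bucket boundaries are assumptions chosen only for illustration.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds; real boundaries would be tuned.
BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]


class EndpointMetrics:
    """Request count, error count, and duration histogram, keyed by endpoint."""

    def __init__(self):
        self.request_count = defaultdict(int)  # endpoint -> total requests
        self.error_count = defaultdict(int)    # endpoint -> failed requests
        # endpoint -> per-bucket counts; the extra slot holds values above the last bound
        self.duration_buckets = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))

    def record(self, endpoint, duration_ms, failed=False):
        self.request_count[endpoint] += 1
        if failed:
            self.error_count[endpoint] += 1
        # bisect_left finds the first bucket whose upper bound is >= duration_ms.
        bucket = bisect.bisect_left(BUCKETS_MS, duration_ms)
        self.duration_buckets[endpoint][bucket] += 1

    def error_rate(self, endpoint):
        """Failed requests divided by total requests, or 0.0 with no traffic."""
        total = self.request_count[endpoint]
        return self.error_count[endpoint] / total if total else 0.0


metrics = EndpointMetrics()
metrics.record("/api1", 12.0)
metrics.record("/api1", 480.0, failed=True)
metrics.record("/api2", 3.0)
```

With a real meter, the endpoint would be passed as a tag (attribute) on each measurement rather than used as a dictionary key, but the partitioning idea is the same.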
Discussion
- Should we update Application Insights?
- Should we support Prometheus /metrics?
- Do we need custom metrics considering what we already have?
- Fusion Cache DOES have OpenTelemetry support.
- Hot Chocolate DOES have OpenTelemetry support?
- The /metrics endpoint cannot be part of the REST or GraphQL path.
- How does the user configure this?
https://dateo-software.de/blog/improve-your-applications-observability-with-custom-health-checks
Hi @JerryNixon, regarding the configuration topics, I think we could try something like:
{
  "runtime": {
    ...
    "telemetry": {
      "otel": {
        "enabled": true,
        "endpoint": "@env('OTEL_EXPORTER_OTLP_ENDPOINT')"
      }
    },
    ...
  }
}
Doing it this way, the OTEL config can live alongside the existing Application Insights config, and we could handle both in code, as .NET Aspire does.
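As a rough sketch of how the proposed `otel` section might be read, the snippet below loads the config, resolves the `@env('...')` reference, and treats a missing endpoint as disabled. It is plain Python for illustration only, not Data API builder's actual config loader; `resolve_env` and `load_otel_settings` are hypothetical names, and the fallback behavior is an assumption.

```python
import json
import os
import re

# Matches a value that consists solely of an @env('NAME') reference.
_ENV_REF = re.compile(r"^@env\('([^']+)'\)$")


def resolve_env(value):
    """Replace an @env('NAME') string with that environment variable's value."""
    if isinstance(value, str):
        match = _ENV_REF.match(value)
        if match:
            return os.environ.get(match.group(1))
    return value


def load_otel_settings(config_text):
    """Return (enabled, endpoint) from runtime.telemetry.otel; disabled if missing."""
    config = json.loads(config_text)
    otel = config.get("runtime", {}).get("telemetry", {}).get("otel", {})
    enabled = bool(otel.get("enabled", False))
    endpoint = resolve_env(otel.get("endpoint"))
    # Assumption: without an endpoint there is nowhere to export to, so disable.
    return (enabled and endpoint is not None, endpoint)
```

Keeping the section optional and defaulting to disabled would mean existing config files keep working unchanged.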