opentelemetry-dotnet-contrib
OpenTelemetry places different HTTP requests under the same trace
Hello,
NuGet packages:
- OpenTelemetry.Exporter.Jaeger Version=1.2.0-rc1
- OpenTelemetry.Exporter.OpenTelemetryProtocol Version=1.2.0-rc1
- OpenTelemetry.Exporter.Zipkin Version=1.2.0-rc1
- OpenTelemetry.Extensions.Hosting Version=1.0.0-rc8
- OpenTelemetry.Instrumentation.AspNetCore Version=1.0.0-rc8
- OpenTelemetry.Instrumentation.Http Version=1.0.0-rc8
- OpenTelemetry.Instrumentation.SqlClient Version=1.0.0-rc8
- OpenTelemetry.Instrumentation.StackExchangeRedis Version=1.0.0-rc8
Runtime version
- net6.0
Symptom
We have configured a web API (with a single endpoint right now) to use OpenTelemetry libraries in order to capture:
- Incoming requests
- Outgoing requests
- Database (SqlServer) commands
- Redis communication
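For reference, our tracing setup looks roughly like the following (a minimal sketch of the typical `AddOpenTelemetryTracing` registration for these packages; the service name and the `redisConnection` variable are placeholders, not our real values):

```csharp
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

// Placeholder connection; the real multiplexer comes from configuration.
IConnectionMultiplexer redisConnection =
    ConnectionMultiplexer.Connect("localhost:6379");
builder.Services.AddSingleton(redisConnection);

builder.Services.AddOpenTelemetryTracing(tracing => tracing
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("my-api"))
    .AddAspNetCoreInstrumentation()           // incoming requests
    .AddHttpClientInstrumentation()           // outgoing requests
    .AddSqlClientInstrumentation()            // SQL Server commands
    .AddRedisInstrumentation(redisConnection) // Redis communication
    .AddJaegerExporter());                    // export to the Jaeger sidecar agent
```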
The behavior inside the endpoint is the following:
- Receive query parameters
- Check if the data is in Redis
- If yes, return the value
- If no, query the database, store the value in Redis, and return it
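The flow above is a standard cache-aside pattern; sketched with hypothetical route, key, and data-access names (our real endpoint differs):

```csharp
// Hypothetical endpoint illustrating the described flow.
app.MapGet("/items/{id}", async (string id, IConnectionMultiplexer redis, AppDb db) =>
{
    var cache = redis.GetDatabase();

    var cached = await cache.StringGetAsync(id);   // 1. check Redis
    if (cached.HasValue)
        return Results.Ok(cached.ToString());      // 2. cache hit: return value

    var value = await db.LoadItemAsync(id);        // 3. cache miss: query SQL Server
    await cache.StringSetAsync(id, value);         // 4. store the value in Redis
    return Results.Ok(value);                      // 5. return the value
});
```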
We are exporting the information using Jaeger. Our app is deployed to Kubernetes, and the Jaeger agent is deployed as a sidecar. Jaeger uses Elasticsearch as its storage backend. Jaeger version: 1.28.0.
We noticed that a significant number of requests are placed under a common trace ID. (Please check the attached images.)
What is the expected behavior?
We would expect each request to be under an independent trace ID. This happens for many requests, but not all; it is hard to say which is the majority.
We found a trace with more than 50,000 spans inside. It had been running for more than a day, and thousands of requests were placed under that single TraceId, handled as spans.
Reproduce
I cannot reproduce the problem.
We have the same code in the development environment, on a different cluster with its own Jaeger infrastructure (Elasticsearch, same version) but with fewer resources than production.
Initially I thought it was a matter of lost traces, since I had noticed something similar locally while developing another API with Jaeger using in-memory storage. I had read some articles saying that refreshing the Jaeger UI fixes the issue, but this is not our case.
I tried to reproduce the problem by stressing the development environment, without success.
Production has much better resources than development. Production receives about 100 requests per second, while I stressed development with more than 5,000.
On the same production cluster we have other web APIs, configured with OpenTelemetry and sending data to the same Jaeger infrastructure, without facing the same problem.