Migrate from OpenTracing to OpenTelemetry instrumentation
What this PR does:
- migrate from OpenTracing to OpenTelemetry instrumentation
- update dskit to c41c58734f46 (requires https://github.com/grafana/dskit/pull/518)
Which issue(s) this PR fixes: Fixes #2224
Checklist
- [ ] Tests updated
- [x] Documentation added
- [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
This is a first draft to replace OpenTracing with the OpenTelemetry SDK.
Some open questions:
- is creating a tracer per package best practice? https://github.com/open-telemetry/opentelemetry-go/discussions/4532 (see the sketch after this list)
- OTEL only supports int64 attributes (which are later converted to uint64 internally); Tempo uses uint64 in a few places
- some attributes had to be converted to a string (I guess this is fine, and it was done this way previously as well)
- should we introduce new config options to select the exporter (Jaeger or OTEL, defaulting to Jaeger for backward compatibility)? As far as I can see, the OTEL_TRACES_EXPORTER env var doesn't support selecting Jaeger as an exporter.
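To make the per-package tracer question concrete, here is a minimal sketch of the pattern; the package name and attribute keys are illustrative, not the exact Tempo code, and it also shows the uint64-to-int64 and to-string conversions mentioned above:

```go
package ingester

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// One tracer per instrumented package, named after the package import path.
var tracer = otel.Tracer("github.com/grafana/tempo/modules/ingester")

func flushBlock(ctx context.Context, blockSize uint64, blockID string) {
	// The returned context would normally be passed to downstream calls.
	_, span := tracer.Start(ctx, "flushBlock")
	defer span.End()

	span.SetAttributes(
		// OTel numeric attributes are int64, so uint64 values are converted.
		attribute.Int64("block.size", int64(blockSize)),
		// Values without a matching attribute type are stringified.
		attribute.String("block.id", blockID),
	)
}
```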
Thanks for the PR, this looks like it's on the right track to me.
- Creating a new tracer per package does seem like the way to go for the transition to the OTEL SDK.
- uint64 -> int64 attributes are likely fine. I don't think TraceQL knows the difference.
- I don't worry too much about backwards compatibility here since I think the number of users who will care is small, but we need to mark it as a breaking change. This will impact us internally and we will need to make preparations for this to land in main.
One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear that it addresses the issue. Currently, many spans are generated by the dskit server, which ties the various components together since all components use it for gRPC and HTTP, and these spans use the OpenTracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.
I will mention that there has been some effort in the past to align this across all the databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a couple of branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all the databases gave us some pause previously, so some coordination may be prudent.
> One concern I have is whether we lose the server spans. I see the dskit change linked above, but I'm not clear that it addresses the issue. Currently, many spans are generated by the dskit server, which ties the various components together since all components use it for gRPC and HTTP, and these spans use the OpenTracing SDK. Are you able to confirm the impact here? I can test this PR out once you give the nod that you're ready.
So far my testing consisted of running two Tempo instances (monolithic mode), one with this PR and one without, setting up two data sources in Grafana, opening the explore page in separate tabs, and comparing random samples of traces. For example, the context propagation of the HTTP GET - api_search trace was initially broken, but this is fixed by https://github.com/grafana/dskit/pull/518.
The HTTP <method> - <routename> spans are created in dskit using the OpenTracing SDK, and they are correctly linked with the ones from the OTEL SDK by the OpenTracing bridge (@kvrhdn set up the bridge in https://github.com/grafana/tempo/pull/842; I didn't change this part).
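For readers who haven't seen the bridge before, the wiring looks roughly like this (a minimal sketch using the standard go.opentelemetry.io/otel/bridge/opentracing API, not the exact Tempo code):

```go
import (
	opentracing "github.com/opentracing/opentracing-go"
	"go.opentelemetry.io/otel"
	otbridge "go.opentelemetry.io/otel/bridge/opentracing"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func installOpenTracingBridge(tp *sdktrace.TracerProvider) {
	// Wrap an OTel tracer in a bridge tracer plus a matching wrapper provider.
	bridgeTracer, wrappedProvider := otbridge.NewTracerPair(tp.Tracer("tempo"))

	// OpenTracing callers (e.g. dskit's server middleware) go through the
	// bridge, so their spans are recorded by the OTel SDK and share trace
	// context with spans started directly via the OTel API.
	opentracing.SetGlobalTracer(bridgeTracer)
	otel.SetTracerProvider(wrappedProvider)
}
```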
Yes, you can test this PR already. I haven't tested microservices mode yet; I'll do that next (that's why the PR is still in draft mode, and I also want to compare more sample traces).
> I will mention that there has been some effort in the past to align this across all the databases, and since dskit is in the mix, we may not be able to make this change on our own. I've had a couple of branches of my own which look similar, but the concern above blocked forward movement. I expect that we can have updated information here next quarter, but the complexity of moving all the databases gave us some pause previously, so some coordination may be prudent.
I think it should work fine if dskit is instrumented with OpenTracing, as long as we don't remove the OpenTracing-OpenTelemetry bridge.
I tested it in microservices mode today, looks good so far. Please let me know if you find any broken traces.
Greetings. Wondering if we'll need to update the docs for your PR changes. Will this migration impact users? For example, are there changes to configuration files? If there are updates, we can create a doc issue for your work, or you can add the doc updates to this or a separate PR.
I took some time to deploy this, and it looks pretty good to me. Nice work. Here is a 6-hour window, and you can't really tell when the deployment happened.
Looking at alloy for the same period, you can see the transition in receiver protocol, but not much else.
It might be nice to include an option for using gRPC. Additionally, to @knylander-grafana's point, we could call out the change in environment variables where it makes sense. I changed OTEL_EXPORTER_JAEGER_ENDPOINT -> OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, which was enough.
@knylander-grafana Yes, there is impact for some users. I'll update the docs in this PR (this section: https://grafana.com/docs/tempo/latest/operations/monitor/#traces).
I'll also add gRPC support and update the env vars next week.
@zalegrala do you have a metric for the (average) number of spans per trace in your deployment? I'm mainly concerned about broken traces (broken context propagation), this should be visible if the number of spans per trace is different before and after applying this PR.
I don't have that metric currently, but I think we can work that up, perhaps next week. I expect the change from grpc -> http will mean more negotiation and a slight increase in spans, but I share the concern about broken traces. If we do have some after the merge, then we can follow up, but it would be good to identify as much as we can ahead of time.
I moved from the otlptracehttp exporter to autoexport, which lets us control the transport via OTEL_EXPORTER_OTLP_PROTOCOL (it defaults to http/protobuf).
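For illustration, the setup looks roughly like this (a minimal sketch; the function name and wiring are assumptions, not the exact Tempo code):

```go
import (
	"context"

	"go.opentelemetry.io/contrib/exporters/autoexport"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// autoexport picks the exporter and transport from the standard OTel
	// environment variables, e.g. OTEL_TRACES_EXPORTER and
	// OTEL_EXPORTER_OTLP_PROTOCOL (defaults to http/protobuf).
	exporter, err := autoexport.NewSpanExporter(ctx)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter)), nil
}
```

With that in place, switching the transport to gRPC should just be a matter of setting OTEL_EXPORTER_OTLP_PROTOCOL=grpc at deploy time.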
I also updated the docs, changelog, docker compose example file and resolved all merge conflicts.
Today I noticed one issue in the logs around OpenCensus/metrics:
level=error ts=2024-05-29T14:23:47.75153709Z caller=main.go:259 msg=OpenTelemetry.ErrorHandler err="starting span \"ExportMetrics\": unsupported sampler: 0x1e42800"
As far as I can see, it's caused by the OpenCensus bridge: https://github.com/census-instrumentation/opencensus-go/blob/master/metric/metricexport/reader.go#L191 starts a span with the sampler from https://github.com/census-instrumentation/opencensus-go/blob/master/metric/metricexport/reader.go#L30, which the bridge can't handle: https://github.com/open-telemetry/opentelemetry-go/blob/89c32cbd368fc786f5d3963f6979aedce4b27bb6/bridge/opencensus/internal/oc2otel/tracer_start_options.go#L31-L33
I don't know where Tempo uses OpenCensus metrics though; as far as I can see, Tempo uses prometheus/client_golang for metrics.
The error you point to isn't new with this change, so it's probably safe to ignore, though it would be nice to clean up.
Found the one occurrence of OpenCensus metrics: modules/distributor/receiver/metrics.go. It registers a few metrics with the DefaultRegisterer of prometheus.
But I'm unsure what the code is doing: I can't find any place where a metric value is set. The return value of newMetricViews() is assigned to receiversShim.metricViews but never read again, and the defined metrics don't show up at the tempo:3200/metrics endpoint (even without this PR).
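For context, the usual OpenCensus metrics pattern looks roughly like the sketch below (the measure and view names here are made up, not taken from the Tempo code); a registered view only exports data once stats.Record is called for its measure, which would be consistent with views being registered but the metrics staying empty:

```go
import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// A measure plus a view that aggregates it. Registering the view alone
// produces no samples; data only appears once stats.Record is called.
var acceptedSpans = stats.Int64("receiver/accepted_spans", "Number of spans accepted", stats.UnitDimensionless)

var acceptedSpansView = &view.View{
	Name:        "receiver/accepted_spans",
	Measure:     acceptedSpans,
	Description: "Number of spans accepted by the receiver",
	Aggregation: view.Sum(),
}

func init() {
	if err := view.Register(acceptedSpansView); err != nil {
		panic(err)
	}
}

func recordAccepted(ctx context.Context, n int64) {
	// This is the call that actually feeds the view with data.
	stats.Record(ctx, acceptedSpans.M(n))
}
```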
@andreasgerstmayr You have some good work in this PR but there are some conflicts that need to be resolved.
Yes, unfortunately this PR gets out of date very quickly because it modifies many files. I've split off the update of the dskit dependency into a separate PR, #3915, to make this PR a bit smaller. Once the other PR is merged, I'll fix the merge conflicts of this PR again.
Nice work @andreasgerstmayr. Thanks for following up on this.
Thanks for the review & merging! :)