
Trace propagation with `Traceparent` and otel tracer doesn't work

wperron opened this issue 3 years ago · 6 comments

Describe the bug

When using the experimental `-use-otel-trace=true` flag, traces aren't propagated correctly, even when the client sends the `Traceparent` header in the request. I believe the problem comes from the `ExtractTraceID` function in the weaveworks common lib, which looks for a Jaeger SpanContext in the request Context, but I'm not 100% sure.

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo v1.3.0
  2. Send a request to any API (I used /api/search and /api/traces/{traceID} when testing) along with a valid `Traceparent` header. Make sure your client script is exporting traces to Tempo as well.
  3. Grab the trace ID sent by the client and look for it in Tempo.

Expected behavior

I would expect that enabling the OpenTelemetry tracer would automatically enable trace propagation following the Trace Context specification.
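For context, the `Traceparent` header defined by the W3C Trace Context specification has the form `version-trace_id-parent_id-flags`, all lowercase hex. A minimal stdlib-only sketch that builds and validates such a header (the field names follow the spec; the helper names are my own, for illustration):

```python
import re
import secrets

# W3C Trace Context: version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a version-00 traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    parent_id = parent_id or secrets.token_hex(8)  # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(value):
    """Return the header's fields as a dict, or None if it is malformed."""
    m = TRACEPARENT_RE.match(value)
    return m.groupdict() if m else None

header = make_traceparent()
fields = parse_traceparent(header)
```

A server that honors the spec would extract `trace_id` and `parent_id` from this header and parent its own spans accordingly, which is the behavior the bug report says is missing.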

wperron avatar Feb 22 '22 18:02 wperron

Is there any ongoing work related to this?

knfoo avatar May 10 '22 10:05 knfoo

None that I'm aware of. We continue to use the Jaeger clients internally.

These, however, have been deprecated and we intend to move to OTel clients at some point.

joe-elliott avatar May 16 '22 13:05 joe-elliott

Is this still unresolved or does a workaround exist?

I have a client app sending traces to Grafana Cloud and making HTTP calls to a server API, which is also sending its own traces to the same Grafana Cloud Tempo trace data source. It would be great to show the server spans inside the caller's trace! Isn't there any way to achieve this with the OpenTelemetry SDK and Grafana Cloud Tempo? Honestly, I'm surprised that those traces are not linked or linkable, considering that they live in the same data source (while distributed tracing aims to propagate context even across different tracing systems).

giuliohome avatar Aug 25 '22 22:08 giuliohome

Yes, this is working!

In particular, you can adapt it to detached spans too, considering this.

Client side, where the first root span is started and must be the parent:

>>> # assumes an initialized tracer, e.g. tracer = trace.get_tracer(__name__)
>>> from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
>>> prop = TraceContextTextMapPropagator()
>>> carrier = {}
>>> a = tracer.start_span('first')
>>> a.context
SpanContext(trace_id=0x94f61d7bd16071fb6d826ccfefec5c83, span_id=0x6947945713d7c017, trace_flags=0x01, trace_state=[], is_remote=False)
>>> from opentelemetry.trace.propagation import set_span_in_context
>>> ctx = set_span_in_context(a)
>>> ctx
{'current-span-6e6e35f7-bc0b-4ec9-84f0-7af5b96ec769': _Span(name="first", context=SpanContext(trace_id=0x94f61d7bd16071fb6d826ccfefec5c83, span_id=0x6947945713d7c017, trace_flags=0x01, trace_state=[], is_remote=False))}
>>> prop.inject(carrier=carrier,context=ctx)
>>> carrier
{'traceparent': '00-94f61d7bd16071fb6d826ccfefec5c83-6947945713d7c017-01'}

Server side, where the second root span is started and must be the child:

>>> # the server side uses the same TraceContextTextMapPropagator as the client
>>> ctx = prop.extract(carrier=carrier)
>>> ctx
{'current-span-6e6e35f7-bc0b-4ec9-84f0-7af5b96ec769': NonRecordingSpan(SpanContext(trace_id=0x94f61d7bd16071fb6d826ccfefec5c83, span_id=0x6947945713d7c017, trace_flags=0x01, trace_state=[], is_remote=True))}
>>> b=tracer.start_span('second',context=ctx)
>>> b.end()
>>> {
    "name": "second",
    "context": {
        "trace_id": "0x94f61d7bd16071fb6d826ccfefec5c83",
        "span_id": "0x45f7b33030725702",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x6947945713d7c017",
    "start_time": "2022-08-26T03:13:18.598479Z",
    "end_time": "2022-08-26T03:13:23.023070Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "service.name": "service-pyshell"
    }
}

The magic begins: from the above you get a <root span not yet received> in the Grafana Cloud trace search!

>>> a.end()
>>> {
    "name": "first",
    "context": {
        "trace_id": "0x94f61d7bd16071fb6d826ccfefec5c83",
        "span_id": "0x6947945713d7c017",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2022-08-26T03:11:19.586685Z",
    "end_time": "2022-08-26T03:13:54.359962Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {},
    "events": [],
    "links": [],
    "resource": {
        "service.name": "service-pyshell"
    }
}

The magic is done! You'll find that second is a child of first in the Grafana Cloud trace view!
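The round trip above can be simulated in one stdlib-only script: a "client" injects its span context into a carrier dict as a `traceparent` header, and a "server" extracts it, so the child span keeps the trace_id and uses the remote span_id as its parent_id. This paraphrases what the OTel propagator does; everything except the header format itself is illustrative:

```python
import secrets

def inject(trace_id, span_id, carrier):
    # Client side: write the span context into the carrier,
    # the way a Trace Context propagator fills HTTP headers.
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(carrier):
    # Server side: recover (trace_id, parent_span_id) from the header.
    _version, trace_id, parent_id, _flags = carrier["traceparent"].split("-")
    return trace_id, parent_id

# "first" (client) span context
trace_id = secrets.token_hex(16)
first_span_id = secrets.token_hex(8)

carrier = {}
inject(trace_id, first_span_id, carrier)

# "second" (server) span: same trace, parented on the remote span,
# matching the parent_id seen in the JSON dump above
remote_trace_id, parent_id = extract(carrier)
second = {
    "trace_id": remote_trace_id,
    "span_id": secrets.token_hex(8),
    "parent_id": parent_id,
}
```

Because both spans share a trace_id and the child carries the remote parent_id, any backend that indexes them (Tempo included) can stitch them into one trace.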

giuliohome avatar Aug 26 '22 03:08 giuliohome

Nice work digging into this! I believe this issue is about Tempo itself propagating traces when using the otel tracer. OTel tracing ingested by Tempo and produced by other applications should be fine.

joe-elliott avatar Aug 26 '22 15:08 joe-elliott

> OTel tracing ingested by Tempo and produced by other applications should be fine.

@joe-elliott yes, you're right, but that was not completely obvious or trivial to me, so I wanted to understand it well by implementing the whole thing myself. Sorry for being a little off topic here, but in the end it is a confirmation that your quoted sentence above is indeed true. 😉

giuliohome avatar Aug 26 '22 17:08 giuliohome

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Nov 16 '22 00:11 github-actions[bot]