confluent-kafka-dotnet Distributed tracing story with Confluent Kafka client

Greetings!

I'm using OpenTelemetry to instrument an application, primarily to watch the performance of all its dependencies and find bottlenecks. OpenTelemetry is automatically capturing Http requests coming into my system and then tracing outgoing Http, SQL, & Redis calls made while doing the processing. All of that data is exported into one of the many tools OpenTelemetry supports for visualization. What I would like to do is mod the Confluent Kafka client to also participate in that process.

The easiest way to do that would be to use DiagnosticSource to allow other code in the process to subscribe to events that are fired as messages are produced or consumed. DiagnosticSource only fires when there is a subscriber so the cost to existing users not using the events would be ~0.

If we are open to this, I can take a stab at a PR, documentation, etc., but I wanted to check first.

/cc @SergeyKanzhelev @MikeGoldsmith @cijothomas

May 01 '20 04:05 CodeBlanch

CC: @davidfowl and @tarekgh

May 04 '20 21:05 SergeyKanzhelev

librdkafka, the native library on which Confluent.Kafka is based, implements KIP-42, which is an interceptor API for the producer and consumer. This is the natural place to add this I think - and you can do it completely independently of this project (but let us know!), but you'd need to write it in C. As an added bonus, this interceptor would work with all librdkafka bindings, including the confluent python and go clients.

Supporting DiagnosticSource may be an option, I haven't thought about it. If the change is small and has no performance impact I'm open to looking at it.

May 05 '20 00:05 mhowlett

One problem of implementing logic in native library is the need to exchange some information from managed code to the native library. Starting with TraceId and SpanId of a currently running Span, following the currently configured sampling rate, exporter and span processors. From what I seen before it's often much harder to set up this managed to native and back communication than implement a language specific solutions.

What level of details will be lost if instrumentation will be based on pure managed code using DiagnosticSource?

May 06 '20 05:05 SergeyKanzhelev

ahh right, yes, completely agree.

I don't see any details that would be lost, based on information presented in the kip.

May 06 '20 16:05 mhowlett

I'll take a stab at what DiagnosticSource integration might look like if no one objects. But I have a bit of a backlog so it might not be for a few days. If anyone else wants to jump on it, by all means :)

May 06 '20 16:05 CodeBlanch

thanks! yes, backlog at this end too. generally, straightforward changes with no controversy are easy to get merged, but changes beyond that are competing with prioritized tasks for time.

May 06 '20 16:05 mhowlett

@mhowlett The one big challenge I can foresee is the trace context. When producing a message, we would spin up an Activity. When consuming a message, we need to spin up an Activity with the same context that the producer had. In order to pull that off, we need to persist some data on the message. W3C Trace Context which basically comes down to 2 strings: traceparent & tracestate. So I'm thinking we would have to write those on the message header. Only if someone is subscribed to the DiagnosticSource, which is hinting they want the telemetry. How do you feel about that?

@SergeyKanzhelev please correct me if this is the wrong approach to take.

May 06 '20 16:05 CodeBlanch

it sounds like something that requires some thought. my instinct is that we don't want to hard code the message header decisions in the client - we probably want a general plugin API for it, and probably configuration via a builder class method (+ a W3C Trace Context impl. provided in-the-box). this capability could also be built in a class that wraps the producer / consumer of course. but it does seem generally useful enough that we'd like it in the client.

note: there is some chance we'll have some bandwidth to think about this in September, we have some related work planned I believe.

May 06 '20 17:05 mhowlett

@CodeBlanch yes, just store two strings in metadata would be enough. We started working on AMQP in W3C group: https://github.com/w3c/trace-context-amqp but it is only in proposal state now. I will try to push it forward and perhaps we can validate how it works with kafka. (I believe kafka is almost AMQP from the message structure view, right?)

May 06 '20 19:05 SergeyKanzhelev

@mhowlett I threw up PR #1278 with the Producer side of things so you could check it out. DiagnosticSource generally implements very cleanly, meaning it doesn't add a whole lot of ceremony on top of 'real work' code.

Right now I have the traceparent & tracestate headers hard-coded. They are only added/sent when there is someone subscribed to the DiagnosticSource. But it seems perfectly reasonable to make them configurable, happy to do that if you want. I was thinking maybe a prefix for "system" properties might be nice? I know on the Azure ServiceBus there are two "headers"-esque props on the message, SystemProperties & UserProperties, which lend themselves nicely to something like this IMO.

May 09 '20 17:05 CodeBlanch

FYI, there's an experimental page in the OpenTelemetry specification for conventions to use with messaging There's also some useful bits in the OpenTelemetry .Net library for attaching and propagating tracing context and baggage across activities.

Mar 25 '21 21:03 EricStG

@CodeBlanch Is your PR still under consideration for merging.

May 05 '21 10:05 ElectricVampire

@ElectricVampire I just closed it. Better to do it using ActivitySource now. That was released with .NET 5, but is actually supported all the way back to .NET 4.5 using the NuGet: https://www.nuget.org/packages/System.Diagnostics.DiagnosticSource/.

May 05 '21 16:05 CodeBlanch

Hi all, So, will it be there any efforts to provide instrumentation for Confluent.Kafka ? In fact it's relatively simple to implement it manually using OpenTelementry Tracer class, but it would be definately better to have it out of the box.

Also, maybe it worth to contribute the code to OpenTelemetry project ? i.e. https://github.com/open-telemetry/opentelemetry-dotnet-contrib

May 11 '21 19:05 eshepelyuk

Are there any plans to implement this feature?

Aug 03 '21 09:08 MoienTajik

Maybe there are assumptions about when this feature can be implemented?

@mhowlett

Nov 11 '21 19:11 chertby

I am also interested in this piece of content, if anyone has submitted any PR, please submit a reply.

Apr 01 '22 06:04 wajika

Please let me know once this feature is implemented.

May 13 '22 07:05 vpalukuri05

Hey everyone, JFYI I created a PR #1838 with a proposal for the Producer instrumentation.

Jun 14 '22 18:06 fedeoliv

Hello, It seems that @fedeoliv's PR has been abandoned. Are there plans to add this feature? In our company, different teams are creating their own extensions, but it would be much better to use the standard approach.

Thanks in advance!

Jul 14 '23 10:07 inech

confluent-kafka-dotnet confluent-kafka-dotnet copied to clipboard

Distributed tracing story with Confluent Kafka client

confluent-kafka-dotnet
confluent-kafka-dotnet copied to clipboard