rabbitmq-dotnet-client icon indicating copy to clipboard operation
rabbitmq-dotnet-client copied to clipboard

Use System.Diagnostics.DiagnosticSource for tracing and telemetry

Open stebet opened this issue 4 years ago • 19 comments

This is something that I have been wondering about for some time on what would be the best approach for observability.

Since debugging a library like this can be hard, I wonder what we'd want to do to make this easier.

I'd really like to suggest using System.Diagnostics.DiagnosticSource to provide parity with a lot of other .NET libraries and frameworks (both Core and Framework).

Using this library it becomes a lot easier to plug in logging frameworks and profiling tools to monitor and measure operations within the library. It also has the added benefit to have next to no overhead when no-one is listening for the events, since it is exposing ETW events (and in .NET Core, can expose event counters).

stebet avatar Mar 25 '20 11:03 stebet

I also suggest looking into the new Tracing API on Activity which should be part of .NET 5 release.

vhatsura avatar May 06 '20 19:05 vhatsura

Yeah, that looks good. A good first step would be to add basic Activity support, ETW and event counters. I'm planning to look at that soon-ish. This would be a logical next step after that.

stebet avatar May 06 '20 21:05 stebet

Hi @stebet, just wanted to follow-up on your previous note and see if you had any sorta update or a product roadmap you could share around Activity support?

jaycdave88 avatar Sep 18 '20 21:09 jaycdave88

@jaycdave88 right now the plan is to investigate System.Diagnostics.DiagnosticSource. This may or may not happen by 7.0 but can go into a 7.x release. We need to spike it first. If there are popular alternatives in the community, we would consider them. Let us know ;)

michaelklishin avatar Sep 19 '20 01:09 michaelklishin

@michaelklishin Depending on the timeline, definitely check out the new ActivitySource API. Very similar to DiagnosticSource, but it supports sampling and you can write tags/attributes on the events instead of writing raw objects (which people usually need to consume reflectively). It releases with .NET 5 (~November), but you don't need .NET 5 to use it. The NuGet package targets all actively supported .NET versions.

The main benefit to using ActivitySource is it means instrumentation libraries like OpenTelemetry will work without needing to build an adapter or anything.

Also happy to help with this if you are open to it.

CodeBlanch avatar Sep 19 '20 03:09 CodeBlanch

If ActivitySource will support all then/currently supported versions of .NET, we should consider it. A proof-of-concept PR would be appreciated (I assume there are preview releases of some kind that developers can try?)

michaelklishin avatar Sep 20 '20 18:09 michaelklishin

ActivitySource should be the correct one to look at. I spiked this myself a few months back, time to revisit :)

What would people consider good activities/counters to start with?

Here are a few rough ideas:

  • Bytes sent
  • Bytes received
  • Frames deserialized
  • Frames serialized
  • Total Connections opened
  • Connections currently open
  • Consumers active
  • Channels Open
  • Total channels opened
  • ???

stebet avatar Sep 21 '20 14:09 stebet

If ActivitySource will support all then/currently supported versions of .NET, we should consider it. A proof-of-concept PR would be appreciated (I assume there are preview releases of some kind that developers can try?)

ActivitySource/Activity are part of the nuget package https://www.nuget.org/packages/System.Diagnostics.DiagnosticSource/5.0.0-rc.1.20451.14. It can be used right now with preview releases. An example on using ActivitySource is available here in OpenTelemetry repo. https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/examples/Console/TestConsoleExporter.cs#L41-L94

cijothomas avatar Sep 21 '20 23:09 cijothomas

ActivitySource should be the correct one to look at. I spiked this myself a few months back, time to revisit :)

What would people consider good activities/counters to start with?

Here are a few rough ideas:

  • Bytes sent
  • Bytes received
  • Frames deserialized
  • Frames serialized
  • Total Connections opened
  • Connections currently open
  • Consumers active
  • Channels Open
  • Total channels opened
  • ???

I think something likeFrames serializing should be considered, the telemetry instruments may want to inject tracing header into the frame before it serialized to send.

snakorse avatar Nov 19 '20 23:11 snakorse

Just to clarify - ActivitySource/Activity API is for tracing , and not for metrics. Things like BytesSent, etc are metric, which require a metric API. .NET has EventCounter API to support this already, but it is a very limited API. There is a new Metric API (Opentelemetry compatible), coming with .NET 6 (Nov 2021), which will support all the old versions of the .NET itself. (https://github.com/open-telemetry/opentelemetry-dotnet/issues/1501)

cijothomas avatar Nov 20 '20 00:11 cijothomas

@cijothomas Yep. I'm aware of the differences. We're looking to integrate both the new dotnet counters (for metrics) as well as support for Activity Tracing. Tracing will come later since the OpenTelemetry is currently being worked on but I'm watching it closely :)

stebet avatar Nov 20 '20 13:11 stebet

@stebet Thanks! To instrument, you won't need OpenTelemetry dependency, right? You can use 5.0.0 of DiagnosticSource. (Also OpenTelemetry will release 1.0.0 in <2 weeks)

Please also check the Metric plans in .NET 6 as well.

cijothomas avatar Nov 20 '20 16:11 cijothomas

@stebet Thanks! To instrument, you won't need OpenTelemetry dependency, right? You can use 5.0.0 of DiagnosticSource. (Also OpenTelemetry will release 1.0.0 in <2 weeks)

Please also check the Metric plans in .NET 6 as well.

Thanks for the info :)

stebet avatar Nov 20 '20 17:11 stebet

Also consider adopting the recommended tags and naming conventions from OpenTelemetry.

And if you need an example of dealing with ActivitySource, I added support for the Trace Context standard and activity source for NServiceBus:

https://github.com/jbogard/NServiceBus.Extensions.Diagnostics/blob/master/src/NServiceBus.Extensions.Diagnostics/IncomingPhysicalMessageDiagnostics.cs

https://github.com/jbogard/NServiceBus.Extensions.Diagnostics/blob/master/src/NServiceBus.Extensions.Diagnostics/OutgoingPhysicalMessageDiagnostics.cs

https://github.com/jbogard/NServiceBus.Extensions.Diagnostics/blob/master/src/NServiceBus.Extensions.Diagnostics/SettingsActivityEnricher.cs

jbogard avatar Nov 30 '20 20:11 jbogard

ActivitySource should be the correct one to look at. I spiked this myself a few months back, time to revisit :)

What would people consider good activities/counters to start with?

Here are a few rough ideas:

  • Bytes sent
  • Bytes received
  • Frames deserialized
  • Frames serialized
  • Total Connections opened
  • Connections currently open
  • Consumers active
  • Channels Open
  • Total channels opened
  • ???

I wanted to play a bit with it and found that the PollingCounters are only NS2.1 / .net core 3.0. I guess that means we can't do it for NS.2.0 targets or am I missing something?

bollhals avatar Mar 08 '21 22:03 bollhals

Let's make a distinction between activities and counters since they aren't the same thing (although related).

There are Event counters in NS2.0 targets, but it's a lot less capable. It'd be fine by me if we only had counters for NS2.1+ and skip them for NS2.0, which is on the way out eventually anyway.

Activities can work for both targets.

stebet avatar Mar 09 '21 08:03 stebet

Shall I create a new one for the counters, so we don't mix them up in here?

There are Event counters in NS2.0 targets, but it's a lot less capable.

Are there? I wasn't able to find any useable classes?

It'd be fine by me if we only had counters for NS2.1+ and skip them for NS2.0, which is on the way out eventually anyway.

Works for me 👍

bollhals avatar Mar 09 '21 12:03 bollhals

When using OpenTelemetry for tracing across services, the span context needs to be channeled over the wire. Will this issue contain the encoding (while publishing) and decoding (while receiving) from and to the .NET Activity framework be included here or is this expected to be done somewhere else? To my limited insights, I do not see a hookup infrastructure where a third party could include a middleware into the bare RabbitMQ.Client when sending/receiving messages.

I am currently building a middleware layer on top of RabbitMQ and wonder, whether I should write a "default" middleware doing the OpenTelemetry decoding.

The code how this OpenTelemetry span context propagation works without support from the RabbitMQ.Client is described in this article. The code fragments in the article also show how it works transparently for HTTP.

tthiery avatar Jun 13 '21 08:06 tthiery

We similarly have a wrapper around RabbitMQ.Client we've been using for a few years (that exposes consumers/exchanges as ISource/TargetBlock<AmqpMessage>) which we instrumented for OpenTracing and are looking at moving to OpenTelemetry/Activity.

We currently extract/inject trace context into message headers on consume/publish but it would be nice if the base client library handled that as well as the Activity creation.

Where I'm uncertain is how to handle the length of the Activity (we only use Consumer based flows, but manual Get would likely need to be different?). There is the period from Prefetch to the Consumer implementation finishing the event/callback. The "processing" of the message could either be inside that span, resulting in an ack/nack before returning from the Consumer event/callback or, as in our framework, the AmqpMessage could be placed in a client side queue (ex. BufferBlock) and processed potentially after the Consumer event/callback is completed (ex. by an attached ActionBlock). (We stash the Channel in the AmqpMessage headers so we can ack/nack the message when processing is completed).

I suppose that the client side queue could be modeled in OpenTelemetry as another Consumer-type span with a parent as the Consumer event/callback span that would be completed as soon as the event/callback span has placed the message in the in-process queue.

rwkarg avatar Jun 15 '21 18:06 rwkarg

Is this issue well understood? Are there any design blockers getting this through?

davidfowl avatar Oct 14 '23 18:10 davidfowl

@davidfowl it has been over two years since it was last discussed. So no, it is not well defined.

michaelklishin avatar Oct 14 '23 20:10 michaelklishin

I've been following this so far https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/examples/MicroserviceExample/Utils/Messaging

It's an example on the open telemetry dotnet repo that uses RabbitMQ.Client and adds the appropriate activities to the producer and consumer.

If we start with this as the proposal, what issues would we run into?

davidfowl avatar Oct 14 '23 20:10 davidfowl

I think it may be worth to share with you the OpenTelemetry semantic conventions (they are experimental, but there is nothing better) for telemetry produced for RabbitMQ:

  • https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/
  • https://opentelemetry.io/docs/specs/semconv/messaging/rabbitmq/

pellared avatar Oct 14 '23 21:10 pellared

That question can only be answered by trying it but this looks like an example, not something that can or should be taken straight into production.

This example has some curious decisions around how it uses this client. Instead of inheriting from/extending EventingBasicConsumer it uses a "helper", channel opening and queue declaration are lumped together, and so on.

I assume that everything around how they use activities is reasonable. The question for library maintains use how do you integrate that in a way that's not too restrictive and does not push OpenTelemetry down everyone's throat.

michaelklishin avatar Oct 14 '23 21:10 michaelklishin

I'd look for clients for other popular data services (PostgreSQL, Redis, ElasticSearch, etcd, et cetera) to learn from, these examples only demonstrate some key OpenTelemetry API elements (in C#).

michaelklishin avatar Oct 14 '23 21:10 michaelklishin

@pellared that's helpful, thanks. Now someone who needs this feature should be able to come up with an API update proposal for this client, which won't be trivial. Team RabbitMQ likely won't get to work on this any time soon but this is open source software.

michaelklishin avatar Oct 14 '23 21:10 michaelklishin

This can be contributed, it’s extremely important to get our core ecosystem libraries instrumented. Let’s get on the same page about how to accomplish this.

The way .NET integrates with open telemetry allows library authors to build support without direct dependencies on otel. Even though the semantic conventions are not final, you want to track and use them. This is what we are doing in .NET and making changes as they change .

davidfowl avatar Oct 14 '23 22:10 davidfowl

The other alternative would be to build it an instrumentation library on top, but, that’s not ideal and you’d need the appropriate hooks in order to do it from the outside of the library.

This would be a point in time solution. This isn’t the way we want libraries to integrate with our telemetry systems.

davidfowl avatar Oct 14 '23 22:10 davidfowl

This is already in progress for the most part here: https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1261

Had some comments from @lmolkova as well that I need to address. I'll take a look at this again soon. Would love more feedback as well.

Some attributes need to be updated to be in line with the standards (it's been awhile since I last synced), but this is working for the most part. CC: @davidfowl @michaelklishin

stebet avatar Oct 18 '23 11:10 stebet

Also, implementing it like this via. ActivitySource shouldn't require any API changes. A good PR after the initial implentation would be to add OTel Metrics/Logs support as well.

stebet avatar Oct 18 '23 11:10 stebet