opentelemetry-dotnet-contrib icon indicating copy to clipboard operation
opentelemetry-dotnet-contrib copied to clipboard

Add ConfluentKafka instrumentation

Open g7ed6e opened this issue 1 year ago • 30 comments

Fixes #1485.

Changes

Add an instrumentation package for Confluent.Kafka library. Implement tracing and propagation through Kafka message headers Implement metrics

For significant contributions please make sure you have completed the following items:

  • [ ] Appropriate CHANGELOG.md updated for non-trivial changes
  • [ ] Design discussion issue #
  • [ ] Changes in public API reviewed

g7ed6e avatar Dec 12 '23 16:12 g7ed6e

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: g7ed6e / name: Guillaume Delahaye (9d0122cd9a5589e2c28986d3ad6d47456201865f, bae86e91c9215ff099be92636a0c6da1efefeb3b, b429a6f44ea1601a6a4a9dedc123053b33b462e0, 5afe9e052ed64fe50e4c1bace90c43c977a5d47e)
  • :white_check_mark: login: CodeBlanch / name: Mikel Blanchard (f3c4df8b5e88dcee7899297011b71478f98c9148, cdd2470cf59ee5fb12049ef2acf038b0f3a44040, a4fc992e507bea855489b7034b8129a8ec0c6120, 9be7c1874067bfb2082f526895b4498110a797b5, 400fdc15ccc3521bac45b6fa64f873dd402c77b9, 4256bb1fba061e0f98a720aa7ace9f3a0e794fe8)
  • :white_check_mark: login: vishweshbankwar / name: Vishwesh Bankwar (0f5bd952cb8a6977d07c4429cb8750f65cf4346c)
  • :white_check_mark: login: Kielek / name: Piotr Kiełkowicz (14ad8c839d5f3bb332dd1de807e64244c4ca810f)

On the Tracing side, would it be better if we just finished up https://github.com/confluentinc/confluent-kafka-dotnet/pull/1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

eerhardt avatar Dec 15 '23 17:12 eerhardt

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72.40%. Comparing base (71655ce) to head (14ad8c8). Report is 349 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
- Coverage   73.91%   72.40%   -1.51%     
==========================================
  Files         267      304      +37     
  Lines        9615    11318    +1703     
==========================================
+ Hits         7107     8195    +1088     
- Misses       2508     3123     +615     
Flag Coverage Δ
unittests-Exporter.Geneva 53.20% <ø> (?)
unittests-Exporter.InfluxDB 95.88% <ø> (?)
unittests-Exporter.Instana 71.24% <ø> (?)
unittests-Exporter.OneCollector 91.29% <ø> (?)
unittests-Exporter.Stackdriver 75.73% <ø> (?)
unittests-Extensions 79.33% <ø> (?)
unittests-Extensions.AWS 83.41% <ø> (?)
unittests-Extensions.Enrichment 100.00% <ø> (?)
unittests-Instrumentation.AWS 80.59% <ø> (?)
unittests-Instrumentation.AWSLambda 87.96% <ø> (?)
unittests-Instrumentation.AspNet 74.55% <ø> (?)
unittests-Instrumentation.AspNetCore 85.27% <ø> (?)
unittests-Instrumentation.ElasticsearchClient 79.87% <ø> (?)
unittests-Instrumentation.EntityFrameworkCore 55.49% <ø> (?)
unittests-Instrumentation.EventCounters 76.36% <ø> (?)
unittests-Instrumentation.GrpcNetClient 79.61% <ø> (?)
unittests-Instrumentation.Hangfire 93.58% <ø> (?)
unittests-Instrumentation.Http 81.08% <ø> (?)
unittests-Instrumentation.Owin 83.43% <ø> (?)
unittests-Instrumentation.Process 100.00% <ø> (?)
unittests-Instrumentation.Quartz 78.94% <ø> (?)
unittests-Instrumentation.Runtime 100.00% <ø> (?)
unittests-Instrumentation.SqlClient 90.90% <ø> (?)
unittests-Instrumentation.StackExchangeRedis 69.92% <ø> (?)
unittests-Instrumentation.Wcf 78.47% <ø> (?)
unittests-PersistentStorage 65.78% <ø> (?)
unittests-Resources.AWS 77.30% <ø> (?)
unittests-Resources.Azure 77.94% <ø> (?)
unittests-Resources.Container 72.41% <ø> (?)
unittests-Resources.Gcp 72.54% <ø> (?)
unittests-Resources.Host 52.87% <ø> (?)
unittests-Resources.Process 81.81% <ø> (?)
unittests-Resources.ProcessRuntime 82.35% <ø> (?)
unittests-Sampler.AWS 88.09% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 323 files with indirect coverage changes

codecov[bot] avatar Dec 15 '23 18:12 codecov[bot]

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

cijothomas avatar Dec 15 '23 18:12 cijothomas

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

g7ed6e avatar Dec 19 '23 18:12 g7ed6e

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

I am not quite sure if the efforts of maintaining a library is worth it, unless we know that the native tracing support will take a very long time. If it is coming in few months, wouldn't that be better to wait?

cijothomas avatar Dec 21 '23 18:12 cijothomas

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

I am not quite sure if the efforts of maintaining a library is worth it, unless we know that the native tracing support will take a very long time. If it is coming in few months, wouldn't that be better to wait?

@cijothomas Having a look back to the original issue https://github.com/confluentinc/confluent-kafka-dotnet/issues/1269 I notice PRs to auto instrument Kafka has been linked here https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/issues/2854. I think it could make sense to offer a per producer / per consumer optin approach that fit dotnet/aspire integration https://github.com/dotnet/aspire/pull/951 which is the primary motivation of this contribution. Notice that implementation from auto instrumentation and the current PR are not aligned but it could be easily done to offer consistency.

g7ed6e avatar Dec 21 '23 22:12 g7ed6e

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Dec 29 '23 03:12 github-actions[bot]

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

I am not quite sure if the efforts of maintaining a library is worth it, unless we know that the native tracing support will take a very long time. If it is coming in few months, wouldn't that be better to wait?

@cijothomas Having a look back to the original issue confluentinc/confluent-kafka-dotnet#1269 I notice PRs to auto instrument Kafka has been linked here open-telemetry/opentelemetry-dotnet-instrumentation#2854. I think it could make sense to offer a per producer / per consumer optin approach that fit dotnet/aspire integration dotnet/aspire#951 which is the primary motivation of this contribution. Notice that implementation from auto instrumentation and the current PR are not aligned but it could be easily done to offer consistency.

Hi @cijothomas. May I ask for a review ? What do you think about maintaining this code until official support comes out, then the library can be deprecated ?

g7ed6e avatar Jan 03 '24 07:01 g7ed6e

Thanks for submitting this. I'm reviewing mostly from the point of view of a contributor to messaging semantic conventions.

Hi @pyohannes Thank you for your review. I should be able to get back to this and incorporate your suggestions next week.

g7ed6e avatar Jan 15 '24 15:01 g7ed6e

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

I am not quite sure if the efforts of maintaining a library is worth it, unless we know that the native tracing support will take a very long time. If it is coming in few months, wouldn't that be better to wait?

@cijothomas Having a look back to the original issue confluentinc/confluent-kafka-dotnet#1269 I notice PRs to auto instrument Kafka has been linked here open-telemetry/opentelemetry-dotnet-instrumentation#2854. I think it could make sense to offer a per producer / per consumer optin approach that fit dotnet/aspire integration dotnet/aspire#951 which is the primary motivation of this contribution. Notice that implementation from auto instrumentation and the current PR are not aligned but it could be easily done to offer consistency.

Hi @cijothomas. May I ask for a review ? What do you think about maintaining this code until official support comes out, then the library can be deprecated ?

Do you know any rough timelines when the official support comes ? If it is in less than 6 months, the maintainence+deprecation overhead here won't feel like worth.. If its an year or so at least before official support comes, it is okay to keep this PR moving.

cijothomas avatar Jan 22 '24 19:01 cijothomas

On the Tracing side, would it be better if we just finished up confluentinc/confluent-kafka-dotnet#1838 (and also made a similar PR for Consumer)? That way we wouldn't need to wrap IProducer / IConsumer. The OpenTelemetry package would just enable the ActivitySource from the Kafka library.

cc @cijothomas

Yes. If that PR finishes, then there may not be even a need of an instrumentation library here, as users can just add AddSource themselves.

@eerhardt @cijothomas I agree but I think until that it would be great to get tracing support.

I am not quite sure if the efforts of maintaining a library is worth it, unless we know that the native tracing support will take a very long time. If it is coming in few months, wouldn't that be better to wait?

@cijothomas Having a look back to the original issue confluentinc/confluent-kafka-dotnet#1269 I notice PRs to auto instrument Kafka has been linked here open-telemetry/opentelemetry-dotnet-instrumentation#2854. I think it could make sense to offer a per producer / per consumer optin approach that fit dotnet/aspire integration dotnet/aspire#951 which is the primary motivation of this contribution. Notice that implementation from auto instrumentation and the current PR are not aligned but it could be easily done to offer consistency.

Hi @cijothomas. May I ask for a review ? What do you think about maintaining this code until official support comes out, then the library can be deprecated ?

Do you know any rough timelines when the official support comes ? If it is in less than 6 months, the maintainence+deprecation overhead here won't feel like worth.. If its an year or so at least before official support comes, it is okay to keep this PR moving.

I have no idea. The issue tracking this feature is https://github.com/confluentinc/confluent-kafka-dotnet/issues/1269 and is not updated. It looks like this feature is not planned yet.

g7ed6e avatar Jan 25 '24 08:01 g7ed6e

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Feb 02 '24 03:02 github-actions[bot]

@g7ed6e Is it possible for you to follow up with owners on the status of https://github.com/confluentinc/confluent-kafka-dotnet/issues/1269 once more? Having the instrumentation supported natively is the ideal scenario and many libraries are moving towards that. If the feature is not planned then we can consider adding this instrumentation here.

Note that you would need to follow the process outlined to get started. Also could you clarify who will be the owner(s) for this component.

cc: @open-telemetry/dotnet-contrib-maintainers

vishweshbankwar avatar Feb 06 '24 21:02 vishweshbankwar

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Feb 14 '24 03:02 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Feb 22 '24 03:02 github-actions[bot]

@vishweshbankwar / @open-telemetry/dotnet-contrib-maintainers

May i ask you to reopen this PR ? I'm currently still following up the status with owners.

g7ed6e avatar Mar 01 '24 09:03 g7ed6e

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Mar 09 '24 03:03 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Mar 17 '24 03:03 github-actions[bot]

@vishweshbankwar @cijothomas latest move in https://github.com/confluentinc/confluent-kafka-dotnet/issues/1269 states that tracing won't be covered. Should we reconsider adding tracing through this repo?

g7ed6e avatar May 29 '24 15:05 g7ed6e

@vishweshbankwar @cijothomas latest move in confluentinc/confluent-kafka-dotnet#1269 states that tracing won't be covered. Should we reconsider adding tracing through this repo?

@g7ed6e - Thanks for following up.

Should we reconsider adding tracing through this repo?

Sounds good to me. Assuming you would be owing this package?

vishweshbankwar avatar May 29 '24 16:05 vishweshbankwar

@vishweshbankwar @cijothomas latest move in confluentinc/confluent-kafka-dotnet#1269 states that tracing won't be covered. Should we reconsider adding tracing through this repo?

@g7ed6e - Thanks for following up.

Should we reconsider adding tracing through this repo?

Sounds good to me. Assuming you would be owing this package?

Thanks @vishweshbankwar. I rebased the branch and joined the slack otel-dotnet channel.

g7ed6e avatar May 29 '24 19:05 g7ed6e

@Kielek I'm having issues sharing services (InstrumentedConsumerBuilder<TKey, TValue>, InstrumentedProducerBuilder<TKey, TValue> and their respective named options) between IServiceProvider instances of MeterProviderBuilder and TracerProviderBuilder.. it looks like each builder has its own IServiceCollection, I'm thinking to move InstrumentedXXXBuilder<TKey, TValue> instances in static instances of ConcurrentDictionary<(string name, Type KeyType, Type ValueType), XXXBuilder<TKey, TValue>>. i will give this a try.

edit: @CodeBlanch is there a way to share services between MeterProviderBuilder and TracerProviderBuilder ?

g7ed6e avatar Jun 06 '24 15:06 g7ed6e

@g7ed6e

edit: @CodeBlanch is there a way to share services between MeterProviderBuilder and TracerProviderBuilder ?

The short answer is: No

The long answer is: Depends

For users doing...

appBuilder.Services.AddOpenTelemetry()
   .WithTracing(tracing => tracing.AddKafkaConsumerInstrumentation())
   .WithMetrics(metrics => metrics.AddKafkaConsumerInstrumentation());

The IServiceCollection + IServiceProvider will be the one tied to host and everything is shared.

For users doing...

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
  .AddKafkaConsumerInstrumentation()
  .Build();

using var meterProvider = Sdk.CreateMeterProviderBuilder()
  .AddKafkaConsumerInstrumentation()
  .Build();

They are independent.

Hopefully there will be a new API soon (https://github.com/open-telemetry/opentelemetry-dotnet/pull/5325) for that second case which will also be shared:

using var openTelemetry = OpenTelemetrySdk.Create(builder => builder
   .WithTracing(tracing => tracing.AddKafkaConsumerInstrumentation())
   .WithMetrics(metrics => metrics.AddKafkaConsumerInstrumentation()));

If you really need them to be singletons then you'll have to do something static. If having singletons is just a nice-to-have I would just use the IServiceProvider normally and let it sort out what to do.

Remember users can set up multiple pipelines:

using var tracerProvider1 = Sdk.CreateTracerProviderBuilder()
  .AddSource("SomeHighPrioritySource")
  .AddKafkaConsumerInstrumentation()
  .AddOtlpExporter(...send one place...)
  .Build();

using var tracerProvider2 = Sdk.CreateTracerProviderBuilder()
  .AddSource("SomeLowPrioritySource")
  .AddKafkaConsumerInstrumentation()
  .AddZipkinExporter(...send some other place...)
  .Build();

In which case would having static singletons actually be a problem?

CodeBlanch avatar Jun 12 '24 17:06 CodeBlanch

@g7ed6e I pushed some changes to try and get this over the finish line 😄

  • Missing MinVerTagPrefix in the .proj: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/1493/commits/f3c4df8b5e88dcee7899297011b71478f98c9148

  • Static meters/counters: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/1493/commits/a4fc992e507bea855489b7034b8129a8ec0c6120

    The way it was before users could end up with multiple meters in their process with the same name & version. That is a no-no in the OTel spec. SDK should actually be fine with it but it seemed simpler to just make them static.

    Since I was in there I also switched the measurements to use TagList so they become allocation free. Before for every measurement it was allocating an enumerator and array. Expensive!

  • Lit up the integration tests in CI: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/1493/commits/400fdc15ccc3521bac45b6fa64f873dd402c77b9

    The tests are currently failing. Would you please look into that?

@Kielek @vishweshbankwar Some concerns with the tests being added on this PR. There are no unit tests, only integration tests. Usually we have a mix of both? Not a blocker for me so long as the CI runs what is there but I think ideally there would be some unit tests too. Could be a follow-up effort.

CodeBlanch avatar Jun 25 '24 19:06 CodeBlanch

@g7ed6e I pushed some changes to try and get this over the finish line 😄

  • Missing MinVerTagPrefix in the .proj: f3c4df8
  • Static meters/counters: a4fc992 The way it was before users could end up with multiple meters in their process with the same name & version. That is a no-no in the OTel spec. SDK should actually be fine with it but it seemed simpler to just make them static. Since I was in there I also switched the measurements to use TagList so they become allocation free. Before for every measurement it was allocating an enumerator and array. Expensive!
  • Lit up the integration tests in CI: 400fdc1 The tests are currently failing. Would you please look into that?

@Kielek @vishweshbankwar Some concerns with the tests being added on this PR. There are no unit tests, only integration tests. Usually we have a mix of both? Not a blocker for me so long as the CI runs what is there but I think ideally there would be some unit tests too. Could be a follow-up effort.

@CodeBlanch I pulled your changes and pushed updates to the integration tests.

g7ed6e avatar Jun 26 '24 07:06 g7ed6e

@g7ed6e Seems like the integration tests are still having an issue: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/actions/runs/9683812113/job/26720211596?pr=1493

CodeBlanch avatar Jun 26 '24 17:06 CodeBlanch

@g7ed6e Seems like the integration tests are still having an issue: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/actions/runs/9683812113/job/26720211596?pr=1493

~~@CodeBlanch it looks like multiple Kafka containers are created. I don't face this issue locally I will check CI scripts.~~

@CodeBlanch : I fixed and ran the integration tests locally using the compose file and that's ok. It should be fine in CI too.

g7ed6e avatar Jun 26 '24 17:06 g7ed6e

I do still think we could do more to help connect traces when consuming (#1493 (comment)) but LGTM that could be done as follow-up work.

@CodeBlanch I missed that comment. I propose to implement it in a follow up PR.

g7ed6e avatar Jun 28 '24 12:06 g7ed6e

@Kielek @vishweshbankwar I think this is good to go but since it is touching a lot of the repo I would like another maintainer to approve before merging.

CodeBlanch avatar Jun 28 '24 17:06 CodeBlanch