opentelemetry-dotnet-contrib icon indicating copy to clipboard operation
opentelemetry-dotnet-contrib copied to clipboard

OpenTelemetry .NET Tracing AspNetCore instrumentation is inefficient

Open dpk83 opened this issue 2 years ago • 20 comments

Feature Request

It's more of an issue than a feature request. We care a lot about performance for our services and in our local benchmarking we found that AspNetCoreInstrumentation for tracing is extremely inefficient. In a simple AspNetCore application with an extremely simple rest endpoint enabling tracing reduces the service throughput by around 15% which is huge for serving telemetry. (Drops from 10K RPS to 8.6K RPS)

Is your feature request related to a problem?

Yes, telemetry collection should be high performance and shouldn't impact service throughput.

Describe the solution you'd like:

Improve the performance of OTel instrumentation so it's not impacting service throughput.

Describe alternatives you've considered.

A custom middleware which is only using resources for processing what's essential on top of what AspNetCore out of box tracing instrumentation already provides.

Additional Context

The code for application used to measure the throughput and the results can be found here https://github.com/dpk83/otelperfimpactanalysis/tree/main/TelemetryPerfImpactAnalysis

dpk83 avatar Feb 22 '23 17:02 dpk83

To add:

  • This benchmarking is done without any exporter (Adding exporter will likely reduce the throughput even further)
  • The requests also didn't have any baggage. I suspect requests with baggage will likely reduce the throughput even further.

dpk83 avatar Feb 22 '23 17:02 dpk83

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Instrumentation/AspNetCoreInstrumentationBenchmarks.cs

The benchmarks above show a totally different story. The proper way to test overall web app throughput reduction is likely by using bombardier similar tools which simulates real scenarios. Maybe add new perf tests using new approach first, and keep benchmark.net for micro optimizations/measurements only.

cijothomas avatar Feb 22 '23 17:02 cijothomas

@dpk83 there's a couple of point here that I think are interesting, I've done very specific analysis on this along side the benchmarks in the repo.

Your analysis of the middleware you added seems to suggest that it's a like-for-like replacement, which it is not. There is much more done to create a full activity, including things like trace propagation and also filling out additional properties too. That's not to say that it isn't valid that there should be a smaller instrumentation for those that want it, just that it's not a like-for-like comparison.

Your middleware approach does nothing with the Activity if I'm looking at it correctly. This means that, ultimately, there is no "work" being done, and therefore all you're testing is that you can add a tag and that it doesn't affect the load. I would highly recommend showing this as a full implementation. Specifically, use .AddSource("'Microsoft.AspNet") with the sdk instead of your ActivityListener to get a closer comparison as that would show whether you're saying it's OpenTelemetry, or specifically the Instrumentation as you suggest.

Baggage will cause some additional load as it has to be copied and put onto the outbound requests. This is more to do with Propagation than it is to do with AspNetCore instrumentation.

Exporters, as long as they're using Batch, shouldn't affect inline performance, but will increase memory usage.

I think "extremely" in the title here is incendiary and not required. This is quite an aggressive post, I'm hoping that it's coming from a place of frustration and not a criticism. If there are some optimizations you can see to the way it works, then I'd recommend working with the team to get those optimizations made for the benefit of everyone.

Telemetry is always additional processing, that's something that everyone needs to understand, and therefore will always cause some degree of degradation in throughput/latency and/or increase in memory. There's a trade off here, do you fly blind without tracing so you can have a smaller amount of servers running your application? or take the hit of adding an additional server and/or scaling up so you can have the telemetry.

martinjt avatar Mar 03 '23 14:03 martinjt

There is much more done to create a full activity, including things like trace propagation and also filling out additional properties too. That's not to say that it isn't valid that there should be a smaller instrumentation for those that want it, just that it's not a like-for-like comparison. Your middleware approach does nothing with the Activity if I'm looking at it correctly. This means that, ultimately, there is no "work" being done, and therefore all you're testing is that you can add a tag and that it doesn't affect the load. I would highly recommend showing this as a full implementation. Specifically, use .AddSource("'Microsoft.AspNet") with the sdk instead of your ActivityListener to get a closer comparison as that would show whether you're saying it's OpenTelemetry, or specifically the Instrumentation as you suggest.

In the current benchmarking none of the additional functionality beyond what's already provided by AspNetCore is used, so it is actually a like-for-like comparison for the required functionality and I don't expect OpenTelemetry to have additional overhead on top of it. I intentionally kept is simple because that's going to be the case for majority of the cases where baggage isn't modified or the tracing isn't going through multiple different vendor implementations. Also, leave the middleware aside, just comparing the throughput for binary scenarios i.e. tracing disabled vs tracing enabled itself is sufficient to the message I am trying to put forth.

Baggage will cause some additional load as it has to be copied and put onto the outbound requests. This is more to do with Propagation than it is to do with AspNetCore instrumentation.

Yeah, I intentionally left out baggage in the benchmarking because I wanted to make a point here for the very basic functionality where I don't expect any additional overhead on top of AspNetCore. Regarding baggage it will cause further reduction in throughput. The way currently baggage part is designed in the library can definitely be improved further. For our cases >95% of the cases will not modify the baggage as the request is passing through the system, so transforming the baggage header into the object and then retransforming it back to it's way out is clearly an unnecessary overhead. Such processing should be moved to more on-demand basis (i.e. on the first such action where it's required). I also believe there are additional opportunities to extract more performance out of it even for cases where baggage needs to be transformed and used.

I think "extremely" in the title here is incendiary and not required. This is quite an aggressive post,

That's fair, I will remove the extremely word. But to me telemetry should have minimal overhead and this is actually quite a lot of overhead.

Telemetry is always additional processing, that's something that everyone needs to understand, and therefore will always cause some degree of degradation in throughput/latency and/or increase in memory. There's a trade off here, do you fly blind without tracing so you can have a smaller amount of servers running your application? or take the hit of adding an additional server and/or scaling up so you can have the telemetry.

Yes it's additional processing but that additional processing should be minimal so my services can utilize the system's resources efficiently for the actual business logic. Why would one use OpenTelemetry .NET library if they are able to achieve the same result with way lower processing and cost implications by writing their own solution or some other solution? For smaller services or adhoc usage it's fine but for large complex services which are processing millions and billions of requests a day telemetry costs isn't insignificant and matters a lot.

dpk83 avatar Mar 07 '23 21:03 dpk83

Adding this issue to the Instrumentation-1.0.0 milestone - we cannot release 1.0 until the perf is acceptable (and prove that with actual numbers).

Few things:

  1. Decide how to measure performance for instrumentations.
    • The existing benchmarks approach https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/Benchmarks/Instrumentation/AspNetCoreInstrumentationBenchmarks.cs
    • New approach using similar techniques used by @dpk83.
    • Have both. (I personally prefer this)
  2. Decide the acceptable perf overhead. We could find the theoretical max (by doing a custom middleware populating activities with right tags etc.) and add OTel tax.
  3. Do the actual fixes to improve perf. @vishweshbankwar has done a lot in this area in the past, but pretty sure we can do more.
    • We iterate through activity tag (more than once) to find things. Might be avoidable.
    • The Propagator might be doing duplicate works. We need to identify where the perf cost is coming from. Are they from the DS callbacks fired by asp.net core itself? OR in our instrumentation code? Or OTel SDK (It can't be otel sdk, as benchmarks show SDK perf is very good).

@dpk83 would you have time to send a PR to this repo with perf test instructions using bombardier? Its okay to be manual at first, we can see how to automate it to orchestrate and run and show results, just like benchmark tool.

cijothomas avatar Mar 22 '23 18:03 cijothomas

@dpk83 There is a lot of work that has to be done in the AspNetCoreInstrumention to support OTel baggage because .NET doesn't do it.

The requests also didn't have any baggage. I suspect requests with baggage will likely reduce the throughput even further.

If you aren't using baggage, try doing this...

// When the default propagator is just w3c we respect the work AspNetCore does instead of calling the SDK logic.
// See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/e9226cfadf3eb4b58fb8c087c61291a3569ccea3/src/OpenTelemetry.Instrumentation.AspNetCore/Implementation/HttpInListener.cs#L133
Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

That should make the OTel perf more in line with your middleware. It is still doing more work, so it will be more of a hit, but it should help a lot.

If you do want to use OTel baggage in your services, fix your middleware because it makes your tests invalid 😉

CodeBlanch avatar Mar 22 '23 19:03 CodeBlanch

If you aren't using baggage, try doing this... // When the default propagator is just w3c we respect the work AspNetCore does instead of calling the SDK logic. // See: https://github.com/open-telemetry/opentelemetry-dotnet/blob/e9226cfadf3eb4b58fb8c087c61291a3569ccea3/src/OpenTelemetry.Instrumentation.AspNetCore/Implementation/HttpInListener.cs#L133 Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

This is good to know but I think there are opportunities to improve the baggage handling as well.

  • First creating it on demand as opposed to always (i.e. create it only if an API that's interacting with baggage is invoked by some operation)
  • Same goes for the outgoing side where if the baggage was never used/modified that you can avoid all the creation/serialization stuff and just pass the header from incoming request as is.
  • Even when baggage is created briefly looking at the code it looks like there is room for improvements and treating baggage as a string might actually get you better perf, for majority of the audience (assuming even for the cases where baggage is modified the number of entries added/removed are low)

dpk83 avatar Mar 22 '23 21:03 dpk83

Great ideas to improve baggage! Mind spinning up an issue to track that?

For the case here where baggage isn't being used (probably very common case) what might be more immediately beneficial is for OTel .NET to use the .NET propagation API (somehow) to avoid the duplicate extraction/injection and activity creation. IIRC that is one thing @vishweshbankwar was looking into.

CodeBlanch avatar Mar 22 '23 22:03 CodeBlanch

Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

@CodeBlanch I tried this and it's actually causing a different activity everytime. All the exported chained activities have different traceId.

dpk83 avatar Mar 22 '23 23:03 dpk83

Sdk.SetDefaultTextMapPropagator(new TraceContextPropagator());

@CodeBlanch I tried this and it's actually causing a different activity everytime. All the exported chained activities have different traceId.

Please open a new issue with this one, so we dont add noise to this particular issue to track performance improvements.

cijothomas avatar Mar 23 '23 00:03 cijothomas

Created open-telemetry/opentelemetry-dotnet#4326

dpk83 avatar Mar 23 '23 00:03 dpk83

This is an important issue so checking for the status here. It is on track for getting fixed in the upcoming stable OpenTelemetry release, right?

dpk83 avatar May 04 '23 16:05 dpk83

This is an important issue so checking for the status here. It is on track for getting fixed in the upcoming stable OpenTelemetry release, right?

If you are referring to the 1.5 Stable Release (that is the next stable), then no. This issue is not planned for that milestone.

This is something to be addressed definitely for the Stable release milestone. No firm ETA on that yet.

cijothomas avatar May 05 '23 15:05 cijothomas

@reyang This is an important ask and I was expecting this to be addressed in 1.5 stable release. We have been pushing services across our org to adopt distributed tracing and the perf penalty here is unacceptable. We definitely want to avoid creating our own solution here because OpenTelemetry didn't provide the performant solution.

dpk83 avatar May 05 '23 16:05 dpk83

@reyang This is an important ask and I was expecting this to be addressed in 1.5 stable release. We have been pushing services across our org to adopt distributed tracing and the perf penalty here is unacceptable. We definitely want to avoid creating our own solution here because OpenTelemetry didn't provide the performant solution.

Sorry I'm confused, I think we all agree that this is part of the stable release. OpenTelemetry .NET 1.5 milestone is not planning to release a stable version of OpenTelemetry.Instrumentation.AspNetCore. Where is the expectation (that 1.5 release will ship stable OpenTelemetry.Instrumentation.AspNetCore and perf improvement will happen right now) coming from?

reyang avatar May 05 '23 16:05 reyang

Oh so you are saying this will only be available when AspNetCore instrumentation becomes stable. What are the timelines for that? Based on the timelines we will have to evaluate whether we can wait or we need to switch gears

dpk83 avatar May 05 '23 21:05 dpk83

Oh so you are saying this will only be available when AspNetCore instrumentation becomes stable.

Where is this coming from? I didn't say that.

reyang avatar May 05 '23 23:05 reyang

Okay looks like there is some problem on my side and I am not able to understand the statements properly.

Let’s simplify things: Could you please share the ETA when a stable release with the fix for this issue is available?

dpk83 avatar May 06 '23 00:05 dpk83

@cijothomas already replied https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1745 that we don't have the ETA yet.

reyang avatar May 06 '23 00:05 reyang

@dpk83 : Sharing an update here based on my investigation

Regarding the issue about custom Middleware being more efficient than instrumentation library, I think this is not accurate. The sample shared is not 1:1 comparison with the instrumentation library.

I ran some tests and as per my investigation, custom Middleware has slightly slower throughput compared to instrumentation library. Please see: https://github.com/open-telemetry/opentelemetry-dotnet/pull/5163

That being said there are few areas of improvements we can further consider, some of them shared in this thread above as well

  • Allow selecting particular set of tags based on user requirements. Setting fewer tags on activity could help with perf.
  • Propagation: We are currently doing this twice due to difference in Propagator APIs between OTel and .NET. This is another area where we could save some time. We are looking to get this fixed when the instrumentation is supported natively in ASP.NET Core.
  • Look for opportunities to further reduce the allocation cost associated with tag values. We have done couple of improvements in the past months but there could be more.
  • Investigate if cost associated with activity creation can be reduced in System.Diagnostics.DiagnosticSource.
  • Sampling: This needs no change in instrumentation but is another way to reduce cost associated with activity creation/enrichment.

We plan to continue the discussions/efforts on improving perf and also work with .NET team for bringing support for trace instrumentation natively in .NET9.0. ref: https://github.com/dotnet/aspnetcore/issues/52439.

vishweshbankwar avatar Dec 13 '23 17:12 vishweshbankwar

This is an ongoing task to optimize and therefore this isn't a tracking issue. If you have specific areas and suggestions to improve them, I'd suggest they're split out into separate issues for you to work.

martinjt avatar Jun 02 '24 17:06 martinjt

Ping to https://github.com/dotnet/runtime/issues/107158 Contention is visible with https://github.com/cijothomas/lock-contention-activity-repro/blob/main/lock-contention-activity-repro/Program.cs

cijothomas avatar Aug 29 '24 21:08 cijothomas

https://github.com/dotnet/runtime/pull/107333 Should fix the contention from runtime itself.

cijothomas avatar Sep 04 '24 16:09 cijothomas

Closing as https://github.com/dotnet/runtime/pull/107333 is merged, and will be part of .NET 10.

cijothomas avatar Sep 06 '24 16:09 cijothomas