opentelemetry-dotnet icon indicating copy to clipboard operation
opentelemetry-dotnet copied to clipboard

Migrate BatchExporterProcessor to async

Open verdie-g opened this issue 1 year ago • 7 comments

Fixes #5778

Changes

This makes BatchExporterProcessor async to avoid having dedicated threads that are idle 99% of the time.

I first tried to make the implementation close to the original synchronous implementation where only the ExportProc method was allowed to call BaseExporter.ExportAsync but it was a unreviewable deadlock fest. This proposed implementation allows any method (Export, Flush, Shutdown) to also call ExportAsync but always limit a single concurrent export. This greatly simplifies the code and only took me 30 minutes to implement.

It is very important that we first discuss about how BaseExporter.ExportAsync will be implemented on all available exporters before merging this change, otherwise, this is going to push synchronous job on the thread pool which will make the situation worse than before.

Merge requirement checklist

  • [ ] CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • [ ] Unit tests added/updated
  • [ ] Appropriate CHANGELOG.md files updated for non-trivial changes
  • [ ] Changes in public API reviewed (if applicable)

verdie-g avatar Sep 16 '24 23:09 verdie-g

Can you describe the overall idea on how this will be implemented? This a core functionality, and there cannot be any breaking changes at all. It is intentional that the current design spins up own thread and not rely on async tasks/thread-pool. And also the benefits of this change.

cijothomas avatar Sep 17 '24 01:09 cijothomas

The exporter loop is kept but is now running on the thread pool using a simple Task.Run. The Export, Flush, and Shutdown operations were allowed to speed up the exporting by disabling the delay between two exports. In this change, it's done differently to simplify the code. The 3 operations are allowed to call the BaseExporter.ExportAsync method but only one concurrent export is allowed. BaseExporter.ExportAsync is a new method that by default does async-over-sync by calling Export just to avoid a breaking change. Though, it is important that this method gets overridden by the different implementations (zipkin, OTEL) to avoid pushing synchronous operations on the thread pool.

The idea of a thread pool is to avoid the cost of the creation/deletion of a thread as well the cost of the resources associated to the thread. But most importantly, it avoids having to perform a context switch to yield the execution to another task. In an ideal application, the thread pool has as many thread as CPU cores, and each thread never leaves its core. Each context switch will impact the throughput of the thread pool. If there is only a few extra threads used, it's fine, but if all libraries start spawning dedicated threads for their background jobs that can be an issue. Here, 3 threads are spawned, and maybe a 4th one will be added for the profiling signal. I won't have any figures to share but the OTEL library could become a better citizen in the .NET ecosystem by using the thread pool.

It is also important to note that currently, only on a part of the work done by the export is performed on the dedicated thread. When HttpClient is used, or any other network class, the thread pool will still be used.

Additionally, this will ease having an asynchronous flush method in the future, which I believe a user request will appear some day about that as you never want to block the thread pool and currently it does.

BONUS1: we get to remove the synchronousSendSupportedByCurrentPlatform. BONUS2: might fix https://github.com/open-telemetry/opentelemetry-dotnet/issues/2816

verdie-g avatar Sep 17 '24 02:09 verdie-g

It is synchronous. And the specification says it must be so.

I don't think that's true. Here is a quote from the spec:

Depending on the implementation the result of the export may be returned to the Processor not in the return value of the call to Export() but in a language specific way for signaling completion of an asynchronous task.

Because of that we're always going to run into somewhere having to jump from sync to async back to sync.

I've started implementing the ExportAsync method around, and found cases where we need to synchronously wait on a Task but I don't think I found a case where a synchronous operation occurs after an asynchronous one. Actually, I'm not sure what you mean. Could you clarify that point?

The telemetry pipeline being synchronous is a good thing. We don't want to block real work being done on a thread because it tried to write some telemetry and the SDK has to wait on a busy thread pool.

When a context switch occurs to work on the OTEL thread, it will actually prevent a thread pool thread to do its job. Using the thread-pool and async exports would greatly reduce that overhead.

How will SimpleExporterProcessor work with ExportAsync? Are we expecting SimpleExporterProcessor to continue calling Export or will it call ExportAsync?

Either should be fine. If a network exporter is used with SimpleExporterProcessor, there is obviously no guarantees about the performance, so I don't think ExportAsync will change anything here.

It is also worth noting that because of the synchronous export API, some exporters need to do some sync-over-async on some platforms.

Depending on that answer, can we see some of the existing exporters (maybe start with Otlp) updated as a demonstration? We require Export be implemented via abstract. Authors will have to code something synchronous. If we offer ExportAsync how does that help with this? I don't want to see this...

For OTLP, the Export method will indeed be doing sync-over-async on the ExportAsync. But it does not matter because only ExportAsync would be called.

What I'm trying to figure out is, if I'm an exporter author, and I have to do a synchronous version anyway, why would I bother with an asynchronous version?

I think it's very similar to how the Stream API works in .NET. If we focus on the read API, by default, you only need to implement Read(byte[] buffer, int offset, int count). Other methods, such as ReadByte or ReadAsync will fallback on Read. The stream will work fine but you'll get a performance hit until you implement the other methods.

Here is a very concrete example https://github.com/dotutils/streamutils/pull/2 where some benchmarks showed that this Stream was a performance bottleneck, so extra Read methods were implemented to fix that.

In the case of ExportAsync it will be the same. Though, it is to expect to have way less implementations of BaseExporter than implementations of Stream in the wild. Also, most of the BaseExporter implementations are under our control in the OTEL repositories.

verdie-g avatar Sep 18 '24 18:09 verdie-g

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Sep 26 '24 03:09 github-actions[bot]

Here is what the implementations of BaseExport.ExportAsync could look like: https://github.com/verdie-g/opentelemetry-dotnet/commit/02eb4503bcd5dfcb415351f1791ddc3b913eb5a7

verdie-g avatar Sep 27 '24 22:09 verdie-g

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Oct 06 '24 03:10 github-actions[bot]

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Oct 14 '24 03:10 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Oct 21 '24 03:10 github-actions[bot]

Kind of awkward that the PR gets automatically closed when it's waiting for a review 🤔 @CodeBlanch @cijothomas could you have a look at it again? The TL;DR is that you don't create threads in .NET. Some platforms don't even support it. And OTEL, which will be probably become the most downloaded nuget, creates 3-4 of them when it could use the thread pool.

verdie-g avatar Oct 21 '24 14:10 verdie-g

Sorry @verdie-g the short version is this isn't a priority right now. Sorry! That doesn't mean we don't ❤️ the passion and the contribution. We really do, we've just got a lot to get done before we release 1.10.0 stable next month. This is a big change so I'm not going to consider it for inclusion this late in the cycle. We can come back to it for the next release though.

CodeBlanch avatar Oct 21 '24 17:10 CodeBlanch

Could be amazing to have this merged at some point. A bit frustrating that OTEL doesn't even work with Blazor because of missing threading in the browser :)

madskonradsen avatar Nov 15 '24 19:11 madskonradsen

@CodeBlanch Any chance this can be reopened now?

alexaka1 avatar Nov 21 '24 14:11 alexaka1

Reopen to bring @CodeBlanch attention;)

Kielek avatar Nov 21 '24 14:11 Kielek

I've added a boolean to the BatchExportProcessor constructor so by default the behavior doesn't change. Tell me what you think and I'll add the tests.

I've first tried to create a new async implementation of BaseExportProcessor but that would cause a breaking change when inheriting from that one instead of BatchExportProcessor.

Alternatively we could also use AppContext.TryGetSwitch to have a global flag instead of a by-instance one.

verdie-g avatar Nov 23 '24 15:11 verdie-g

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Dec 01 '24 03:12 github-actions[bot]

Still awaiting review.

madskonradsen avatar Dec 01 '24 06:12 madskonradsen

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Dec 12 '24 03:12 github-actions[bot]

Still not stale

madskonradsen avatar Dec 12 '24 05:12 madskonradsen

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] avatar Dec 21 '24 03:12 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Dec 28 '24 03:12 github-actions[bot]

Afaik this PR is ready to review. Can we reopen it again @Kielek?

alexaka1 avatar Dec 29 '24 10:12 alexaka1

I really really hate to pry, but I believe this PR fell through the cracks and that's the only reason it was closed. Not because it is not needed. @CodeBlanch @Kielek

alexaka1 avatar Mar 08 '25 21:03 alexaka1