opentelemetry-js

fix(sdk-trace-base): eager exporting for batch span processor

seemk opened this pull request · Nov 30 '22

Which problem is this PR solving?

Fixes the batch span processor (BSP) silently dropping spans.

Fixes #3094

Short description of the changes

Add eager exporting to BSP:

  1. Start an export as soon as the max batch size is reached after a span has been added to the queue.
  2. The usual export loop runs as-is based on the BSP delay interval.

Additionally:

  • Unref the timeout timer when exporting.
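
For illustration only, here is a minimal TypeScript sketch of both ideas (eager export on a full batch, and an unref'd delay timer). The names (`EagerBatchProcessor`, `maxBatchSize`, `delayMillis`, `exportBatch`) are hypothetical and only loosely mirror `BatchSpanProcessorBase`; this is not the code in this PR.

```typescript
// Hypothetical, simplified processor; not the actual BatchSpanProcessorBase.
class EagerBatchProcessor<T> {
  private queue: T[] = [];
  private timer?: NodeJS.Timeout;

  constructor(
    private readonly maxBatchSize: number,
    private readonly delayMillis: number,
    private readonly exportBatch: (batch: T[]) => void
  ) {}

  onEnd(item: T): void {
    this.queue.push(item);

    // Eager path: export as soon as a full batch is available instead of
    // waiting for the delay interval to elapse.
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
      return;
    }

    // Regular path: arm the delay timer if it is not already running, and
    // unref it so it does not keep the Node.js process alive on its own.
    if (this.timer === undefined) {
      this.timer = setTimeout(() => this.flush(), this.delayMillis);
      this.timer.unref();
    }
  }

  private flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.queue.length === 0) {
      return;
    }
    this.exportBatch(this.queue.splice(0, this.maxBatchSize));
  }
}
```

The real processor also enforces a maximum queue size and guards against concurrent exports, which this sketch omits.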

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)

Checklist:

  • [x] Followed the style guidelines of this project
  • [x] Unit tests have been added

seemk · Nov 30 '22

Codecov Report

Merging #3458 (f2b8d51) into main (2499708) will increase coverage by 1.90%. The diff coverage is 100.00%.

Note: Current head f2b8d51 differs from the pull request's most recent head 68247b3. Consider uploading reports for commit 68247b3 to get more accurate results.

@@            Coverage Diff             @@
##             main    #3458      +/-   ##
==========================================
+ Coverage   90.52%   92.42%   +1.90%     
==========================================
  Files         159      326     +167     
  Lines        3757     9284    +5527     
  Branches      835     1967    +1132     
==========================================
+ Hits         3401     8581    +5180     
- Misses        356      703     +347     
Files                                                   Coverage Δ
...dk-trace-base/src/export/BatchSpanProcessorBase.ts   94.16% <100.00%> (+1.24%) ↑

... and 178 files with indirect coverage changes

codecov[bot] · Nov 30 '22

Reopening after dealing with browser issues.

seemk · Nov 30 '22

I remember there was some discussion about this in the spec. Do you know if that was ever resolved or what the current state of that is?

dyladan · Dec 07 '22

Here is the spec issue: https://github.com/open-telemetry/opentelemetry-specification/issues/849

dyladan · Dec 07 '22

https://github.com/open-telemetry/opentelemetry-specification/pull/3024/files

dyladan · Dec 07 '22

We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI

arbiv · Jan 16 '23

> We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI

Not that I know of; it just needs reviews.

seemk · Jan 16 '23

Nothing blocking for me; I'm just on vacation. The spec change merged last week.

dyladan · Jan 16 '23

@dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue that this fixes.

cftechwiz · Feb 03 '23

> @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue that this fixes.

I had been waiting for a resolution of the conversation between you and @MSNev.

dyladan · Feb 03 '23

> > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue that this fixes.

> I had been waiting for a resolution of the conversation between you and @MSNev.

And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause performance issues.

MSNev · Feb 03 '23

> > > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue that this fixes.

> > I had been waiting for a resolution of the conversation between you and @MSNev.

> And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause performance issues.

This definitely fits the narrative/issue we're trying to resolve. @dyladan, can you provide an ETA on this?

cftechwiz · Feb 06 '23

> > > > @dyladan - Hope you had a good holiday. Can you provide an ETA on getting this fix merged? We've got a production issue that this fixes.

> > > I had been waiting for a resolution of the conversation between you and @MSNev.

> > And my comment (I believe) still stands: we should NOT have a timer running (including an interval timer) if there is nothing "batched", as this will cause performance issues.

> This definitely fits the narrative/issue we're trying to resolve. @dyladan, can you provide an ETA on this?

I can update my PR either tomorrow or on Wednesday. However, getting rid of a free-running timer on a 5-second interval seems like a micro-optimisation; is there even any data to back these performance concerns?

seemk · Feb 06 '23

> I can update my PR either tomorrow or on Wednesday. However, getting rid of a free-running timer on a 5-second interval seems like a micro-optimisation; is there even any data to back these performance concerns?

I don't have any public data that I can provide, but when a client has a running timer, it restricts the process from being put into a sleep/reduced-power state, and therefore from reducing the amount of energy being burnt.

This mostly affects clients (browsers / mobile devices), but it will also indirectly affect hosted Node instances, which will constantly burn CPU even when there are no events/traffic, thus costing users $$$.

MSNev · Feb 17 '23

I am bumping this thread as it seems to have gone a bit dormant. We know of production impact in Node.js/GraphQL that this PR resolves. To accelerate the customer and get them unblocked from leveraging the batch processor, we had them fork the repo, rebase this branch, and build it. They did this, and I can report that it has resolved the issue and they are no longer dropping spans. Also, we have not observed any of the performance issues being discussed here. We're in agreement that the timer is a micro-optimization and do not believe it should hold this up any longer. If we want to continue debating the timer, I propose we raise a new issue and let this one pass.

If anyone would like more information, let me know what you want and I will try to provide it. We'd be really grateful to see this merged soon so we can pick up the fix and get them back onto the main branch.

Cheers, Colin

cftechwiz · Mar 30 '23

@cfallwell, my apologies, I seem to have let this fall through the cracks. In the absence of public data from @MSNev, I think I agree we should move forward. Alternative implementation suggestions are welcome, as is concrete data to quantify the performance impact.

I'll put this on my schedule to review tomorrow.

dyladan · Apr 24 '23

My original comments still stand; in fact, I have had to fix issues related to "background" timers this week for Application Insights, because they kept causing the app to "wake up" and burn CPU for no reason (as 99% of the time the queue was empty). This issue was raised by one of our internal partners, who have an extremely robust perf framework and runtime measurements.

Also, by NOT having a running timer in the background (and assuming that the timeout period is not too long), the need to unref the timer diminishes, as there is no running timer to stop the app from closing.

Using standard setInterval / clearInterval (I could not change to setTimeout in our master repo, as we return the interval id as part of the API): https://github.com/microsoft/ApplicationInsights-JS/commit/0d970c7d7ff63dd1f924771e69d76ff00953bab3

But in the current release, we use a wrapper which works for any runtime (Node, browser, worker): https://github.com/microsoft/ApplicationInsights-JS/commit/d14c15bfe7cd4e8274b5b811c27711b85b000dd3. While this does use unref, that is because this background timer is generally only used for internal, non-critical log reporting.

Both of these just pump the "log" event into our export pipeline, which has its own timer that only runs when there is something waiting.

MSNev · Apr 24 '23
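
For illustration only, a rough TypeScript sketch of the pattern described in the comment above: a cross-runtime timeout helper that feature-detects `unref()`, and a queue whose flush timer exists only while something is waiting. The names (`scheduleTimeout`, `OnDemandFlushQueue`) are hypothetical; this is neither the Application Insights wrapper nor the OpenTelemetry SDK code.

```typescript
// Hypothetical cross-runtime helper; not the Application Insights wrapper.
interface TimerHandle {
  cancel(): void;
}

// setTimeout exists in Node, browsers, and workers, but only Node timers
// expose unref(), so we feature-detect it instead of assuming it.
function scheduleTimeout(fn: () => void, ms: number): TimerHandle {
  const id: any = setTimeout(fn, ms);
  if (typeof id?.unref === 'function') {
    id.unref(); // Node only: do not keep the process alive for this timer.
  }
  return { cancel: () => clearTimeout(id) };
}

// The flush timer is armed only when the first item arrives and is torn
// down once the queue drains, so an idle client runs no timers at all.
class OnDemandFlushQueue<T> {
  private items: T[] = [];
  private pending?: TimerHandle;

  constructor(
    private readonly delayMillis: number,
    private readonly flush: (items: T[]) => void
  ) {}

  add(item: T): void {
    this.items.push(item);
    if (this.pending === undefined) {
      this.pending = scheduleTimeout(() => {
        this.pending = undefined;
        this.flush(this.items.splice(0));
      }, this.delayMillis);
    }
  }
}
```

The sketch only shows the shape of the no-idle-timer approach; the trade-offs debated in this thread are unchanged.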

#3958 solved this.

dyladan · Oct 06 '23