opentelemetry-js
opentelemetry-js copied to clipboard
fix(sdk-trace-base): eager exporting for batch span processor
Which problem is this PR solving?
Fixes BSP silently dropping spans.
Fixes #3094
Short description of the changes
Add eager exporting to BSP:
- Start an export as soon as the max batch size is reached after a span has been added to the queue.
- The usual export loop runs as-is based on the BSP delay interval.
Additionally:
- Unref the timeout timer when exporting.
Type of change
- [x] Bug fix (non-breaking change which fixes an issue)
Checklist:
- [x] Followed the style guidelines of this project
- [x] Unit tests have been added
Codecov Report
Merging #3458 (f2b8d51) into main (2499708) will increase coverage by
1.90%
. The diff coverage is100.00%
.
:exclamation: Current head f2b8d51 differs from pull request most recent head 68247b3. Consider uploading reports for the commit 68247b3 to get more accurate results
@@ Coverage Diff @@
## main #3458 +/- ##
==========================================
+ Coverage 90.52% 92.42% +1.90%
==========================================
Files 159 326 +167
Lines 3757 9284 +5527
Branches 835 1967 +1132
==========================================
+ Hits 3401 8581 +5180
- Misses 356 703 +347
Files | Coverage Δ | |
---|---|---|
...dk-trace-base/src/export/BatchSpanProcessorBase.ts | 94.16% <100.00%> (+1.24%) |
:arrow_up: |
Reopening after dealing with browser issues
I remember there was some discussion about this in the spec. Do you know if that was ever resolved or what the current state of that is?
Here is the spec issue https://github.com/open-telemetry/opentelemetry-specification/issues/849
https://github.com/open-telemetry/opentelemetry-specification/pull/3024/files
We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI
We encountered this issue too. @dyladan @seemk @MSNev Is there something blocking it from being merged? @TimurMisharin FYI
Not that I know of, just needs reviews
Nothing blocking for me I'm just on vacation. Spec merged last week
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes.
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes.
I had been waiting for a resolution of the conversation between you and @MSNev
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes.
I had been waiting for a resolution of the conversation between you and @MSNev
And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes.
I had been waiting for a resolution of the conversation between you and @MSNev
And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.
This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?
@dyladan - Hope you had a good holiday.. Can you provide an ETA on getting this fix merged ? We've got a production issue this fixes.
I had been waiting for a resolution of the conversation between you and @MSNev
And my comment (I believe) still stands, we should NOT have a timer running (including an interval timer), if there is nothing "batched" as this will cause issues relating to performance etc.
This definitely fits the narrative/issue we're trying to resolve. @dyladan can you provide an ETA on this?
I can update my PR either tomorrow or on Wednesday, however getting rid of a free running timer on a 5 second interval seems to be a microoptimisation, is there even any data to back these performance concerns?
I can update my PR either tomorrow or on Wednesday, however getting rid of a free running timer on a 5 second interval seems to be a microoptimisation, is there even any data to back these performance concerns?
I don't have any public data that I can provide, but when a client has a running timer it restricts the ability of that process being put to sleep/reduce power usage state and therefore to reduce the amount of energy that is being burnt.
This issue mostly affects Clients (browser / mobile devices), but will also indirectly affect hosted node instances that will be constantly burning CPU even when it has no events/traffic -- thus costing users $$$
I am bumping this thread as it seems to have gone a bit dormant. We know of production impact in NodeJS/GraphQL that this PR resolves. In order to accelerate the customer and get them unblocked from leverage Batch Processor we had them fork the repo, rebase the branch and build it. They did this, and I can report that this has resolved the issue and they are no longer dropping spans. Also, we have not identified any of the issues related to performance concerns being discussed here. We're in agreement that the timer is a micro-optimization and do not believe it should hold this up any longer. If we want to continue debating the timer, I propose we raise a new issue and let this one pass.
If anyone would like more information, let me know what you want and I will try to provide it. We'd be really grateful if we could go ahead and see this make its way to being merged soon so we can pick up the fix and get them back into the main branch...
Cheers, Colin
@cfallwell my apologies I seem to have let this fall through the cracks. In the absence of public data from @MSNev I think I agree we should move forward. Any alternative implementation suggestions are welcome or concrete data to quantify the performance impact.
I'll put this on my schedule to review tomorrow.
My original comments still stand, in fact I have had to fix issues related to "background" timers this week for Application Insights because it kept causing the app to "wake up" the App and the burn CPU for no reason (as 99% of the time the queue was empty). This issue was raised by one of our internal partners who have an extremely robust perp framework and runtime measurements.
Also by NOT having a running timer in the background (and assuming that the timeout period is not too long), the need to unref
the timer diminishes as there is no running timer to stop the app from closing.
Using standard setInterval
/ clearInterval
(I could not change to setTimeout
in our master
repo as we return the interval id as part of the API. https://github.com/microsoft/ApplicationInsights-JS/commit/0d970c7d7ff63dd1f924771e69d76ff00953bab3
But in the current release, we use a wrapper (which works for any runtime (Node, Browser, worker) https://github.com/microsoft/ApplicationInsights-JS/commit/d14c15bfe7cd4e8274b5b811c27711b85b000dd3 while this does use unref
this is because this background timer is generally only used for internal non-critical log reporting.
Both of these just pump the "log" event into our export pipeline (which has it's own timer (which is only running when there is something waiting))
#3958 solved this