
Add warning log on dropped spans in batch span processor

Open · prodion23 opened this pull request 3 years ago • 11 comments

This PR adds a warning log when the batch span processor drops spans.

https://github.com/open-telemetry/opentelemetry-go/issues/3264

One thing I'm unsure about: the global logger doesn't currently offer a Warn method, so I added one here. I'm not sure whether we want to keep that change.

https://github.com/open-telemetry/opentelemetry-go/blob/main/internal/global/internal_logging.go#L45-L63

Maybe the Info log should use a lower-priority verbosity number so that there is somewhere to fit Warn into this scheme?
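For illustration, this is roughly the shape of what I mean; a self-contained sketch rather than the actual diff, with globalLogger simplified (the real file keeps it behind an atomic so SetLogger is concurrency-safe) and V(1) picked arbitrarily as a level above the existing Info:

```go
package global

import (
	"log"
	"os"

	"github.com/go-logr/logr"
	"github.com/go-logr/stdr"
)

// globalLogger stands in for the package's real logger; the actual file
// stores it behind an atomic value.
var globalLogger logr.Logger = stdr.New(log.New(os.Stderr, "", log.LstdFlags))

// Info prints messages about the general state of the API or SDK.
func Info(msg string, keysAndValues ...interface{}) {
	globalLogger.V(4).Info(msg, keysAndValues...)
}

// Warn prints messages that are more important than Info but are not errors.
// In logr, lower verbosity numbers are higher priority, so V(1) slots Warn
// above Info without renumbering the existing levels.
func Warn(msg string, keysAndValues ...interface{}) {
	globalLogger.V(1).Info(msg, keysAndValues...)
}
```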

prodion23 avatar Oct 14 '22 19:10 prodion23

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: prodion23 / name: Donnie (c97924f7d4ab74f3d908ea91741f88099788b1a2)

@Aneurysm9 thank you for the feedback here.

One thing to draw specific scrutiny to: unfortunately we can't use the WithBlocking setting for this test, because that code path never increments the drop count.

https://github.com/prodion23/opentelemetry-go/blob/0052a23c290951aeeae8c6adef5c6219d34d8616/sdk/trace/batch_span_processor.go#L362-L363
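For context, the two enqueue paths look roughly like this (a simplified sketch with trimmed-down stand-in types, not a verbatim copy of the linked source): only the non-blocking path ever touches the dropped counter, which is where the warning has to hook in.

```go
package trace

import (
	"context"
	"sync/atomic"
)

// Trimmed-down stand-ins for the real SDK types, just enough for the sketch
// to compile on its own.
type ReadOnlySpan interface{}

type batchSpanProcessor struct {
	queue   chan ReadOnlySpan
	dropped uint32
}

// Blocking path (WithBlocking): waits for queue space and never records a drop.
func (bsp *batchSpanProcessor) enqueueBlockOnQueueFull(ctx context.Context, sd ReadOnlySpan) bool {
	select {
	case bsp.queue <- sd:
		return true
	case <-ctx.Done():
		return false
	}
}

// Default path: on a full queue the span is discarded and the drop counter is
// incremented. A warning log would hook in here, ideally rate-limited so a
// hot path doesn't flood the logger.
func (bsp *batchSpanProcessor) enqueueDrop(ctx context.Context, sd ReadOnlySpan) bool {
	select {
	case bsp.queue <- sd:
		return true
	default:
		atomic.AddUint32(&bsp.dropped, 1)
	}
	return false
}
```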

The test consistently flakes when run with a small genNumSpans argument; essentially anything less than 10 fails about 50% of the time.

Setting it to 100 has passed consistently, but I understand if there is concern about relying on a known-flaky configuration. With genNumSpans set to 100, the log statement shows that roughly 80 to 90 spans are dropped.

Happy to revisit this test if you have suggestions.

prodion23 avatar Oct 18 '22 20:10 prodion23

I think using the non-blocking-on-full-queue mode for this test is probably appropriate, but I would love to get the opinion of @open-telemetry/go-approvers. This component has a long history of flaky tests, but it looks like the test you have proposed should be fairly reliable.

Please ensure that all commits are associated with a user that has signed the CLA; we will be unable to proceed (and some approvers may not review the PR) until it is signed.

Aneurysm9 avatar Oct 18 '22 23:10 Aneurysm9

It looks like the added test is quite flaky, or does not succeed on Windows: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3380083340/jobs/5613037464

We need our test suite to be solid, so this needs to be refactored or redesigned to ensure the tests pass reliably.

MrAlias avatar Nov 04 '22 21:11 MrAlias

I updated the test to make it more robust. @MrAlias and @Aneurysm9 can you please review to see if this would be acceptable?

I didn't test on Windows, but I found the old version would often fail when run with go test -run ^TestBatchSpanProcessorLogsWarningOnNewDropSpans$ -count 50, whereas this version doesn't.

MadVikingGod avatar Nov 30 '22 20:11 MadVikingGod

Codecov Report

Merging #3289 (9af553a) into main (291aaa0) will increase coverage by 0.2%. The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3289     +/-   ##
=======================================
+ Coverage   78.0%   78.2%   +0.2%     
=======================================
  Files        165     165             
  Lines      11755   11762      +7     
=======================================
+ Hits        9177    9207     +30     
+ Misses      2382    2359     -23     
  Partials     196     196             
Impacted Files                      Coverage   Δ
sdk/trace/batch_span_processor.go   82.0%      <100.0%> (+0.9%) :arrow_up:
sdk/resource/resource.go            83.5%      <0.0%> (+5.7%) :arrow_up:
sdk/trace/provider.go               95.5%      <0.0%> (+6.2%) :arrow_up:

codecov[bot] avatar Nov 30 '22 20:11 codecov[bot]

Looks like the tests are still failing: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3586714280/jobs/6036245937

I'm more in favor of stable testing than accepting these changes as is.

MrAlias avatar Dec 01 '22 00:12 MrAlias

I'm more in favor of stable testing than accepting these changes as is.

Me too. The problem is that to observe dropped spans we can't use the blocking BSP, because it never drops any.
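One way to force drops deterministically, instead of relying on a large genNumSpans, is to pair the default (non-blocking) BSP with a tiny queue and an exporter that blocks until released. A rough sketch, not the code in this PR; the final assertion is left open because it depends on how the warning ends up being surfaced:

```go
package trace_test

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/sdk/trace"
)

// blockingExporter blocks every export until release is closed, so the
// processor's queue fills up deterministically.
type blockingExporter struct {
	release chan struct{}
}

func (e *blockingExporter) ExportSpans(ctx context.Context, _ []trace.ReadOnlySpan) error {
	select {
	case <-e.release:
	case <-ctx.Done():
	}
	return nil
}

func (e *blockingExporter) Shutdown(context.Context) error { return nil }

func TestBatchSpanProcessorForcedDrops(t *testing.T) {
	exp := &blockingExporter{release: make(chan struct{})}

	// A tiny queue in the default (non-blocking) mode: once the queue and the
	// in-flight export are full, additional spans must be dropped.
	bsp := trace.NewBatchSpanProcessor(exp,
		trace.WithMaxQueueSize(2),
		trace.WithMaxExportBatchSize(1),
	)
	tp := trace.NewTracerProvider(
		trace.WithSpanProcessor(bsp),
		trace.WithSampler(trace.AlwaysSample()),
	)

	tr := tp.Tracer("drop-test")
	for i := 0; i < 100; i++ {
		_, span := tr.Start(context.Background(), "span")
		span.End()
	}

	// Unblock the exporter and flush; by now the small queue has forced drops.
	close(exp.release)
	_ = tp.Shutdown(context.Background())

	// Assertions on the emitted warning (or on a dropped-span counter) would
	// go here, depending on how the PR surfaces that information.
}
```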

MadVikingGod avatar Dec 02 '22 20:12 MadVikingGod

Looks like the tests are still failing: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3586714280/jobs/6036245937

I'm more in favor of stable testing than accepting these changes as is.

The test race appears to be fixed by 9af553a. Are there more changes required before this can merge?

alawrenc avatar Apr 10 '23 19:04 alawrenc

Are there more changes required before this can merge?

"This branch has conflicts that must be resolved" 😉

pellared avatar Apr 11 '23 08:04 pellared

@prodion23 Are you planning to refresh the PR and address the comment(s)?

pellared avatar Apr 26 '23 08:04 pellared

Closing this PR as stale, with conflicts that were never resolved.

dmathieu avatar Jul 17 '24 13:07 dmathieu