opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

[exporterhelper] persist spancontext through persistent queue

Open jackgopack4 opened this issue 7 months ago • 7 comments

Description

Adds code to marshal/unmarshal current SpanContext in the persistent queue. This will enable proper tracking of internal telemetry (these internal spans were getting dropped when using storage/persistent queue).

Link to tracking issue

Fixes #11740 Fixes #12212

Testing

Unit tests for new functionality

Documentation

.chloggen

jackgopack4 avatar Apr 28 '25 18:04 jackgopack4

Codecov Report

Attention: Patch coverage is 97.08029% with 4 lines in your changes missing coverage. Please review.

Project coverage is 91.32%. Comparing base (8377ee7) to head (b8b78a9). Report is 40 commits behind head on main.

Files with missing lines Patch % Lines
...rterhelper/internal/queuebatch/persistent_queue.go 95.40% 2 Missing and 2 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12934      +/-   ##
==========================================
+ Coverage   91.27%   91.32%   +0.05%     
==========================================
  Files         508      509       +1     
  Lines       28736    28840     +104     
==========================================
+ Hits        26228    26338     +110     
+ Misses       1992     1988       -4     
+ Partials      516      514       -2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Apr 28 '25 18:04 codecov[bot]

hello @mx-psi @bogdandrutu if either of you have time to take a look. Thanks.

jackgopack4 avatar May 01 '25 14:05 jackgopack4

Are you sure we need to persist span links at all? I think I mentioned this before, but I'm pretty sure we only need to persist the SpanContext across the queue; span links are only created after the queue, based on the current SpanContext, in the batcher.

good point, I should maybe clean up the variable names to make it clear I am only persisting the SpanContext of the Link struct with this change

jackgopack4 avatar May 05 '25 16:05 jackgopack4

Basic benchmark (BenchmarkPersistentQueue) with new approach (separate key for context item, additional Storage Ops)

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16       596360        2054 ns/op      1197 B/op       49 allocs/op
PASS
ok    go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch 2.329s

Same benchmark with previous commit (byte-based wrapper alongside request, no additional Storage Ops)

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16       781249        1606 ns/op       898 B/op       37 allocs/op
PASS
ok    go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch 1.748s

With the contrived benchmark (Nop storage client, simple uint64 encoding), skipping extra storage steps is faster. I am looking into the specifics of the benchmark I wrote yesterday to see if there are any issues, and if I can run it against this branch instead.

With the more realistic benchmark added in my fork in Datadog repo, see the following results:

Add SpanContext as separate key and add Storage operations

go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    	    3501	    296079 ns/op	 4146273 B/op	     650 allocs/op
PASS
ok  	go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch	2.269s

Pass marshaled SpanContext as extra bytes along with the request, no extra Storage operations

go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    	    2533	    541251 ns/op	 6210329 B/op	     639 allocs/op
PASS
ok  	go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch	2.055s

As we can see, when there is more data than a single uint64 type being passed in the request, as well as an actual SpanContext, it is actually somewhat more efficient to create more storage operations.

jackgopack4 avatar May 09 '25 19:05 jackgopack4

something odd is happening with test-coverage failing when I move some of the tests and logic into a separate file. Will look into this further next week.

jackgopack4 avatar May 30 '25 21:05 jackgopack4

PTAL at the linter failure

dmitryax avatar Jun 02 '25 18:06 dmitryax

I've prototyped a PR that addresses comments from @bogdandrutu and brings us to the end state that we discussed. I can proceed with breaking it down to mergable PRs once I get an initial review

dmitryax avatar Jun 06 '25 02:06 dmitryax

@jackgopack4 thank you for the starting the work! https://github.com/open-telemetry/opentelemetry-collector/pull/13188 is merged now, so this can be closed

dmitryax avatar Jun 19 '25 18:06 dmitryax