opentelemetry-collector
[exporterhelper] persist spancontext through persistent queue
Description
Adds code to marshal/unmarshal the current SpanContext in the persistent queue. This enables proper tracking of internal telemetry (these internal spans were previously dropped when using the storage/persistent queue).
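For context, a minimal sketch of what round-tripping a SpanContext through bytes can look like. The helper names and the fixed 25-byte layout are illustrative assumptions, not the PR's actual encoding:

```go
// Illustrative sketch only: a fixed-layout encoding of the SpanContext
// fields that need to survive the persistent queue.
package queuesketch

import (
	"errors"

	"go.opentelemetry.io/otel/trace"
)

// marshalSpanContext packs TraceID (16 bytes), SpanID (8 bytes), and
// TraceFlags (1 byte) into a 25-byte slice.
func marshalSpanContext(sc trace.SpanContext) []byte {
	tid, sid := sc.TraceID(), sc.SpanID()
	buf := make([]byte, 0, 25)
	buf = append(buf, tid[:]...)
	buf = append(buf, sid[:]...)
	return append(buf, byte(sc.TraceFlags()))
}

// unmarshalSpanContext reverses marshalSpanContext, marking the restored
// context as remote since it crossed a persistence boundary.
func unmarshalSpanContext(b []byte) (trace.SpanContext, error) {
	if len(b) != 25 {
		return trace.SpanContext{}, errors.New("invalid span context payload")
	}
	var tid trace.TraceID
	var sid trace.SpanID
	copy(tid[:], b[:16])
	copy(sid[:], b[16:24])
	return trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    tid,
		SpanID:     sid,
		TraceFlags: trace.TraceFlags(b[24]),
		Remote:     true,
	}), nil
}
```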
Link to tracking issue
Fixes #11740 Fixes #12212
Testing
Unit tests for new functionality
Documentation
.chloggen
Codecov Report
Attention: Patch coverage is 97.08029% with 4 lines in your changes missing coverage. Please review.
Project coverage is 91.32%. Comparing base (8377ee7) to head (b8b78a9). Report is 40 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rterhelper/internal/queuebatch/persistent_queue.go | 95.40% | 2 Missing and 2 partials :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main   #12934      +/-   ##
==========================================
+ Coverage   91.27%   91.32%   +0.05%
==========================================
  Files         508      509       +1
  Lines       28736    28840     +104
==========================================
+ Hits        26228    26338     +110
+ Misses       1992     1988       -4
+ Partials      516      514       -2
```
Hello @mx-psi @bogdandrutu, please take a look if either of you has time. Thanks.
Are you sure we need to persist span links at all? I think I mentioned this before, but I'm pretty sure we only need to persist the SpanContext across the queue; span links are only created after the queue, based on the current SpanContext, in the batcher.
Good point. I should clean up the variable names to make it clear that this change only persists the SpanContext of the Link struct.
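To illustrate the point made above, here is a hedged sketch of the post-queue side: the link is built only when the batch span starts, from the restored SpanContext, so the link itself never needs to be persisted. The function and span names are hypothetical:

```go
// Hypothetical sketch of re-creating the span link after the queue,
// e.g. in the batcher, from a restored SpanContext.
package queuesketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

func startBatchSpan(ctx context.Context, restored trace.SpanContext) (context.Context, trace.Span) {
	tracer := otel.Tracer("exporterhelper")
	// The Link is constructed here, post-queue; only the SpanContext
	// inside it crossed the persistence boundary.
	return tracer.Start(ctx, "exporter/batch",
		trace.WithLinks(trace.Link{SpanContext: restored}))
}
```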
Basic benchmark (BenchmarkPersistentQueue) with the new approach (separate key for the context item, additional storage operations):
```
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16    596360    2054 ns/op    1197 B/op    49 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.329s
```
The same benchmark with the previous commit (byte-based wrapper alongside the request, no additional storage operations):
```
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16    781249    1606 ns/op    898 B/op    37 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    1.748s
```
With this contrived benchmark (nop storage client, simple uint64 encoding), skipping the extra storage steps is faster. I am looking into the specifics of the benchmark I wrote yesterday to check for issues and to see whether I can run it against this branch instead.
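For reference, this is roughly what "contrived" means here: a no-op storage client and a bare uint64 payload, so the benchmark measures encoding overhead rather than real I/O. The nopClient type and its interface below are illustrative stand-ins, not the collector's actual storage extension API:

```go
// Hypothetical sketch of a contrived persistent-queue benchmark setup.
package queuesketch

import (
	"context"
	"encoding/binary"
	"testing"
)

// nopClient discards writes and returns nothing on reads (illustrative).
type nopClient struct{}

func (nopClient) Set(ctx context.Context, key string, value []byte) error { return nil }
func (nopClient) Get(ctx context.Context, key string) ([]byte, error)     { return nil, nil }

func BenchmarkNopStorageUint64(b *testing.B) {
	client := nopClient{}
	ctx := context.Background()
	buf := make([]byte, 8)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// Encode a single uint64 request index, as in the contrived case;
		// the storage write itself is free, so only encoding is measured.
		binary.LittleEndian.PutUint64(buf, uint64(i))
		_ = client.Set(ctx, "ri", buf)
	}
}
```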
With the more realistic benchmark added in my fork of the DataDog repo, the results are as follows.
Approach 1: store the SpanContext under a separate key (additional storage operations):
```
go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    3501    296079 ns/op    4146273 B/op    650 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.269s
```
Approach 2: pass the marshaled SpanContext as extra bytes along with the request (no extra storage operations):
```
go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    2533    541251 ns/op    6210329 B/op    639 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.055s
```
As the results show, once the request carries real trace data rather than a single uint64, plus an actual SpanContext, the separate-key approach with extra storage operations is actually somewhat more efficient: ~296 µs and ~4.1 MB allocated per op versus ~541 µs and ~6.2 MB for the byte-wrapper approach.
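A hedged sketch of where the difference likely comes from: wrapping the context bytes around an already-marshaled multi-megabyte request forces an allocation and copy of the whole payload, whereas a separate key only adds one small extra write. All names below are illustrative, not the PR's actual code:

```go
// Illustrative comparison of the two storage layouts.
package queuesketch

// Approach 2: wrap the context bytes with the request. The combined
// buffer requires allocating and copying the entire (large) request.
func wrapWithContext(reqBytes, scBytes []byte) []byte {
	out := make([]byte, 0, 1+len(scBytes)+len(reqBytes))
	out = append(out, byte(len(scBytes))) // length prefix for unwrapping
	out = append(out, scBytes...)
	return append(out, reqBytes...) // full copy of the large payload
}

// Approach 1: two independent writes. The large request bytes are stored
// as-is; only the tiny context blob costs an extra storage operation.
func storeSeparately(set func(key string, value []byte), key string, reqBytes, scBytes []byte) {
	set(key, reqBytes)       // no copy of the large payload
	set(key+"_ctx", scBytes) // small additional storage op
}
```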
Something odd is happening: test coverage fails when I move some of the tests and logic into a separate file. I will look into this further next week.
PTAL at the linter failure.
I've prototyped a PR that addresses comments from @bogdandrutu and brings us to the end state we discussed. I can proceed with breaking it down into mergeable PRs once I get an initial review.
@jackgopack4, thank you for starting the work! https://github.com/open-telemetry/opentelemetry-collector/pull/13188 is merged now, so this can be closed.