opentelemetry-collector
[exporterhelper] persist spancontext through persistent queue
Description
Adds code to marshal/unmarshal the current SpanContext in the persistent queue. This enables proper tracking of internal telemetry (these internal spans were previously dropped when using the storage/persistent queue).
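For context, a minimal sketch of what round-tripping a SpanContext through bytes can look like. The helper names and the fixed 25-byte layout are illustrative assumptions, not the PR's actual encoding:

```go
// Illustrative sketch only: a fixed-layout encoding of the SpanContext
// fields that need to survive the persistent queue.
package queuesketch

import (
	"errors"

	"go.opentelemetry.io/otel/trace"
)

// marshalSpanContext packs TraceID (16 bytes), SpanID (8 bytes), and
// TraceFlags (1 byte) into a 25-byte slice.
func marshalSpanContext(sc trace.SpanContext) []byte {
	tid, sid := sc.TraceID(), sc.SpanID()
	buf := make([]byte, 0, 25)
	buf = append(buf, tid[:]...)
	buf = append(buf, sid[:]...)
	return append(buf, byte(sc.TraceFlags()))
}

// unmarshalSpanContext reverses marshalSpanContext, marking the restored
// context as remote since it crossed a persistence boundary.
func unmarshalSpanContext(b []byte) (trace.SpanContext, error) {
	if len(b) != 25 {
		return trace.SpanContext{}, errors.New("invalid span context payload")
	}
	var tid trace.TraceID
	var sid trace.SpanID
	copy(tid[:], b[:16])
	copy(sid[:], b[16:24])
	return trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    tid,
		SpanID:     sid,
		TraceFlags: trace.TraceFlags(b[24]),
		Remote:     true,
	}), nil
}
```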
Link to tracking issue
Fixes #11740 Fixes #12212
Testing
Unit tests for new functionality
Documentation
.chloggen
Codecov Report
Attention: Patch coverage is 97.08029% with 4 lines in your changes missing coverage. Please review.
Project coverage is 91.32%. Comparing base (8377ee7) to head (b8b78a9). Report is 40 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rterhelper/internal/queuebatch/persistent_queue.go | 95.40% | 2 Missing and 2 partials :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main   #12934      +/-   ##
==========================================
+ Coverage   91.27%   91.32%   +0.05%
==========================================
  Files         508      509       +1
  Lines       28736    28840     +104
==========================================
+ Hits        26228    26338     +110
+ Misses       1992     1988       -4
+ Partials      516      514       -2
```
Hello @mx-psi @bogdandrutu, please take a look if either of you has time. Thanks.
Are you sure we need to persist span links at all? I think I mentioned this before, but I'm pretty sure we only need to persist the SpanContext across the queue; span links are only created after the queue, based on the current SpanContext, in the batcher.
Good point. I should clean up the variable names to make it clear that this change only persists the SpanContext of the Link struct.
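To illustrate the point made above, here is a hedged sketch of the post-queue side: the link is built only when the batch span starts, from the restored SpanContext, so the link itself never needs to be persisted. The function and span names are hypothetical:

```go
// Hypothetical sketch of re-creating the span link after the queue,
// e.g. in the batcher, from a restored SpanContext.
package queuesketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

func startBatchSpan(ctx context.Context, restored trace.SpanContext) (context.Context, trace.Span) {
	tracer := otel.Tracer("exporterhelper")
	// The Link is constructed here, post-queue; only the SpanContext
	// inside it crossed the persistence boundary.
	return tracer.Start(ctx, "exporter/batch",
		trace.WithLinks(trace.Link{SpanContext: restored}))
}
```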
Basic benchmark (BenchmarkPersistentQueue) with the new approach (separate key for the context item, additional storage operations):
```
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16    596360    2054 ns/op    1197 B/op    49 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.329s
```
The same benchmark with the previous commit (byte-based wrapper alongside the request, no additional storage operations):
```
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue-16    781249    1606 ns/op    898 B/op    37 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    1.748s
```
With this contrived benchmark (nop storage client, simple uint64 encoding), skipping the extra storage steps is faster. I am looking into the specifics of the benchmark I wrote yesterday to check for issues and to see whether I can run it against this branch instead.
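For reference, this is roughly what "contrived" means here: a no-op storage client and a bare uint64 payload, so the benchmark measures encoding overhead rather than real I/O. The nopClient type and its interface below are illustrative stand-ins, not the collector's actual storage extension API:

```go
// Hypothetical sketch of a contrived persistent-queue benchmark setup.
package queuesketch

import (
	"context"
	"encoding/binary"
	"testing"
)

// nopClient discards writes and returns nothing on reads (illustrative).
type nopClient struct{}

func (nopClient) Set(ctx context.Context, key string, value []byte) error { return nil }
func (nopClient) Get(ctx context.Context, key string) ([]byte, error)     { return nil, nil }

func BenchmarkNopStorageUint64(b *testing.B) {
	client := nopClient{}
	ctx := context.Background()
	buf := make([]byte, 8)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// Encode a single uint64 request index, as in the contrived case;
		// the storage write itself is free, so only encoding is measured.
		binary.LittleEndian.PutUint64(buf, uint64(i))
		_ = client.Set(ctx, "ri", buf)
	}
}
```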
With the more realistic benchmark added in my fork of the DataDog repo, the results are as follows.
Approach 1: store the SpanContext under a separate key (additional storage operations):
```
go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    3501    296079 ns/op    4146273 B/op    650 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.269s
```
Approach 2: pass the marshaled SpanContext as extra bytes along with the request (no extra storage operations):
```
go test -benchmem -run=^$ -bench ^BenchmarkPersistentQueue_PtraceTraces$ go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch
cpu: Apple M3 Max
BenchmarkPersistentQueue_PtraceTraces-16    2533    541251 ns/op    6210329 B/op    639 allocs/op
PASS
ok      go.opentelemetry.io/collector/exporter/exporterhelper/internal/queuebatch    2.055s
```
As the results show, once the request carries real trace data rather than a single uint64, plus an actual SpanContext, the separate-key approach with extra storage operations is actually somewhat more efficient: ~296 µs and ~4.1 MB allocated per op versus ~541 µs and ~6.2 MB for the byte-wrapper approach.
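A hedged sketch of where the difference likely comes from: wrapping the context bytes around an already-marshaled multi-megabyte request forces an allocation and copy of the whole payload, whereas a separate key only adds one small extra write. All names below are illustrative, not the PR's actual code:

```go
// Illustrative comparison of the two storage layouts.
package queuesketch

// Approach 2: wrap the context bytes with the request. The combined
// buffer requires allocating and copying the entire (large) request.
func wrapWithContext(reqBytes, scBytes []byte) []byte {
	out := make([]byte, 0, 1+len(scBytes)+len(reqBytes))
	out = append(out, byte(len(scBytes))) // length prefix for unwrapping
	out = append(out, scBytes...)
	return append(out, reqBytes...) // full copy of the large payload
}

// Approach 1: two independent writes. The large request bytes are stored
// as-is; only the tiny context blob costs an extra storage operation.
func storeSeparately(set func(key string, value []byte), key string, reqBytes, scBytes []byte) {
	set(key, reqBytes)       // no copy of the large payload
	set(key+"_ctx", scBytes) // small additional storage op
}
```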
Something odd is happening: test coverage fails when I move some of the tests and logic into a separate file. I will look into this further next week.
PTAL at the linter failure.
I've prototyped a PR that addresses comments from @bogdandrutu and brings us to the end state we discussed. I can proceed with breaking it down into mergeable PRs once I get an initial review.
@jackgopack4, thank you for starting the work! https://github.com/open-telemetry/opentelemetry-collector/pull/13188 is merged now, so this can be closed.