opentelemetry-go-instrumentation
PoC: Custom SDK
PoC for #954
This is a proof-of-concept for an SDK fully implemented by the auto-instrumentation. This supports all span functionality:
- Sampling (TODO: the `sample` method needs to be instrumented)
- Random correct ID generation
- All `Start` options
  - `WithLinks`
  - `WithNewRoot`
  - `WithSpanKind` (defaults to probe `SpanKind` if not set)
  - `WithTimestamp`
  - `WithAttributes`
- The `AddEvent` method, including all options
  - `WithStacktrace`
  - `WithAttributes`
  - `WithTimestamp`
- The `AddLink` method
- The `IsRecording` method (TODO: based on sampling support)
- The `SpanContext` method
- The `SetStatus` method
- The `SetAttribute` method
- The `TracerProvider` method
- All `End` options
  - `WithTimestamp`
Design
auto.GetTracerProvider
There is only one function exported publicly. This is GetTracerProvider in go.opentelemetry.io/auto.
This function returns a singleton instance of an opentelemetry-go trace.TracerProvider that is held in the internal/sdk package.
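The singleton behavior described above can be sketched with `sync.Once`. This is a minimal, stdlib-only illustration and not the code in this PR: the `TracerProvider` interface, the `Tracer` struct, and the `provider` type are hypothetical stand-ins for the opentelemetry-go `trace.TracerProvider` and the type held in `internal/sdk`, and the use of `sync.Once` is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Tracer is a hypothetical, reduced stand-in for a trace.Tracer.
type Tracer struct{ Name string }

// TracerProvider is a hypothetical, reduced stand-in for the
// opentelemetry-go trace.TracerProvider interface.
type TracerProvider interface {
	Tracer(name string) Tracer
}

// provider stands in for the implementation held in internal/sdk.
type provider struct{ id int }

func (*provider) Tracer(name string) Tracer { return Tracer{Name: name} }

var (
	once     sync.Once
	instance TracerProvider
)

// GetTracerProvider returns the same TracerProvider on every call.
// Assumption: sync.Once is used here only to illustrate the singleton.
func GetTracerProvider() TracerProvider {
	once.Do(func() { instance = &provider{id: 1} })
	return instance
}

func main() {
	a, b := GetTracerProvider(), GetTracerProvider()
	fmt.Println("same instance:", a == b)
	fmt.Println("tracer name:", a.Tracer("example").Name)
}
```

Every caller receives the same provider value, so all spans are created through one SDK instance that a single probe can observe.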
internal/sdk
The go.opentelemetry.io/auto/internal/sdk package is added. This is a "full feature" OTel trace SDK from the perspective of the Tracer and Span.
All data about any Span created will be built in userspace. This is stored (mostly) in the collector's ptrace.Traces type.
When the Span is ended, the ptrace.Traces is marshaled into a proto binary encoding and passed as a buffer to the ended method of the Span. This method does nothing; it exists so that a uprobe can be inserted at its call site.
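The hand-off from `End` to the no-op `ended` method might look roughly like the sketch below. It is illustrative only: the `span` type and its fields are hypothetical, JSON stands in for the proto binary encoding of `ptrace.Traces` to keep the example stdlib-only, and the `//go:noinline` directive is an assumption about how such a hook could be kept available as a probe target.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// span is a hypothetical, reduced stand-in for the SDK's Span type.
type span struct {
	Name  string            `json:"name"`
	Attrs map[string]string `json:"attrs,omitempty"`
	done  bool
}

// encode serializes the span. Assumption: JSON is a stand-in for the
// proto binary encoding of ptrace.Traces described in the text.
func (s *span) encode() ([]byte, error) { return json.Marshal(s) }

// End serializes the span once and hands the bytes to ended.
func (s *span) End() {
	if s.done {
		return
	}
	s.done = true
	buf, err := s.encode()
	if err != nil {
		return
	}
	s.ended(buf)
}

// ended intentionally does nothing with buf in userspace. A uprobe is
// expected to read the buffer when this method is called.
//
//go:noinline
func (*span) ended(buf []byte) { _ = buf }

func main() {
	s := &span{Name: "outer", Attrs: map[string]string{"key": "value"}}
	buf, _ := s.encode()
	s.End()
	s.End() // the second call is a no-op
	fmt.Printf("handed %d encoded bytes to ended\n", len(buf))
}
```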
auto/sdk probe
A simple probe is added to instrument the go.opentelemetry.io/auto/internal/sdk package. This probe does not rely on any offsets from the sdk types and simply routes the encoded span data from ended to the events eBPF map.
From there the ptrace.Traces data is unmarshaled and parsed into a SpanEvent that the Controller processes in the normal fashion.
Demo
Run Jaeger
$ docker run --rm --name jaeger -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest
2024/08/28 20:38:34 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
# ...
Run the example
$ cd examples/auto-sdk && go build -o $GOPATH/bin/example && $GOPATH/bin/example
outter-0...done
outter-1...
Run the auto-instrumentation
$ cd cli && go build
$ OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_GO_AUTO_TARGET_EXE=$GOPATH/bin/example \
OTEL_SERVICE_NAME=example \
sudo -E ./cli
{"level":"info","ts":1724885322.1967607,"logger":"go.opentelemetry.io/auto","caller":"cli/main.go:86","msg":"building OpenTelemetry Go instrumentation ...","globalImpl":false}
# ...
{"level":"info","ts":1724885324.8517134,"logger":"go.opentelemetry.io/auto","caller":"cli/main.go:115","msg":"instrumentation loaded successfully"}
You can also run with debug logging:
$ OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_GO_AUTO_TARGET_EXE=$GOPATH/bin/example \
OTEL_SERVICE_NAME=example \
sudo -E ./cli -log-level=debug
Let this run for a bit and then stop the example. Stopping the example while there is a span active means you will get an error. For example:
$ go build -o $GOPATH/bin/example && $GOPATH/bin/example
outter-0...done
outter-1...^Cdone
(notice the ^C is before the second done)
Review the span
Overview
Spans with recorded errors (via events)
Span links
Open Issues/Questions
- [ ] Only spans with a serialized size of at most `412` bytes are supported
  - Ways to increase eBPF storage past the stack limit (512 bytes) need to be investigated
  - When we know the span is going to be too big, we need to drop attributes, links, and events in userspace
- [x] Sampling needs to be implemented.
- [ ] Fix call to `bpf_probe_read`: https://github.com/open-telemetry/opentelemetry-go-instrumentation/actions/runs/10605550069/job/29394558469?pr=1045
- [x] Currently the `SpanEvent` start and end times are relative offsets to the eBPF process time. This is changed in this PR, thereby breaking all other probes.
- [ ] Do we want to use `ptrace` from the collector as the serialization format? Do we want to build our own?
- [ ] This adds more uses of `bpf_probe_write_user`. Can we use pinned eBPF maps to bypass this and communicate across processes?
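
One possible shape for the userspace dropping mentioned in the first item is sketched below. It is only an illustration of the idea: the `span` type, the `fit` helper, the JSON stand-in encoding, and the drop order (events, then links, then attributes, each dropped as a whole) are all assumptions and not a decision made in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxSize is the current maximum span serialization size.
const maxSize = 412

// span is a hypothetical, reduced stand-in for the serialized span data.
type span struct {
	Name   string            `json:"name"`
	Attrs  map[string]string `json:"attrs,omitempty"`
	Links  []string          `json:"links,omitempty"`
	Events []string          `json:"events,omitempty"`
}

// fit encodes s and, while the encoding exceeds limit, drops events,
// then links, then attributes. It reports false if the span still does
// not fit after everything optional has been dropped.
// Assumption: JSON stands in for the real proto encoding.
func fit(s span, limit int) ([]byte, bool) {
	for step := 0; ; step++ {
		buf, err := json.Marshal(s)
		if err != nil {
			return nil, false
		}
		if len(buf) <= limit {
			return buf, true
		}
		switch step {
		case 0:
			s.Events = nil
		case 1:
			s.Links = nil
		case 2:
			s.Attrs = nil
		default:
			return nil, false
		}
	}
}

func main() {
	s := span{Name: "outer", Events: make([]string, 300)}
	buf, ok := fit(s, maxSize)
	fmt.Println(len(buf), ok)
}
```

A finer-grained policy (dropping individual events or attributes rather than whole categories) would preserve more data, at the cost of re-encoding more often.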