otel-profiling-agent icon indicating copy to clipboard operation
otel-profiling-agent copied to clipboard

Restructure of symbols information caching

Open florianl opened this issue 10 months ago • 1 comments

At the moment the reporter package caches symbols information. This can lead to various race conditions, like https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/250, https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/248 or https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/235. For statefule protocols, caching at reporting level, can be helpfull and reduce data. But with the switch to a stateless protocol with OTel profiling, this caching can become the source of multiple issues and additional complexity.

Suggestion

Instead of caching in the reporter package the symbols information should be cached within the scope of ProcessManager. ProcessManager is responsible to resolve symbol information anyway and is also aware of the lifetime of a process. Caching at ProcessManager level therefore allows to only report accurate symbols information for frames and resolve the symbol information if the information is not available in cache. Consequently it allows to simplify the reporter package as the reported information is always available and complete.

florianl avatar Mar 07 '25 11:03 florianl

TraceReporter provides multiple independent options to report a trace.

type TraceReporter interface {
	// ReportFramesForTrace accepts a trace with the corresponding frames
	// and caches this information before a periodic reporting to the backend.
	ReportFramesForTrace(trace *libpf.Trace)

	// ReportCountForTrace accepts a hash of a trace with a corresponding count and
	// caches this information before a periodic reporting to the backend.
	ReportCountForTrace(traceHash libpf.TraceHash, count uint16, meta *samples.TraceEventMeta)

	// ReportTraceEvent accepts a trace event (trace metadata with frames and counts)
	// and caches it for reporting to the backend. It returns true if the event was
	// enqueued for reporting, and false if the event was ignored.
	ReportTraceEvent(trace *libpf.Trace, meta *samples.TraceEventMeta)

	// SupportsReportTraceEvent returns true if the reporter supports reporting trace events
	// via ReportTraceEvent().
	SupportsReportTraceEvent() bool
}

Source: TraceReporter

In a first step I suggest to simplify the above interface to the following:

type TraceReporter interface {
	// ReportTraceEvent accepts a trace event (trace metadata with frames and counts)
	// and caches it for reporting to the backend. It returns true if the event was
	// enqueued for reporting, and false if the event was ignored.
	ReportTraceEvent(trace *libpf.Trace, meta *samples.TraceEventMeta)
}

This reduction of the API reduces complexity when handling with the TraceReporter interface. To avoid calling m.traceProcessor.ConvertTrace(bpfTrace) for every trace, the result of this operation should still be cached in the scope of tracehandler. But instead of reporting counts and frames/metadata independently, reporter should receive always the full information. This not only simplifies package tracehandler but also package reporter as incoming data is channeled and always complete.

Especially with on CPU profiling, where there is a high variance in leaf frames, the expected overhead is negligible as m.traceProcessor.ConvertTrace(bpfTrace) is called any way due to the variance in the leaf frames.

florianl avatar Mar 13 '25 12:03 florianl

Currently working on a plan how to refactor the core infrastructure for this.

fabled avatar Jul 19 '25 19:07 fabled