Rework the trace pipeline towards statelessness
Problem
Our trace processing pipeline is currently engineered towards a backend that keeps the information it receives around forever in a bunch of places. Once information has been sent, it often won't be sent again until the agent is restarted. This is problematic for two reasons:
- The new OTLP protocol is stateless and requires us to send completely self-contained packets of data. The OTLP reporter currently works around this by keeping all the info in LRUs. This kind of works, but we run into issues of permanently missing symbols and executable names when the LRU starts evicting. It also costs a lot of memory.
- Stateful reporter implementations with backends that phase out old data after a fixed period will run into issues with data that got evicted but will never be resent unless the profiling agent is restarted periodically.
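The failure mode described above can be reproduced in a few lines. The following is only a sketch: `frameID`, `boundedCache` and `sendOnceProducer` are hypothetical names invented for illustration, and the cache uses simple FIFO eviction as a stand-in for the real LRUs.

```go
package main

import "fmt"

// frameID is a hypothetical stand-in for whatever key identifies a frame.
type frameID uint64

// boundedCache is a deliberately tiny FIFO-evicting cache standing in for
// the reporter-side LRU. It assumes capacity > 0.
type boundedCache struct {
	capacity int
	order    []frameID
	symbols  map[frameID]string
}

func newBoundedCache(capacity int) *boundedCache {
	return &boundedCache{capacity: capacity, symbols: make(map[frameID]string)}
}

// put stores a symbol, evicting the oldest entry when the cache is full.
func (c *boundedCache) put(id frameID, symbol string) {
	if _, ok := c.symbols[id]; !ok {
		if len(c.order) == c.capacity {
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.symbols, oldest)
		}
		c.order = append(c.order, id)
	}
	c.symbols[id] = symbol
}

func (c *boundedCache) get(id frameID) (string, bool) {
	s, ok := c.symbols[id]
	return s, ok
}

// sendOnceProducer mimics a component that reports each symbol exactly once
// per agent lifetime.
type sendOnceProducer struct {
	sent map[frameID]bool
}

// report forwards the symbol to the cache only the first time it sees id.
func (p *sendOnceProducer) report(c *boundedCache, id frameID, symbol string) {
	if p.sent[id] {
		return // suppressed: the producer believes the consumer already has it
	}
	p.sent[id] = true
	c.put(id, symbol)
}

func main() {
	cache := newBoundedCache(2)
	producer := &sendOnceProducer{sent: make(map[frameID]bool)}

	producer.report(cache, 1, "PyEval_EvalFrame")
	producer.report(cache, 2, "do_syscall_64")
	producer.report(cache, 3, "main.run")         // evicts frame 1
	producer.report(cache, 1, "PyEval_EvalFrame") // suppressed, never resent

	_, ok := cache.get(1)
	fmt.Println("frame 1 resolvable:", ok)
}
```

Once frame 1 has been evicted, nothing in the pipeline is able to bring it back, because the producer's "already sent" state outlives the consumer's copy.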
Affected information
The following information is currently prone to falling out of LRU without a chance of it ever being resent:
- Interpreter and kernel frame symbols
- Each interpreter handler currently implements a domain-specific approach to ensuring that frame info is sent just once for the lifetime of the interpreter process (e.g. Python)
- Kernel frame resends are suppressed with an LRU without an expiry
- Executable info
- Sent only once when the executable is first seen, never resent
Rough outline of a solution
We need to rework the whole trace pipeline to ensure that all of this information is available all the time. There are two possible paths that we can pursue here:
- Make all components resend all information all the time.
- Rework the pipeline to be query-based instead: if the reporter needs an executable name, it would go to the process manager and ask for it. Probably nice efficiency-wise, but results in ugly circular dependencies.
- Probably more options that I didn't think of when creating this issue
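To make the query-based path a bit more concrete, here is a rough sketch. All names (`fileID`, `executableLookup`, `processManager`, `reporter`) are hypothetical and do not mirror the agent's actual types. It assumes the circular dependency can be softened by having the reporter depend on a narrow lookup interface rather than on the process manager itself.

```go
package main

import "fmt"

// fileID is a hypothetical key identifying an executable.
type fileID uint64

// executableLookup is a hypothetical, narrow interface: the reporter depends
// on this instead of on the concrete process manager.
type executableLookup interface {
	ExecutableName(id fileID) (string, bool)
}

// processManager stands in for the component that knows about mapped
// executables.
type processManager struct {
	executables map[fileID]string
}

func (pm *processManager) ExecutableName(id fileID) (string, bool) {
	name, ok := pm.executables[id]
	return name, ok
}

// reporter keeps no executable metadata of its own; it asks when it needs it.
type reporter struct {
	lookup executableLookup
}

// describe builds a self-contained record by querying metadata on demand.
func (r *reporter) describe(id fileID) string {
	name, ok := r.lookup.ExecutableName(id)
	if !ok {
		name = "<unknown>"
	}
	return fmt.Sprintf("file=%#x executable=%s", uint64(id), name)
}

func main() {
	pm := &processManager{executables: map[fileID]string{0x2a: "python3.11"}}
	r := &reporter{lookup: pm}

	fmt.Println(r.describe(0x2a))
	fmt.Println(r.describe(0x2b))
}
```

The reporter never has to remember what it has already sent, but every outgoing packet now costs a lookup, and the process manager must keep the metadata alive for as long as traces can reference it.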
We can probably get rid of tracehandler entirely. The caches that it maintains
will likely go away and the remaining few lines can be merged directly into
Tracer.
Sub-issues
- [x] Investigate possible solutions (see comment)
- [ ] #171 for discussion about reporter and interpreter changes
- [ ] Make reporting of executable metadata stateless
- [ ] #181 (Make reporting of kernel symbols stateless)
The most important point of this issue is that some of the symbols are retrieved/reported only once per agent lifetime. This can even be problematic with a stateful backend, if data is removed, manually or via automatic data retention policies.
With a stateless protocol like the OTEL protocol, the issue becomes even more pronounced. The agent core has been developed with a stateful protocol/backend in mind, so the switch to the stateless OTEL protocol requires changes with regard to caching (mostly of symbols).
Possibly the most important change is to move the caching of symbols out of the agent core and into the Reporter implementation, which then decides on caching details and resending.
Consequently, the Reporter interface needs to be amended (as well as the agent core).
Possible solutions
- The agent core always passes frame symbols to the reporter with every stack trace. The downside would be increased CPU usage for creating arrays of symbols, even when they are not needed.
- The agent core provides a function that returns symbols, which the reporter calls if needed. The downside is an ugly call/dependency recursion.
- The Reporter interface provides a function that allows the agent core to ask whether symbols for a given frame are needed. The downside is that this function needs to be called very often (one call per frame).
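As a starting point for discussion, the last option could look roughly like the sketch below. This is not the PoC itself: `symbolReporter`, `NeedsSymbols`, `ReportSymbols` and `cachingReporter` are hypothetical names, and a plain map stands in for whatever cache the reporter would really use.

```go
package main

import "fmt"

// frameID is a hypothetical key identifying a frame.
type frameID uint64

// symbolReporter is a hypothetical slice of the Reporter interface: the agent
// core asks before doing any symbolization work.
type symbolReporter interface {
	NeedsSymbols(id frameID) bool
	ReportSymbols(id frameID, function string)
}

// cachingReporter owns the caching policy; the agent core keeps no state.
type cachingReporter struct {
	known map[frameID]string
}

func (r *cachingReporter) NeedsSymbols(id frameID) bool {
	_, ok := r.known[id]
	return !ok
}

func (r *cachingReporter) ReportSymbols(id frameID, function string) {
	r.known[id] = function
}

// forget simulates the reporter evicting or expiring an entry.
func (r *cachingReporter) forget(id frameID) {
	delete(r.known, id)
}

// symbolizeTrace plays the role of the agent core. It makes one NeedsSymbols
// call per frame and returns how many frames it actually had to symbolize.
func symbolizeTrace(r symbolReporter, frames []frameID, symbolize func(frameID) string) int {
	lookups := 0
	for _, id := range frames {
		if !r.NeedsSymbols(id) {
			continue
		}
		r.ReportSymbols(id, symbolize(id))
		lookups++
	}
	return lookups
}

func main() {
	r := &cachingReporter{known: make(map[frameID]string)}
	symbolize := func(id frameID) string { return fmt.Sprintf("func_%d", id) }
	trace := []frameID{1, 2, 1}

	fmt.Println("first pass:", symbolizeTrace(r, trace, symbolize))
	fmt.Println("second pass:", symbolizeTrace(r, trace, symbolize))

	r.forget(2) // the reporter dropped frame 2, so the core is asked again
	fmt.Println("after eviction:", symbolizeTrace(r, trace, symbolize))
}
```

The point of the design is that an eviction on the reporter side automatically leads to a resend, at the price of one interface call per frame on the hot path; that per-frame cost is what the benchmarks would need to quantify.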
@fabled is working on a PoC PR that implements the third option, for further discussion and benchmarking.
Additional required work
- Kernel modules are recognized only at agent startup. How can we parse their symbols lazily?
As far as I can tell, this has been addressed (see tasklist). Closing.