[Feedback] Profiles premature optimization
There is a premature optimization in the profiles proto https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/profiles/v1development/profiles.proto#L53 and the whole ProfilesDictionary idea. The premature optimization is not that this dictionary exists, but that it is a global dictionary instead of a resource-level dictionary. The benefit is very minimal compared with the huge downside that merging multiple requests coming from multiple resources is impossible (or close to impossible).
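For concreteness, here is a minimal sketch (plain Go, not the real proto or any pdata API; all type and field names are made up) of what combining two requests entails when each carries its own request-wide dictionary: every sample index from the second request has to be rewritten against a rebuilt dictionary.

```go
package main

import "fmt"

// Hypothetical shape for illustration only: a request-wide string table plus
// samples that reference it by index.
type request struct {
	dict    []string  // request-wide dictionary (e.g. function names)
	samples [][]int32 // each sample is a list of indices into dict
}

// mergeRequests shows the work a global dictionary forces on anyone combining
// two requests: indices in every sample of the second request are rewritten
// against a new, combined dictionary.
func mergeRequests(a, b request) request {
	out := request{dict: append([]string(nil), a.dict...)}
	out.samples = append(out.samples, a.samples...)

	// Map each string of b's dictionary to its index in the merged dictionary.
	lookup := make(map[string]int32, len(out.dict))
	for i, s := range out.dict {
		lookup[s] = int32(i)
	}
	remap := make([]int32, len(b.dict))
	for i, s := range b.dict {
		idx, ok := lookup[s]
		if !ok {
			idx = int32(len(out.dict))
			out.dict = append(out.dict, s)
			lookup[s] = idx
		}
		remap[i] = idx
	}

	// Rewrite every sample of b using the remap table.
	for _, sample := range b.samples {
		rewritten := make([]int32, len(sample))
		for j, idx := range sample {
			rewritten[j] = remap[idx]
		}
		out.samples = append(out.samples, rewritten)
	}
	return out
}

func main() {
	a := request{dict: []string{"main", "malloc"}, samples: [][]int32{{0, 1}}}
	b := request{dict: []string{"malloc", "gc"}, samples: [][]int32{{0, 1}}}
	fmt.Println(mergeRequests(a, b)) // {[main malloc gc] [[0 1] [1 2]]}
}
```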
The motivation stems from here, and it is supporting host-level profiles where each resource corresponds to a container. Per-container profiles will have duplication such as kernel function names and system library function names, or even application symbol names if multiple replicas of the app run on the host.
cc @open-telemetry/profiling-maintainers
+1 to what @aalexand said. We're aware of the tradeoffs, but merging profiles from different requests is a less important use case to us than supporting container-level resource profiles.
We expect that merging only needs to be implemented once in the collector, and the complexity is acceptable there.
Let us know if this makes sense or if we should continue the discussion.
The motivation stems from https://github.com/open-telemetry/opentelemetry-proto/issues/628#issuecomment-2772566325, and it is supporting host-level profiles where each resource corresponds to a container. Per-container profiles will have duplication such as kernel function names and system library function names, or even application symbol names if multiple replicas of the app run on the host.
I understand, but I do believe the tradeoff does not have a good ROI, mostly because in a case like this (assume 200 containers on a host) the win of doing this manual compression versus relying on a general-purpose compression mechanism (snappy/zstd/etc.) does not make sense to me.
I understand, but I do believe the tradeoff does not have a good ROI, mostly because in a case (assume 200 containers on a host) the win of doing manual encryption vs relying on some encryption mechanism snappy/zstd/etc. does not make sense to me.
I assume you mean compression rather than encryption?
As far as potential efficiency arguments are concerned, could you elaborate a bit more? Wouldn't the case of having 200 containers on a host greatly benefit from a global dictionary rather than a resource-level one? I'd imagine that many/most of these containers would be using the same languages and runtimes, with shared symbols for stdlib and common libraries?
We're definitely happy to take your feedback into account, but right now it's a bit too vague to be actionable.
I think what @bogdandrutu is saying is that zstd and other general-purpose compression algorithms are really good at eliminating this duplication, and even with 200-fold repetition of dictionaries, after compression on the wire you will likely see a fairly small total size difference. I think that's fair, and it would be useful to see benchmarks that show this (or disprove it).
However, while that is true about wire sizes, it is not relevant for in-memory sizes. After you uncompress the payload in-memory you do end up with 200 duplicate dictionaries, with potentially a lot more memory usage.
It is also possible that a much larger uncompressed input to zstd/snappy/etc. causes an increase in compute even if the compressed output size ends up being similar.
I would really love to see some benchmarking done that demonstrates these aspects (wire size, in-memory usage, compress+serde compute) when changes are made to the proto for the purpose of optimization.
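As a starting point, the wire-size half of that experiment can be sketched with synthetic data. Stdlib gzip stands in for zstd/snappy, the fake function names stand in for real payloads, and the identical repeated table only approximates 200 similar containers, so treat the output as illustrative rather than a benchmark result.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSize gzips a payload and returns the resulting size in bytes.
func compressedSize(payload []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(payload)
	w.Close()
	return buf.Len()
}

func main() {
	// One synthetic "dictionary": 10,000 fake function names (~450 KB).
	var dict bytes.Buffer
	for i := 0; i < 10000; i++ {
		fmt.Fprintf(&dict, "github.com/example/app/pkg%02d.Function%05d\n", i%50, i)
	}
	shared := dict.Bytes()

	// Per-resource dictionaries approximated as the same table repeated 200
	// times, standing in for 200 containers running similar workloads.
	perResource := bytes.Repeat(shared, 200)

	fmt.Printf("uncompressed: shared=%d  per-resource=%d\n", len(shared), len(perResource))
	// Note: gzip's 32 KB window cannot deduplicate repeats this far apart, so
	// this understates what zstd with a large window would achieve; a real
	// benchmark should use zstd/snappy on actual profile payloads.
	fmt.Printf("gzip:         shared=%d  per-resource=%d\n",
		compressedSize(shared), compressedSize(perResource))
}
```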
Wouldn't the case of having 200 containers on a host greatly benefit from a global dictionary rather than a resource-level one?
How do you get to the point of producing this "global dictionary"? You will most likely pay more CPU to construct this out of multiple incoming requests than a highly optimized compression algorithm that has lots of smart engineers behind it.
I do feel that this is a hypothetical scenario, without a real good understanding of how we actually get to benefit from this. So I stand by my words that this is a premature optimization; most likely we will never be able to merge these maps across multiple sources (or very rarely), and it does not justify the complications when it comes to usability.
I assume you mean compression rather than encryption?
Thank you for fixing my words on this.
How do you get to the point of producing this "global dictionary"? You will most likely pay more CPU to construct this out of multiple incoming requests than a highly optimized compression algorithm that has lots of smart engineers behind it.
In-memory data is not compressed though. If a given machine runs 100 containers that are replicas of the same program and each container profile references 10000 functions with each function name being 100 bytes on average, the current approach seems like a reasonable way to store 1 MB of function names in memory instead of 100 MB. I wouldn't say it's hypothetical.
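Spelling out that arithmetic (hypothetical sizes from the example above, not a measurement):

```go
package main

import "fmt"

// Back-of-the-envelope numbers from the example above: 100 container
// replicas, 10,000 referenced functions each, ~100 bytes per function name.
func main() {
	const (
		containers   = 100
		functions    = 10000
		avgNameBytes = 100
	)
	perContainer := functions * avgNameBytes // ~1 MB for one dictionary
	duplicated := containers * perContainer  // ~100 MB if each resource repeats it
	fmt.Printf("single shared table:       ~%d MB\n", perContainer/1_000_000)
	fmt.Printf("per-resource duplication:  ~%d MB\n", duplicated/1_000_000)
}
```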
most likely we will never be able to merge these maps across multiple sources (or very rarely)
Can you clarify? It's not very clear what this inability to merge is about.
@bogdandrutu can you outline the specific use case you have in mind? Premature optimization can be a very subjective term when different stakeholders get together.
I think @aalexand and my perspective is biased by working on profilers and tools that analyze/visualize profiling data. I suspect you're biased towards collector use cases? Or something different?
Maybe you can join our next meeting on Thursday for a high bandwidth discussion?
The profiling SIG has spent a lot of time discussing the tradeoffs of leaving compression to general-purpose compression algorithms versus leveraging lookup tables and other domain-specific knowledge about data distributions. The general conclusion is that neither of the extreme approaches is good. You can't leave it all to gzip and friends, nor can you do it all in the wire format. There is obviously a lot of room to argue the details when it comes to balancing the tradeoffs, but I just want to reassure you that we're not ignorant of them and have had a lot of discussions about this already.
In-memory data is not compressed though. If a given machine runs 100 containers that are replicas of the same program and each container profile references 10000 functions with each function name being 100 bytes on average, the current approach seems like a reasonable way to store 1 MB of function names in memory instead of 100 MB. I wouldn't say it's hypothetical.
Maybe my knowledge about containers is wrong, but I do expect that you cannot share this map across different containers (unless you do SHM, etc.), so my question still stands: how do you get all of this data into one Linux process so that you benefit from this? And beyond how you get it, how much do you pay to build this map?
Can you clarify? It's not very clear what this inability to merge is about.
Can you please start by explaining a basic setup where you have a VM with 100 containers, and how you get all the data into one place so that you can build this map?
A profiler that uses something like Linux perf in system-wide mode profiles all processes and containers at once. https://github.com/open-telemetry/opentelemetry-ebpf-profiler is not per-cgroup either, I think. So the per-cgroup data would get compacted early.
Another example is processing the profiling data on the backend, e.g. offline symbolization. Decoding a profile that has duplicated per-cgroup strings will have the in-memory duplication I mentioned earlier.
@bogdandrutu as mentioned by @aalexand, a major use case is the ebpf profiler. We're happy to elaborate further on this. But I'd like to reiterate our requests for learning more about your use cases as well. How do you intend to build on the profiling signal? What profiling data source (profiler) are you planning to tap into?
@felixge I am coming from a usability perspective: this data is very hard to use by a backend or by any transformation service. Wearing my TC hat, I need to ensure that we are making the right tradeoffs of usability vs. some "hypothetical" performance win.
@bogdandrutu Are there any specific use cases you have in mind, where usability is being hurt by ProfilesData being a global dictionary?
As mentioned previously, https://github.com/open-telemetry/opentelemetry-proto/issues/628 covers the tradeoffs we discussed at length with the Specifications SIG and TC members, including our use cases for which profiles cover multiple containers at once.
We're aware that merging profiles from different requests becomes harder, and we deemed this a less important use case to support. Profiling data can be tens of megabytes in memory (with the current optimizations factored in), and it's unclear that there's any benefit in merging that data in-memory in the collector. In the eventuality that merging makes sense, the additional complexity for implementing merging is limited to a set of components in the Collector.
On the other hand, the OpenTelemetry eBPF Profiler is an important use case for us, since it is currently one of the main receivers that emits profiling data. It is a host profiler: it sees all processes on the host, whether they're containerized or not. You can see both https://github.com/open-telemetry/opentelemetry-proto/issues/628 and https://docs.google.com/document/d/1Ud2EQZMmFCYOhdSXW0VFHIdKb-Ltzd03bImZgIC1syM/edit, which cover at length the problems we're trying to solve.
The gist is that we want the profiles emitted by the OpenTelemetry eBPF profiler to define one resource per container ID, with unique sets of resource attributes. Without this, we're unable to integrate with processors such as the k8sattributesprocessor and enrich the profiling data with container-level attributes, service-level attributes, and so on. Check out the documents I linked above for more information.
I'll reiterate @felixge's suggestion - maybe you can join our next profiling SIG meeting on Thursday 2025/09/18 for a higher bandwidth discussion?
I am coming from a usability perspective: this data is very hard to use by a backend or by any transformation service.
Coming from the extensive Java ecosystem, where JFR has used a similar space-efficient file data encoding for decades, I do not find this to be the case. Only a small number of low-level libraries have to handle the encode/decode directly, to the extent that the format is not even part of the platform standard but considered an internal implementation detail. The vast majority of analytics tools never see it; they operate on a sample (event) centric API that masks the lookup tables entirely. You're only seeing a problem here because OTel doesn't yet have that abstraction layer to do the pointer indirection. That gap is more likely the real issue, not the complexity of the wire message format.
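To make the indirection point concrete, here is a hypothetical sketch of such a sample-centric accessor layer; none of these types are the actual OTel pdata API or JFR's, they only illustrate the pattern.

```go
package main

import "fmt"

// dictionary holds the shared lookup table; consumers never index it directly.
type dictionary struct {
	strings []string
}

// rawSample is the dictionary-encoded wire/in-memory form.
type rawSample struct {
	functionNameIdx []int32 // stack frames as indices into the dictionary
	value           int64
}

// sampleView is what an analysis tool would work with: resolved values only.
type sampleView struct {
	dict *dictionary
	raw  rawSample
}

func (s sampleView) Value() int64 { return s.raw.value }

// Frames resolves the indices on access, so callers never touch the table.
func (s sampleView) Frames() []string {
	out := make([]string, len(s.raw.functionNameIdx))
	for i, idx := range s.raw.functionNameIdx {
		out[i] = s.dict.strings[idx]
	}
	return out
}

func main() {
	dict := &dictionary{strings: []string{"main", "handler", "malloc"}}
	s := sampleView{dict: dict, raw: rawSample{functionNameIdx: []int32{0, 1, 2}, value: 10}}
	fmt.Println(s.Frames(), s.Value()) // [main handler malloc] 10
}
```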
@bogdandrutu Are there any specific use cases you have in mind, where usability is being hurt by ProfilesData being a global dictionary?
Any sort of transformation is almost impossible to do with this.
We're aware that merging profiles from different requests becomes harder, and we deemed this a less important use case to support. Profiling data can be tens of megabytes in memory (with the current optimizations factored in), and it's unclear that there's any benefit in merging that data in-memory in the collector. In the eventuality that merging makes sense, the additional complexity for implementing merging is limited to a set of components in the Collector.
First of all, we are not talking only about the Collector here, but in general about any processing to be done on this data. In the majority of the "worlds" that I am part of, we collect profiling data from one process at a time, so this optimization does not make any sense and only complicates things, because I have to make sure multiple processes are not included, etc. OTLP needs to be an efficient protocol for 99%+ of the use cases, not for the edge 1%, which is the only use case where I have heard this helps: the eBPF agent case.
Coming from the extensive Java ecosystem, where JFR has used a similar space-efficient file data encoding for decades, I do not find this to be the case. Only a small number of low-level libraries have to handle the encode/decode directly, to the extent that the format is not even part of the platform standard but considered an internal implementation detail. The vast majority of analytics tools never see it; they operate on a sample (event) centric API that masks the lookup tables entirely. You're only seeing a problem here because OTel doesn't yet have that abstraction layer to do the pointer indirection. That gap is more likely the real issue, not the complexity of the wire message format.
In JFR there is only one process producing pprof, so this cross-process optimization does not apply.
Any sort of transformation is almost impossible to do with this.
I'd still like to see a specific example here.
"This is a premature optimization" is a weaker claim compared to "This breaks use cases X and Y".
In JFR there is only one process producing pprof, so this cross-process optimization does not apply.
Sorry, I don't follow your argument there. Yes, JFR collects data from one process at a time, though not in pprof format. JFR, like the OTel format, uses dictionary encoding. Are you saying JFR suffers from premature optimization too and should not have done that? I think a lot of JFR users may disagree.
My take from this discussion, as someone who's tinkering with both the profiling SIG and the collector, is that the dictionary makes data manipulation in the collector rather harder. It's definitely a bit of a pain.
However, we also have a fair number of abstractions, whether it's Go helpers such as SetString and other similar helpers, or plain features of the collector such as OTTL (where we're obviously not exposing the dictionary). The pain points are for folks who have to handle those abstractions.
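As an illustration of the pattern behind such helpers (this is not the actual pdata API or its signatures, just a sketch of the idea): the caller passes a plain string and the helper hides the dictionary bookkeeping.

```go
package main

import "fmt"

// stringTable deduplicates strings and hands out stable indices.
type stringTable struct {
	values  []string
	indexOf map[string]int32
}

func (t *stringTable) getOrAdd(s string) int32 {
	if idx, ok := t.indexOf[s]; ok {
		return idx
	}
	idx := int32(len(t.values))
	t.values = append(t.values, s)
	t.indexOf[s] = idx
	return idx
}

// function stores its name as an index into the shared string table.
type function struct {
	nameIdx int32
}

// setName is the SetString-style helper: the index never leaks to the caller.
func (f *function) setName(dict *stringTable, name string) {
	f.nameIdx = dict.getOrAdd(name)
}

func main() {
	dict := &stringTable{indexOf: map[string]int32{}}
	var fn function
	fn.setName(dict, "runtime.mallocgc")
	fmt.Println(fn.nameIdx, dict.values) // 0 [runtime.mallocgc]
}
```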
To be fair, it has taken us a bit of time to set everything up (and we're not fully there yet, as for example, we're still unable to merge profiles).
At the same time, I don't think we have any actual data on whether the benefits of the dictionary, within the agent and in terms of bandwidth used, outweigh its downsides.
That being said, I have seen a couple of asks for something similar for other signals (here's one example), and I wouldn't be surprised if we started seeing more.
At this point in the implementation of profiling, and based on the fact that we are working around the complexity of the dictionary, I would therefore not go back on this decision.
For a proto v2, it would be nice to settle on a common way of doing things across all signals. Either a dictionary or none, but the same way everywhere.
At this point in the implementation of profiling, and based on the fact that we are working around the complexity of the dictionary, I would therefore not go back on this decision.
The proposal is not to stop using dictionaries, but to make them a resource-level thing rather than one that spans across resources.
Also, if we add dictionary support for other signals, we will do the same.
@bogdandrutu Are there any specific use cases you have in mind, where usability is being hurt by ProfilesData being a global dictionary?
Any sort of transformation is almost impossible to do with this.
Absolutely disagree.
If you look at the Arrow ecosystem and algorithms designed to better leverage SIMD instructions, sparse vectors and index access may actually lend themselves to more efficient (in-flight) transformation.
You do have to be willing to leave extra strings in the dictionary during transformation, but the performance tradeoff can be a huge win, or at least on par. It all depends on the use case, and there are other possible implementations as well.
I think it's fair to ask for benchmarking, but let's not make snap judgements on technological possibilities unless we've done our research.
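A toy example of what that looks like in practice (illustrative only, not Arrow or the OTel pdata API): a rewrite touches each dictionary entry once instead of once per sample, and dropped samples simply leave unused strings behind in the table.

```go
package main

import (
	"fmt"
	"strings"
)

// encodedProfile keeps samples dictionary-encoded during transformation.
type encodedProfile struct {
	dict    []string
	samples [][]int32 // frames as indices into dict
}

// redactPaths rewrites path-like entries once in the dictionary; every sample
// referencing those indices picks up the change for free.
func redactPaths(p *encodedProfile) {
	for i, s := range p.dict {
		if strings.Contains(s, "/home/") {
			p.dict[i] = "<redacted>"
		}
	}
}

// dropSamples filters samples without compacting the dictionary: unused
// strings stay behind, which is the tradeoff mentioned above.
func dropSamples(p *encodedProfile, keep func([]int32) bool) {
	kept := p.samples[:0]
	for _, s := range p.samples {
		if keep(s) {
			kept = append(kept, s)
		}
	}
	p.samples = kept
}

func main() {
	p := &encodedProfile{
		dict:    []string{"main", "/home/user/app.go", "malloc"},
		samples: [][]int32{{0, 1}, {0, 2}},
	}
	redactPaths(p)
	// Keep only samples whose leaf frame is "malloc" (index 2).
	dropSamples(p, func(s []int32) bool { return s[len(s)-1] == 2 })
	fmt.Println(p.dict, p.samples) // [main <redacted> malloc] [[0 2]]
}
```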
If you look at the Arrow ecosystem and algorithms designed to better leverage SIMD instructions, sparse vectors and index access may actually lend themselves to more efficient (in-flight) transformation.
You do have to be willing to leave extra strings in the dictionary during transformation, but the performance tradeoff can be a huge win, or at least on par. It all depends on the use case, and there are other possible implementations as well.
I think it's fair to ask for benchmarking, but let's not make snap judgements on technological possibilities unless we've done our research.
None of these exist in the current OTel ecosystem, and if we agree that they are needed, then we should do it for all the signals, not just for profiles.
Also, I am sorry that I am not an expert, so I cannot say whether they are better or worse, but I fully trust you since you seem very sure about it. My point still stands: if there are better ways and we really need them, then we should do it for the other signals as well.
Right now in the collector world, for example, we do lots of allocations when we call into OTTL (the OpenTelemetry Transformation Language, something that I think I know about) because we expect things to be in an "attributes map", so for every sample we allocate hundreds of objects to call into the transform processor.
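A hypothetical sketch of that materialization step (not the actual OTTL or pdata code; the types here are made up) to show where the per-sample allocations come from:

```go
package main

import "fmt"

// dictionary holds the shared strings referenced by index.
type dictionary struct {
	strings []string
}

// sample keeps its attributes dictionary-encoded.
type sample struct {
	attrKeyIdx []int32 // attribute keys as dictionary indices
	attrValIdx []int32 // attribute values as dictionary indices
}

// materialize builds the generic map a transform layer expects; each call
// allocates a map plus one entry per attribute, repeated for every sample.
func materialize(dict *dictionary, s sample) map[string]any {
	attrs := make(map[string]any, len(s.attrKeyIdx))
	for i, k := range s.attrKeyIdx {
		attrs[dict.strings[k]] = dict.strings[s.attrValIdx[i]]
	}
	return attrs
}

func main() {
	dict := &dictionary{strings: []string{"thread.name", "GC", "process.pid", "42"}}
	s := sample{attrKeyIdx: []int32{0, 2}, attrValIdx: []int32{1, 3}}

	// A transform written against maps never sees the indices, at the cost of
	// this per-sample materialization.
	fmt.Println(materialize(dict, s)) // map[process.pid:42 thread.name:GC]
}
```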