OTEP: Process Context: Sharing Resource Attributes with External Readers
Changes
External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings.
When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Why open as draft: ~~I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review.~~
This OTEP is based on Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler, big thanks to everyone that provided feedback and helped refine the idea so far.
- [ ] Related issues #
- [x] Related OTEP(s) #212, #239
- [x] Links to the prototypes (when adding or changing features)
- [ ]
CHANGELOG.mdfile updated for non-trivial changes - [ ] Spec compliance matrix updated if necessary
Marking as ready for review!
So this would be a new requirement for eBPF profiler implementations?
My issue is the lack of safe support for Erlang/Elixir to do this. While something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in a third party, or implement ourselves, library that is a NIF to make these calls and that brings in instability many would rather not have when the goal of our SDK is to not be able to bring down a users program if the SDk crashes -- unless they specifically configure it to do so.
So this would be a new requirement for eBPF profiler implementations?
No, hard requirement should not be the goal: for starters, this is Linux-only (for now), so right off the gate this means it's not going to be available everywhere.
Having this discussion is exactly why it was included as one of the open questions in the doc :+1:
Our view is that we should go for recommended to implement and recommended to enable by default.
In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience.
For others, such as Erlang/Elixir, Java 8-21 (requires a native library, similar to Erlang/Elixir), the goal would be to make it very easy to enable/use for users that want it, but still optional so as to not impact anyone that is not interested.
We should probably record the above guidance on the OTEP, if/once we're happy with it :thinking:
cc @open-telemetry/specs-entities-approvers for extra eyes
This PR was marked stale due to lack of activity. It will be closed in 7 days.
In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when PR_SET_VMA_ANON_NAME is not available:
- Deterministic address generation scheme (fast check based on address pattern). Relies on
MAP_FIXED_NOREPLACE[Kernel 4.17+] (see here for something based on a similar premise I did a long time ago) memfd_create
Example code for the latter (may need more investigation but it looks promising and should be widely available?).
$ ./memfd
PID: 52591 MMAP at: 0x7fd86f5d5000
Forking...
Child PID: 52592
OTEL reader: OTEL publisher data
7fd86f5d5000-7fd86f5d7000 rw-p 00000000 00:01 6178 /memfd:OTELCTX (deleted)
EDIT: Using memfd_create and an inline payload, also allows a reader process to (optionally) mmap the target region into its own address space. See updated example code here.
$ ./memfd-mmap
[Writer] PID: 98784 FD: 3 MMAP at: 0x7fe176430000
Forking...
[Reader] PID: 98785 FD_NUM: 3
[Reader] process_vm_readv: OTEL publisher data
[Reader] mmap: OTEL publisher data
[parent]
7fe176430000-7fe176432000 rw-s 00000000 00:01 6187 /memfd:OTELCTX (deleted)
[child]
7fe176430000-7fe176432000 r--s 00000000 00:01 6187 /memfd:OTELCTX (deleted)
[Reader] mmap: OTEL publisher data
[Reader] mmap: OTEL publisher data
[Writer] exit
$ [Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] exit
In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when
PR_SET_VMA_ANON_NAMEis not available:
memfd_createExample code for the latter (may need more investigation but it looks promising and should be widely available?).
👋 So funny thing you mention memfd 😀. My colleagues at Datadog actually have previously built something close to what this OTEP proposes, using memfd, although in a slightly different way than your gist (of notice, not using mmap together with memfd).
The main reasons why we moved away from it for this OTEP were:
- We were concerned that custom seccomp profiles for containery things can/could block memfd
- Dealing with forks; although maybe with the "mmap over memfd" that's would maybe no longer be an issue (?)
- Dealing with reading: We actually used /proc/pid/fd to find the context, not /proc/pid/maps
- Dealing with updates: again since we didn't use "mmap over memfd" updating was different
I think both approaches are quite similar, especially when involving mmap. (E.g. I suspect we could as well mmap the region into the reader with the current approach in the OTEP. I do think we'd need some kind of cleanup mechanism to detect when the owner of a mapping has gone away, as otherwise I suspect the reader will keep the mapping alive?).
So in a way it's more what combination of blocks do we want to use (or mix) 🤔👋:
- Start from an anonymous mapping or start from a memfd? or mix/fallback from one to another?
- Use mmap or not
- How to find the context: name in maps? property of pages? look at fds? (or combination/fallback)
- Recreate to update or mutate in place
- If mutating in place, how exactly does that mutation work
- Dealing with some of the concurrency/forks/some of the other details
We at Datadog spent some time exploring the solution space, and tried to come up with a combination of the above that seemed reasonable, given the constraints (and tried to document that as well).
But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅
But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅
I think we can come up with something that's flexible but also remains simple for the simple use-cases. The main advantage of memfd_create is that we won't need a page perm RO or different page property fallback as it's available on all kernel versions we care about. Secondarily, it allows for easy mmap in a reader process (alternative ways to do that are an actual file on the filesystem which is tricky with containers or shared memory of some sort, whether SystemV / POSIX).
To support wakeups instead of polling, without hooking prctl, we can use eventfd or even futex (these are also not tied to memfd_create, can be used with any of the other options we discussed).
I think that the scheme we end up with in OTel should at least meet the following three criteria:
- As simple as it gets
- Doesn't recreate the mapping on every update (allows readers to cache mapping address and skip
/procafter context is first established) - Doesn't rely on hooking
prctlfor one-to-many reader wakeups (but also, doesn't require polling)
Based on all the options we laid out, I think that's doable.
Optionally (we probably need to expand the scope to thread context to figure out requirements / pick through the following):
- Allow for reader to
mmapthe region (also means mapping needs to beMAP_SHAREDinstead ofMAP_PRIVATE[2]) - Inline payload updates (if we allow
mmapthis becomes a requirement) - Allow for variable (high/low) frequency of updates
[1] Regarding forking, there's a race condition between mmap and madvise(MADV_DONTFORK) which may infrequently manifest, as we don't control the code running inside the publisher process (meaning, we can not avoid a fork taking place after our mmap and before our madvise). However, if we start with a MAP_PRIVATE mapping and only write the fixed header after madvise, we can guarantee that the inherited mapping will never pass verification in a forked process and thus will be skipped by readers.
[2] If we allow mmap, we need to use a MAP_SHARED mapping. AFAIK it's not possible in Linux to start with a private mapping, call madvise, switch to a shared mapping and have madvise take effect for the latter mapping (unmapping the first mapping will destroy the VMA that madvise affected). Instead, we can add a PID field to the fixed header that readers can use during verification to skip the mapping.
Update for extra context: @ivoanjo and me had a Zoom sync today where we talked about simplifying the current proposal by:
- Defaulting to
memfd_createand havingPR_SET_VMA_ANON_NAMEas a fallback (@ivoanjo discovered thatprctlcan still be used with memfd mappings which means we can have a consistent name through both approaches) - Removing the search for the mapping based on read-only status
- Removing the need to flip mapping between read-only and read-write
- Keeping the mapping fixed in memory instead of recreating it on each update resulting in different address
- For lock-free updates that maintain payload consistency, use a counter scheme. To avoid introducing an extra field just for the counter, use the existing timestamp (TODO: unixtime isn't strictly monotonic but this shouldn't affect the scheme).
- Keep
prctlas a notification method (it's not strictly needed on the part of the readers which can choose to ignore it and poll at their own frequency)
For future discussion:
- Clarify if we need/want to allow reader
mmap(introduces inline payload requirement) - Clarify if we need/want alternative one-to-many userspace-only notification mechanism (e.g. eventfd)
I've pushed https://github.com/open-telemetry/opentelemetry-specification/pull/4719/commits/3caecfba222bdec728b6e7363800f27547c3cf8d with the changes described/discussed with @christos68k above.
I'm preparing a PR to update the reference C/C++ implementation to match this change; I'll share that one shortly.
The update to the reference C/C++ implementation is in https://github.com/open-telemetry/sig-profiling/pull/34 .
As a final quick note I + possibly other folks are going to be out for holidays the next few weeks, so expect discussions to slow down for a bit until we're back in full force in January!