libstapsdt
libstapsdt copied to clipboard
Support for memory-backed file descriptor
This abstracts the creation of the temporary ELF file to allow for two possible approaches:
- The current (default) approach of allocating a temporary file to
/tmp
and dlopen'ing it - A new (optional) approach, to use a memory-backed file descriptor, if
LIBSTAPSDT_MEMORY_BACKED_FD
is specified.
The new approach his beneficial for two reasons:
- There is no need to clean up the file, it is automatically removed when the process dies (the primary reason for this approach).
- No calls are made to the underlying filesystem, everything is done in memory. While tmpfs is probably already memory-based, this could be advantageous in some circumstances.
libusdt does something similar, where it just valloc's directly into the processes address space, so there is no need for a temporary file to clean up.
Overall, I find that doing tho whole thing in memory and having it be automatically cleaned up with the process's lifecycle is a bit more elegant.
The drawbacks to this approach are:
- ~It is not yet supported by bcc, but I've submitted https://github.com/iovisor/bcc/pull/2314 to show how this could be utilized.~ Support has been added to BCC by https://github.com/iovisor/bcc/pull/2314
- If a large number of providers are allocated, the process will have a large number of file descriptors open. This is probably not a problem in practice.
- Support for
memfd_create
was not added until linux 3.17 (not a problem in practice, as BCC needs 4.1+ to do anything meaningful with eBPF) and glibc 2.27. The system call may not be defined on some systems.
I think the drawbacks are mitigated by hiding it behind a preprocessor macro check, as the behavior remains unchanged. In the future, hopefully once https://github.com/iovisor/bcc/pull/2314 is in a major bcc release, we could remove the macro and default to a memory-based shared object, falling back to temporary file if we detect that memfd_create
is not supported.
Attaching a probe (global:helloworld
) this way, the resulting process looks like:
/proc/${PID}/maps:
...
7f9e79f0c000-7f9e79f0d000 r-xp 00000000 00:05 11706507 /memfd:libstapsdt:global (deleted)
7f9e79f0d000-7f9e7a10c000 ---p 00001000 00:05 11706507 /memfd:libstapsdt:global (deleted)
7f9e7a10c000-7f9e7a10d000 rw-p 00000000 00:05 11706507 /memfd:libstapsdt:global (deleted)
...
And if we check out the file descriptors for the process:
lrwx------ 1 dale.hamel dale.hamel 64 Apr 14 20:57 7 -> '/memfd:libstapsdt:global (deleted)'
So if we check for elf notes on that fd:
$readelf --notes /proc/${PID}/fd/7
Displaying notes found in: .note.stapsdt
Owner Data size Description
stapsdt 0x00000039 NT_STAPSDT (SystemTap probe descriptors)
Provider: global
Name: helloword
Location: 0x0000000000000260, Base: 0x0000000000000318, Semaphore: 0x0000000000000000
Arguments: 8@%rdi -8@%rsi
And if I run tplist -p
$ tplist -p ${PID} | grep global
/proc/12646/fd/7 global:hello_nsec
Or use my latest branch of bpftrace with wildcard USDT support:
$ bpftrace -l 'usdt:*:global:*' -p ${PID}
usdt:/proc/12646/fd/7:global:helloworld
I can attach to the probe by that path, or by PID (using my bcc branch)
If I disable the provider, the file descriptor is closed and the elf file is removed from the memory map.
I've marked this as ready for review, as the dependent patch has landed in BCC.
Awesome! I was not aware of memfd
, and this seems like a much better approach compared to creating a file in the filesystem. libusdt
doesn't have to write to a file first because Solaris/DTrace have an interface to register User Tracepoints.
IMO memfd should be the default, and we should fall back to tmpfs only if necessary. Also, it might be useful to have a way to choose which filesystem to use during runtime?
@mmarchini great - i'll remove the macro then, and make it the default, falling back only if we can't find the necessary call.
Also, it might be useful to have a way to choose which filesystem to use during runtime?
to do this I'll need to add a new method signature for providerLoad probably, to force using /tmp files. If the original interface is called, it will go through the fallback path.
I'll modify this PR accordingly :)
@dalehamel did you had time to look into this?
@mmarchini yeah i should be able to get this PR revived, I got the necessary plumbing added to iovisor/bcc a few months back in https://github.com/iovisor/bcc/pull/2314 but never circled back to this.
One practical issue though is that bpftrace 0.9.2 that ships with ubuntu 19.10 is linked against an older version of bcc prior to my patch for reading elf nodes from a memory backed FD, so bpftrace users won't be able to benefit from this until a newer version is widely available.
For that reason I think both the current temporary file and memory-backed file will need to be supported in tandem for a while
I've tried this PR on our internal Python wrapper and seems to work fine, It passes all our integration tests! 🎉
I was curious about performance and it seems to be the same as before. Intuitively I thought the performance would be better now. Just in case, will double check the simple benchmarks I wrote in the next couple of days
@mmarchini ok sorry for the super long delay here, I have made the following changes:
- The tests are run on both trusty and bionic. Trusty doesn't support memfd, so uses the old codepath (glibc too old). bionic does, and tests the new codepath.
- Support for memfd is detected automatically at compile time, if available it's used by default
- A new function is added to the public interface, to allow toggling the behavior of using memfd or no, on systems that support it.
I also found some memory leaks that I think I missed on the last pass / weren't running in the test suite before. I've fixed those - the valgrind test is really helpful :)
Can you please take another look?
Great! I might take a few weeks to get to it though. Just got back from vacation and I'm on a few conferences now, so my backlog is quite huge. Will try to prioritize it though, as it's a nice addition :)
Great! I might take a few weeks to get to it though. Just got back from vacation and I'm on a few conferences now, so my backlog is quite huge
No worries @mmarchini hope you had a restful vacation!
Hi @mmarchini a friendly ping, in case this has fallen off your radar and you can find some time to take a look
Hi! It would be great if this PR could be merged. At my workplace we currently have this custom patch applied but it would be great if we don't depend on custom patches 😄
hi @mmarchini i'll take a look at this shortly, need to refresh my own memory on this old PR but thank you for all your feedback.
@mmarchini I believe I've addressed your two points, please take a quick look at the two patches and then I'll happily re-file the PR against the new repo, otherwise feel free to cherry pick the changes in.
Last but not least, I do wonder if this should result in libstapsdt2, because it's technically a breaking change (more from the POV of the consumers than the programs running it though, since now older bcc/bpftrace versions won't be able to load the probes).
Yeah I suppose that is true, but I would guess if they are running an up-to-date libstapsdt, they probably have also updated their bcc since the change to support this has landed. I think that this could justify a new semantic version of the package regardless, though I'm not sure what the current practice for versioning this library is.
@dalehamel thank you! The patches lgtm. I'll cherry-pick, no need to open a new PR (I'll also rename MemFD_Option_t
to MemFDOption_t
for consistency while cherry picking).
Landed: https://github.com/linux-usdt/libstapsdt/commit/0d53f987b0787362fd9c16a93cdad2c273d809fc