libstapsdt icon indicating copy to clipboard operation
libstapsdt copied to clipboard

Support for memory-backed file descriptor

Open dalehamel opened this issue 5 years ago • 15 comments

This abstracts the creation of the temporary ELF file to allow for two possible approaches:

  • The current (default) approach of allocating a temporary file to /tmp and dlopen'ing it
  • A new (optional) approach, to use a memory-backed file descriptor, if LIBSTAPSDT_MEMORY_BACKED_FD is specified.

The new approach his beneficial for two reasons:

  • There is no need to clean up the file, it is automatically removed when the process dies (the primary reason for this approach).
  • No calls are made to the underlying filesystem, everything is done in memory. While tmpfs is probably already memory-based, this could be advantageous in some circumstances.

libusdt does something similar, where it just valloc's directly into the processes address space, so there is no need for a temporary file to clean up.

Overall, I find that doing tho whole thing in memory and having it be automatically cleaned up with the process's lifecycle is a bit more elegant.

The drawbacks to this approach are:

  • ~It is not yet supported by bcc, but I've submitted https://github.com/iovisor/bcc/pull/2314 to show how this could be utilized.~ Support has been added to BCC by https://github.com/iovisor/bcc/pull/2314
  • If a large number of providers are allocated, the process will have a large number of file descriptors open. This is probably not a problem in practice.
  • Support for memfd_create was not added until linux 3.17 (not a problem in practice, as BCC needs 4.1+ to do anything meaningful with eBPF) and glibc 2.27. The system call may not be defined on some systems.

I think the drawbacks are mitigated by hiding it behind a preprocessor macro check, as the behavior remains unchanged. In the future, hopefully once https://github.com/iovisor/bcc/pull/2314 is in a major bcc release, we could remove the macro and default to a memory-based shared object, falling back to temporary file if we detect that memfd_create is not supported.

Attaching a probe (global:helloworld) this way, the resulting process looks like:

/proc/${PID}/maps:

...
7f9e79f0c000-7f9e79f0d000 r-xp 00000000 00:05 11706507                   /memfd:libstapsdt:global (deleted)
7f9e79f0d000-7f9e7a10c000 ---p 00001000 00:05 11706507                   /memfd:libstapsdt:global (deleted)
7f9e7a10c000-7f9e7a10d000 rw-p 00000000 00:05 11706507                   /memfd:libstapsdt:global (deleted)
...

And if we check out the file descriptors for the process:

lrwx------ 1 dale.hamel dale.hamel 64 Apr 14 20:57 7 -> '/memfd:libstapsdt:global (deleted)'

So if we check for elf notes on that fd:

$readelf --notes /proc/${PID}/fd/7

Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000039       NT_STAPSDT (SystemTap probe descriptors)
    Provider: global
    Name: helloword
    Location: 0x0000000000000260, Base: 0x0000000000000318, Semaphore: 0x0000000000000000
    Arguments: 8@%rdi -8@%rsi

And if I run tplist -p

$ tplist -p ${PID} | grep global
/proc/12646/fd/7 global:hello_nsec

Or use my latest branch of bpftrace with wildcard USDT support:

$ bpftrace -l 'usdt:*:global:*' -p ${PID}
usdt:/proc/12646/fd/7:global:helloworld

I can attach to the probe by that path, or by PID (using my bcc branch)

If I disable the provider, the file descriptor is closed and the elf file is removed from the memory map.

dalehamel avatar Apr 15 '19 01:04 dalehamel

I've marked this as ready for review, as the dependent patch has landed in BCC.

dalehamel avatar Apr 16 '19 17:04 dalehamel

Awesome! I was not aware of memfd, and this seems like a much better approach compared to creating a file in the filesystem. libusdt doesn't have to write to a file first because Solaris/DTrace have an interface to register User Tracepoints.

IMO memfd should be the default, and we should fall back to tmpfs only if necessary. Also, it might be useful to have a way to choose which filesystem to use during runtime?

mmarchini avatar Apr 22 '19 20:04 mmarchini

@mmarchini great - i'll remove the macro then, and make it the default, falling back only if we can't find the necessary call.

Also, it might be useful to have a way to choose which filesystem to use during runtime?

to do this I'll need to add a new method signature for providerLoad probably, to force using /tmp files. If the original interface is called, it will go through the fallback path.

I'll modify this PR accordingly :)

dalehamel avatar Apr 22 '19 20:04 dalehamel

@dalehamel did you had time to look into this?

mmarchini avatar Oct 09 '19 20:10 mmarchini

@mmarchini yeah i should be able to get this PR revived, I got the necessary plumbing added to iovisor/bcc a few months back in https://github.com/iovisor/bcc/pull/2314 but never circled back to this.

One practical issue though is that bpftrace 0.9.2 that ships with ubuntu 19.10 is linked against an older version of bcc prior to my patch for reading elf nodes from a memory backed FD, so bpftrace users won't be able to benefit from this until a newer version is widely available.

For that reason I think both the current temporary file and memory-backed file will need to be supported in tandem for a while

dalehamel avatar Oct 09 '19 20:10 dalehamel

I've tried this PR on our internal Python wrapper and seems to work fine, It passes all our integration tests! 🎉

I was curious about performance and it seems to be the same as before. Intuitively I thought the performance would be better now. Just in case, will double check the simple benchmarks I wrote in the next couple of days

javierhonduco avatar Dec 31 '19 16:12 javierhonduco

@mmarchini ok sorry for the super long delay here, I have made the following changes:

  • The tests are run on both trusty and bionic. Trusty doesn't support memfd, so uses the old codepath (glibc too old). bionic does, and tests the new codepath.
  • Support for memfd is detected automatically at compile time, if available it's used by default
  • A new function is added to the public interface, to allow toggling the behavior of using memfd or no, on systems that support it.

I also found some memory leaks that I think I missed on the last pass / weren't running in the test suite before. I've fixed those - the valgrind test is really helpful :)

Can you please take another look?

dalehamel avatar Feb 03 '20 17:02 dalehamel

Great! I might take a few weeks to get to it though. Just got back from vacation and I'm on a few conferences now, so my backlog is quite huge. Will try to prioritize it though, as it's a nice addition :)

mmarchini avatar Feb 06 '20 03:02 mmarchini

Great! I might take a few weeks to get to it though. Just got back from vacation and I'm on a few conferences now, so my backlog is quite huge

No worries @mmarchini hope you had a restful vacation!

dalehamel avatar Feb 06 '20 15:02 dalehamel

Hi @mmarchini a friendly ping, in case this has fallen off your radar and you can find some time to take a look

dalehamel avatar Jun 26 '20 04:06 dalehamel

Hi! It would be great if this PR could be merged. At my workplace we currently have this custom patch applied but it would be great if we don't depend on custom patches 😄

javierhonduco avatar Sep 24 '21 14:09 javierhonduco

hi @mmarchini i'll take a look at this shortly, need to refresh my own memory on this old PR but thank you for all your feedback.

dalehamel avatar Oct 18 '21 15:10 dalehamel

@mmarchini I believe I've addressed your two points, please take a quick look at the two patches and then I'll happily re-file the PR against the new repo, otherwise feel free to cherry pick the changes in.

Last but not least, I do wonder if this should result in libstapsdt2, because it's technically a breaking change (more from the POV of the consumers than the programs running it though, since now older bcc/bpftrace versions won't be able to load the probes).

Yeah I suppose that is true, but I would guess if they are running an up-to-date libstapsdt, they probably have also updated their bcc since the change to support this has landed. I think that this could justify a new semantic version of the package regardless, though I'm not sure what the current practice for versioning this library is.

dalehamel avatar Oct 18 '21 18:10 dalehamel

@dalehamel thank you! The patches lgtm. I'll cherry-pick, no need to open a new PR (I'll also rename MemFD_Option_t to MemFDOption_t for consistency while cherry picking).

mmarchini avatar Oct 18 '21 23:10 mmarchini

Landed: https://github.com/linux-usdt/libstapsdt/commit/0d53f987b0787362fd9c16a93cdad2c273d809fc

mmarchini avatar Oct 25 '21 04:10 mmarchini