ebpf
ebpf copied to clipboard
Enable cleanup of tracefs entries (kprobe_events)
As raised by @etherandrius: https://cilium.slack.com/archives/CKC55JL8L/p1618310807065500.
Currently, the group prefix for entries in kprobe_events is hardcoded to ebpf_, resulting in identifiers like ebpf_aabbccdd. This might prove inflexible in some environments where programs are frequently SIGKILLed and stale entries are left behind in <tracefs>/kprobe_events.
The main use case would be enabling cleanup: allowing the caller to evict entries from kprobe_events as they see fit, without having to resort to removing all entries starting with ebpf_, possibly disrupting other kprobe tools written using this library.
@lmb suggested using filepath.Base(os.Args[0]) + os.Getpid() for the group name, which is sufficiently identifiable, and the PID could eventually be correlated using logs etc.
A few caveats:
- Group names allow up to 63 characters, so the name of the binary would need to be truncated for safety.
- PIDs (integers) have a variable-length string representation, so they would need to be padded to a static length.
- The library currently makes the assumption that creating multiple identical probes will succeed because the group identifier is random on each attempt. When using the (static) PID instead, repeated kprobe creations will use the same group and name, preventing multiple attachments to the same symbol from the same program. We're no longer tracking Trace Events as separate entities that can have multiple perf events created from them, so this means we'll always need a random component in the group name, and PID won't do.
While there are drawbacks to all approaches (maybe someone should just backport kprobe PMU to 4.x kernels instead... :sweat_smile:), which one should we pick?
- Kprobe options:
link.Kprobe("printk", prog, &LinkOptions{TraceFSPrefix: "foo"}) - Package-level prefix override:
link.TraceFSPrefix(prefix string)(this will mess with tests) - No override, but always use the truncated, sanitized name of the current binary instead of
ebpf. This will bite in various other ways, e.g. some users naming their binariesappwhen dropping them in containers. - Any other ideas?
Additionally, should a cleanup method be provided? Something like link.DeleteTraceFSPrefix("foo") that removes all probes containing the given prefix? This breaks our abstraction, but we need to provide this if we ever plan on changing the group/prefix format.
Hey there 👋🏻
Not sure how much help this is, but a few things to keep in mind:
-
if the process that loaded the eBPF programs gets killed, it won't be able to properly clean up
kprobe_events, and the entries might eventually add up to 65k (kernel limit -> see here). There should probably be a mechanism to clean up existing entries of a predefined pattern to prevent this from happening. Cleaning 65k entries at once takes a while and is definitely best avoided. -
(If the PID is chosen as the random part of the name) PIDs may be reused, if the random component is deleted, the library should allow entries to be reused (skip the write in
kprobe_events, and jump directly to theperf_event_opensyscall with the existing kprobe ID). -
if a custom prefix is provided, it should be cleaned up to avoid invalid characters (a simple
[a-zA-Z0-9]regex should be enough).
Let me know if I can do anything else to help !
@Gui774ume Thanks for the feedback! Some follow-up questions:
Cleaning 65k entries at once takes a while and is definitely best avoided.
What's the alternative to 65k entries at once? :sweat_smile: The way I see it, a cleanup operation should be executed once at startup, not after being OOMkilled for a few hours.
The way I see it, a cleanup operation should be executed once at startup, not after being OOMkilled for a few hours.
Yup that's it. Before adding anything to kprobe_events we should check if existing entries should be deleted first. The way we currently do it is by making sure that the pid associated to the entry is still active. If not, we delete the entry. Once this clean up is done, our lib moves on to registering new entries 😃
This might help: https://github.com/DataDog/ebpf/blob/master/manager/manager.go#L1458
(everything we talked about should also apply to uprobe_events btw)
by making sure that the pid associated to the entry is still active
Note, this was recently discovered to be an issue when PID namespaces are in-use. This is because the PID as part of the event name is from the context of whoever created that event, which may be in a different PID namespace from the current process using this library.
Thanks for the update. I actually think that this is a pretty tricky problem to fix, so I'll remove the good first issue label.
There is a workaround now, so I think we can close this issue.