(driver-bpf) possible race conditions in bpf_fget in rare network heavy workloads?
Describe the bug
It seems that when the `socket` system call is enabled in the eBPF kernel driver, an unlucky Falco run can sometimes leave sockets unclosed in the kernel. Such side effects have only been observed in some apps with a very high number of network connections (>4M). The bug also appears to be a "heisenbug" - we do not have reproducible cases. When it did happen and was observed, the count of unclosed sockets ramped up quickly right after start-up, as if something were not being released, suggesting reference-counting issues with the file descriptor table.

I have been reading through some general kernel docs (https://www.kernel.org/doc/Documentation/filesystems/files.txt) and don't understand how they align with the eBPF instrumentation, for example when accessing the fdtable via `bpf_probe_read`.

Could it be that something in the BPF code occasionally takes too long, or that something is not being released, causing ref-count issues? I don't understand how the `bpf_fget` function locks the fdtable, and whether it unlocks or releases its copy of the fdtable, especially compared to the Falco kernel module.
How to reproduce it
The bug is a "heisenbug", making it hard to find what could occasionally be going wrong in apps with a very high number of network connections.
Expected behaviour
No adverse side effects when `socket` system call tracing is enabled, even when the eBPF driver is under heavy network load.
Additional context
A generic internet search surfaces a few bcc issues that are remotely related to accessing the fdtable from eBPF; however, they are probably not very useful for this particular issue.
https://github.com/iovisor/bcc/issues/237 https://github.com/iovisor/bcc/issues/2538
/area driver-bpf
Hi @incertum nice to see you again :hand: Uhm, this issue is quite strange as you said. I don't understand how the `socket` syscall could be involved in this kind of issue, let me explain why...
As you can see here, after starting the capture a mock `socket` syscall is called by the userspace in order to "calibrate the socket"... This logic is a bit unusual: the idea is to save the pointer to the socket file operations the first time the `socket` syscall is called, so that we can tell whether the `fd`s used by other network syscalls really are sockets or not.

Here you can see the logic for the `socket` syscall, while here you can see the function that other network syscalls use to make sure the `fd` is a socket.

So, coming back to our issue: the first time a `socket` syscall is called, the BPF instrumentation saves the pointer to the socket file operations, but it performs this logic only that first time, so the `fdtable` is used only in this first call. All other times, the `socket` syscall uses just the syscall parameters found in the registers and the return value. According to what I just said, I don't think that our `socket` instrumentation could cause kernel resource leakage. I suspect that the problem is in userspace instead, or maybe in some other place in our instrumentation, but I'm not sure about this; we need more information to understand it, for example:
- Have you ever experienced this problem with the kernel module? (This could help us understand whether the problem is in userspace or not)
- Which Falco version is in question? Have you ever tried previous Falco versions?
The goal was to stress test Falco with eBPF (`driver-bpf`) under extreme conditions on an older legacy test server. We succeeded in surfacing kernel monitoring implications and convoluted interdependencies, which should not come as a surprise given the complexity of the Linux kernel.
Summary of side effects:
Large spikes in packet discards at tool start-up and/or a rapid increase or decrease in currently established TCP connections. The latter is a heisenbug, hard to trigger and only seen under high load such as over 4M TCP connections.
Resolution:
- Not a Falco issue (issues also triggered with non-Falco eBPF)
- Not an inherent eBPF issue (issues also triggered without eBPF)
- A kernel settings and optimization issue (issues triggered without the tool)
Teamed up with maintainers to run a series of systematic tests. Thanks again so much everyone for your support ❤️
- `scap-open` tests with varying levels of eBPF instrumentation. New capabilities will be added to the project's `scap-open` example; they will be very useful for future performance tests.
- Non-Falco simple eBPF instrumentation as a sanity cross-check. Thanks to @Andreagit97 for open-sourcing https://github.com/Andreagit97/BPF-perf-tests
  - `page_faults_1` - simple eBPF instrumentation, no data sent to userspace - never triggered issues
  - `page_faults_2` - simple eBPF instrumentation, sending data to userspace - triggered issues
Additional Insights:
- Symptoms and root cause can be very unrelated, as seen on our lucky test server
- Turning a security monitoring tool on and seeing issues that disappear when the tool is turned off can lead to wrong conclusions. Those issues may not be directly related to the tool's implementation itself or even its technology stack.
- Run the most simplistic "kernel monitoring" instrumentation that is not the tool under suspicion but uses the same technology stack -> in our case `page_faults_2` triggered the issues
- Explore alternative ways to trigger the issues without the monitoring tool
- Crucial to possible interference from eBPF kernel monitoring tools seems to be sending data to userspace (the more data, the more pronounced the interference can be), effectively making the tool capable of causing issues on the host just like any other app. We therefore challenge the blanket claim that "eBPF is safe": it depends. We certainly showed that a server can be brought down through an eBPF-based security monitoring instrumentation that consumes events in userspace.
- We made some interesting observations: a more minimal userspace consumer such as the `scap-open` example made eBPF invocations higher / more effective and helped trigger the issues more often. Future performance studies are needed to surface better insights, and modern BPF may help address current limitations.
@Andreagit97 this issue can be closed.
Hey @incertum thank you very much for all these details, I'm sure that this issue will be very useful for other members of the community! I will close this, thank you again for the effort! :rocket: