
(driver-bpf) possible race conditions in bpf_fget in rare network heavy workloads?

Open incertum opened this issue 1 year ago • 2 comments

Describe the bug

It seems that when the socket system call is enabled in the eBPF kernel driver, sockets sometimes fail to close in the kernel during an unlucky Falco run. These side effects have only been observed in some apps with a very high number of network connections (>4M). The bug also appears to be a "heisenbug" - we have no reproducible cases. When it happened and was observed, the count of unclosed sockets ramped up quickly right after start-up, as if something were not being released, possibly causing reference counting issues with the file descriptor table?

I have been reading through some general kernel docs (https://www.kernel.org/doc/Documentation/filesystems/files.txt) and don't understand how they align with the eBPF instrumentation, for example when accessing the fdtable via bpf_probe_read.

Could it be that something in the BPF code occasionally takes too long, or that something is not being released, causing refcount issues? I don't understand how the bpf_fget function handles locking of the fdtable - whether it unlocks or releases its copy of the fdtable - especially compared with what the Falco kernel module does.

How to reproduce it

The bug is a "heisenbug", making it hard to pinpoint what could occasionally be going wrong in apps with a very high number of network connections.

Expected behaviour

No adverse side effects when socket system call tracing is enabled, even when the eBPF driver is under heavy network load.

Additional context

A generic internet search surfaces a few bcc issues that are remotely related to accessing the fdtable from eBPF. However, I doubt they are very useful for this particular issue.

https://github.com/iovisor/bcc/issues/237 https://github.com/iovisor/bcc/issues/2538

incertum avatar Jul 09 '22 00:07 incertum

/area driver-bpf

incertum avatar Jul 09 '22 00:07 incertum

Hi @incertum, nice to see you again :hand: As you said, this issue is quite strange; I don't understand how the socket syscall could be involved in this kind of issue. Let me explain why...

As you can see here, after starting the capture, userspace issues a mock socket syscall in order to "calibrate the socket". This logic may look strange: the idea is to save the pointer to the socket file operations the first time the socket syscall is called, so that we can later tell whether the fds used by other network syscalls are really sockets or not.

Here you can see the logic for the socket syscall, while here you can see the function that other network syscalls use to make sure the fd is a socket.

So, coming back to our issue: the first time a socket syscall is called, the BPF instrumentation saves the pointer to the socket file operations, but it performs this logic only once, so the fdtable is used only in that first call. On every other call, the socket syscall instrumentation uses just the syscall parameters found in the registers and the return value. For this reason, I don't think our socket instrumentation could cause kernel resource leakage. I suspect the problem is in userspace instead, or maybe in some other place in our instrumentation, but I'm not sure; we need more information to understand it, for example:

  • Have you ever experienced this problem with the kernel module? (This could help us understand whether the problem is in userspace or not.)
  • Which Falco version is in question? Have you ever tried previous Falco versions?

Andreagit97 avatar Jul 10 '22 10:07 Andreagit97

The goal was to stress-test Falco with eBPF (driver-bpf) under extreme conditions on an older legacy test server. We succeeded in surfacing kernel monitoring implications and convoluted interdependencies, which should not come as a surprise given the complexity of the Linux kernel.

Summary of side effects:

Large spikes in packet discards at tool start-up, and/or a rapid increase or decrease in currently established TCP connections. The latter is a heisenbug: hard to trigger, and only triggered under high load, such as over 4M TCP connections.

Resolution:

  • Not a Falco issue (issues also triggered with non-Falco eBPF)
  • Not an inherent eBPF issue (issues also triggered without eBPF)
  • A kernel settings and optimization issue (issues also triggered without any tool)

Teamed up with maintainers to run a series of systematic tests. Thanks again so much everyone for your support ❤️

  • scap-open tests with varying levels of eBPF instrumentation. New capabilities will be added to the project's scap-open example; they will be very useful for future performance tests.
  • Non-Falco, simple eBPF instrumentation as a sanity cross-check. Thanks to @Andreagit97 for open-sourcing https://github.com/Andreagit97/BPF-perf-tests
    • page_faults_1 - simple eBPF instrumentation, no data sent to userspace - never triggered the issues
    • page_faults_2 - simple eBPF instrumentation, sending data to userspace - triggered the issues

Additional Insights:

  • Symptoms and root cause can be very unrelated, as seen on our lucky test server
  • Turning a security monitoring tool on and seeing issues that disappear when the tool is turned off can lead to wrong conclusions: there is a possibility that those issues are not directly related to the tool's implementation, or even to its technology stack.
    • Run the most simplistic "kernel monitoring" instrumentation that is not the tool under suspicion but uses the same technology stack -> in our case page_faults_2 triggered the issues
    • Explore alternative ways to trigger the issues without the monitoring tool
  • Crucial to possible interference from eBPF kernel monitoring tools seems to be sending data to userspace (the more data, the more pronounced the interference can be), effectively making the tool capable of causing issues on the host just like any other app can. We therefore challenge "eBPF is safe": it depends. We certainly showed that a server can be brought down by an eBPF-based security monitoring instrumentation that consumes events in userspace.
  • We also made the interesting observation that more minimal userspace instrumentation, such as the scap-open example, made eBPF invocations higher / more effective and helped trigger the issues more often. Future performance studies are needed to surface better insights, and modern BPF may help address current limitations.

@Andreagit97 this issue can be closed.

incertum avatar Aug 25 '22 21:08 incertum

Hey @incertum, thank you very much for all these details; I'm sure this issue will be very useful for other members of the community! I will close it, thank you again for the effort! :rocket:

Andreagit97 avatar Aug 26 '22 07:08 Andreagit97