[DISCUSSION] New `base_syscalls.exclude_enter_exit_set` config

Open incertum opened this issue 8 months ago • 7 comments

Motivation

The hardware landscape is evolving towards models with 96, 128, or more CPUs. However, Falco currently faces usability challenges on such machines, particularly those dealing with heavy traffic, especially in network and file-related activities.

One potential solution could involve allowing end users to specify a subset of enter or exit syscall events they want to drop on the kernel side. This feature would be flagged as very risky to use, similar to the existing base_syscalls feature.

For instance, users might opt to drop enter syscall events for open* and connect syscalls, even though they are aware that doing so could expose them to TOCTOU attacks (mitigated by default via this PR). Nevertheless, this trade-off might be preferable to completely disabling Falco.

Feature

Introduce a new config base_syscalls.exclude_enter_exit_set, allowing exclusion of specific enter or exit events that are part of the custom_set syscalls. This exclusion is limited to scenarios where it makes sense for enter or exit events. Ensure good documentation.
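To make the proposal concrete, here is a minimal sketch of how such an exclusion set could be resolved against the `custom_set` syscalls. The `<syscall>:enter` / `<syscall>:exit` entry format, the function names, and the validation rules are all assumptions made for illustration; the issue does not define a schema, and this is not Falco code.

```python
# Hypothetical sketch: the entry format and the checks below are assumptions,
# not an existing Falco config schema or API.
from typing import Iterable, Set, Tuple

ENTER, EXIT = "enter", "exit"


def parse_entry(entry: str) -> Tuple[str, str]:
    """Split an entry such as 'connect:enter' into ('connect', 'enter')."""
    name, sep, direction = entry.partition(":")
    if not sep or not name or direction not in (ENTER, EXIT):
        raise ValueError(f"expected '<syscall>:enter' or '<syscall>:exit', got {entry!r}")
    return name, direction


def enabled_events(custom_set: Iterable[str],
                   exclude_enter_exit_set: Iterable[str]) -> Set[Tuple[str, str]]:
    """Return the (syscall, direction) pairs that would still be captured."""
    custom = list(custom_set)
    excluded = {parse_entry(e) for e in exclude_enter_exit_set}

    for name, _ in excluded:
        # Assumed rule: only syscalls that are part of custom_set can be excluded.
        if name not in custom:
            raise ValueError(f"{name!r} is not part of custom_set")
    for name in custom:
        # Assumed rule: excluding both directions equals dropping the syscall,
        # which belongs in custom_set itself.
        if (name, ENTER) in excluded and (name, EXIT) in excluded:
            raise ValueError(f"remove {name!r} from custom_set instead of excluding both events")

    all_events = {(name, d) for name in custom for d in (ENTER, EXIT)}
    return all_events - excluded


if __name__ == "__main__":
    kept = enabled_events(["open", "connect", "execve"], ["open:enter", "connect:enter"])
    for name, direction in sorted(kept):
        print(name, direction)
```

In this sketch, dropping the enter events for `open` and `connect` leaves their exit events and both `execve` events in place, which mirrors the trade-off described in the motivation.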

Additional context

https://github.com/falcosecurity/libs/issues/1557

CC @falcosecurity/libs-maintainers

incertum avatar Dec 09 '23 22:12 incertum

@stevenbrz let's see if the other maintainers are on board. If yes, it could be a great "warm up" contribution for you to take on :wink:

incertum avatar Dec 09 '23 22:12 incertum

Yes, Falco doesn't scale on these huge servers, and we need a way to mitigate that. One idea could be:

  1. Adapt our sinsp state so that it is populated only by exit events; enter events are needed only to mitigate TOCTOU or on old kernel versions.
  2. Once sinsp can reconstruct the state from exit events alone, we can disable all enter events, informing our users that this turns Falco into a best-effort detection mode that could be vulnerable to some attacks. I would prefer to remove all enter events to reduce complexity, instead of having a sort of simple consumer just for enter events :exploding_head:. This point will halve our kernel events, which is already a great result.
  3. With event throughputs of 20 million/s, the previous point is not enough: we would get down to 10 million/s, but Falco still cannot handle that, so we need a sort of hash table in the drivers to filter exit events. My idea would be to expose an API in sinsp that allows different filters (on the comm, on the exepath, on the cmdline, ...). These filters are evaluated in userspace when we read the event from next(); if we have a match, we add the pid of this process to the hash table used by the drivers, so the following events are excluded kernel side. Of course, we need to evaluate how many filters we can process, because it could be quite heavy. Moreover, I would avoid filtering clone/execve/proc_exit events, we have already seen these don't cause perf overhead and we need them to keep a reliable process tree inside sinsp.
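The feedback loop in points 2 and 3 can be sketched as a small simulation. Everything here is hypothetical: `Driver`, `Event`, and `run` are illustrative stand-ins, not sinsp or driver APIs, and the real hash table would live in the kernel drivers rather than in a Python set.

```python
# Hypothetical simulation of the idea above; not sinsp or driver code.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

# Events the proposal never filters, so the process tree stays reliable.
ALWAYS_KEEP = {"clone", "execve", "proc_exit"}


@dataclass(frozen=True)
class Event:
    pid: int
    name: str       # e.g. "open", "connect", "execve"
    direction: str  # "enter" or "exit"
    comm: str


class Driver:
    """Stand-in for the kernel side: holds the pid table and decides what to emit."""

    def __init__(self) -> None:
        self.excluded_pids: Set[int] = set()

    def should_emit(self, ev: Event) -> bool:
        if ev.direction == "enter":
            return False  # point 2: all enter events are disabled
        if ev.name in ALWAYS_KEEP:
            return True   # never filtered, regardless of the pid table
        return ev.pid not in self.excluded_pids  # point 3: pid-based filtering


def run(events: Iterable[Event],
        filters: List[Callable[[Event], bool]]) -> Tuple[List[Event], Set[int]]:
    """Userspace loop: read emitted events, evaluate filters, feed pids back."""
    driver = Driver()
    delivered: List[Event] = []
    for ev in events:
        if not driver.should_emit(ev):
            continue
        delivered.append(ev)
        if any(f(ev) for f in filters):
            # A match excludes this pid's following events on the "kernel" side.
            driver.excluded_pids.add(ev.pid)
    return delivered, driver.excluded_pids


if __name__ == "__main__":
    stream = [
        Event(1, "open", "enter", "nginx"),
        Event(1, "open", "exit", "nginx"),
        Event(1, "connect", "exit", "nginx"),
        Event(1, "execve", "exit", "nginx"),
        Event(2, "open", "exit", "bash"),
    ]
    delivered, excluded = run(stream, [lambda ev: ev.comm == "nginx"])
    print([(ev.pid, ev.name) for ev in delivered], excluded)
```

In this sketch the event that triggers a match is still delivered, because it has already left the kernel by the time the filter runs; only later events from that pid are dropped. The sketch does not handle removing pids from the table when a process exits, which a real design would presumably need to address.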

This is just an idea but maybe it could work

Andreagit97 avatar Dec 13 '23 10:12 Andreagit97

> Moreover, I would avoid filtering clone/execve/proc_exit events, we have already seen these don't cause perf overhead and we need them to keep a reliable process tree inside sinsp.

Big +1 those aren't an issue.

incertum avatar Dec 13 '23 16:12 incertum

I'm in support of this.

cccsss01 avatar Jan 13 '24 03:01 cccsss01

I'm in favor of investigating this front :+1:

leogr avatar Jan 15 '24 09:01 leogr

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Apr 14 '24 09:04 poiana

/remove-lifecycle stale

incertum avatar Apr 14 '24 19:04 incertum

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Jul 13 '24 21:07 poiana

/remove-lifecycle stale

Andreagit97 avatar Jul 15 '24 07:07 Andreagit97