OOM on physical servers
Describe the bug
On 0.34.x releases we experience a memory leak on physical instances, while the same setup on AWS is fine. It could be due to node workload, but it is still clearly a memory leak.
As of now the root cause has not been identified:
- looking for help to do some memory profiling or to debug the issue
- has anyone seen similar behaviour?
How to reproduce it
This is a somewhat customised deployment (not Helm, etc.).
This is the config Falco is given (we do use more rules, but the problem happens with only the upstream ones, i.e. the rules from the rules repo):
data:
  falco.yaml: |
    rules_file:
      - /etc/falco-upstream/falco_rules.yaml
      - /etc/falco/rules.d
    plugins:
      - name: json
        library_path: libjson.so
        init_config: ""
    load_plugins: []
    watch_config_files: true
    time_format_iso_8601: false
    json_include_output_property: true
    json_include_tags_property: true
    json_output: true
    log_stderr: true
    log_syslog: false
    # "alert", "critical", "error", "warning", "notice", "info", "debug".
    log_level: error
    libs_logger:
      enabled: false
      severity: debug # "info", "debug", "trace".
    priority: warning
    buffered_outputs: false
    syscall_buf_size_preset: 4
    syscall_event_drops:
      threshold: 0.1
      actions:
        - log
      rate: 0.03333
      max_burst: 1
      simulate_drops: false
    syscall_event_timeouts:
      max_consecutives: 1000
    webserver:
      enabled: true
      k8s_healthz_endpoint: /healthz
      listen_port: 64765
      ssl_enabled: false
      ssl_certificate: /volterra/secrets/identity/server.crt
      threadiness: 0
      #k8s_audit_endpoint: /k8s-audit
    output_timeout: 2000
    outputs:
      rate: 1
      max_burst: 1000
    syslog_output:
      enabled: false
    file_output:
      enabled: false
      keep_alive: false
      filename: ./events.txt
    stdout_output:
      enabled: true
    program_output:
      enabled: false
      keep_alive: false
      program: "jq '{text: .output}' | curl -d @- -X POST https://hooks.slack.com/services/XXX"
    http_output:
      enabled: true
      url: "http://falco-sidekick.monitoring.svc.cluster.local:64801/"
      user_agent: falcosecurity/falco
    grpc:
      enabled: false
      bind_address: unix:///run/falco/falco.sock
      threadiness: 0
    grpc_output:
      enabled: false
    metadata_download:
      max_mb: 100
      chunk_wait_us: 1000
      watch_freq_sec: 1
    modern_bpf:
      cpus_for_each_syscall_buffer: 2
Expected behaviour
Memory should drop at regular intervals (no unbounded growth).
Screenshots
Cloud instances of Falco on AWS (OK behaviour; the screenshot is, IMO, from a 0.33.x version):

Instances on physical servers (OOM, on 0.34.1; the nodes in the cluster are exactly the same, though only 2 of 4 are affected by the memory increase, which could be due to specific workload. Surprisingly, the same metric does not match the pattern from the AWS/GCP nodes above):

Environment
K8s, Falco in a container; physical server, under load
- Falco version:
{"default_driver_version":"4.0.0+driver","driver_api_version":"3.0.0","driver_schema_version":"2.0.0","engine_version":"16","falco_version":"0.34.1","libs_version":"0.10.4","plugin_api_version":"2.0.0"}
- System info:
{
"machine": "x86_64",
"nodename": "master-1",
"release": "4.18.0-240.10.1.ves1.el7.x86_64",
"sysname": "Linux",
"version": "#1 SMP Tue Mar 30 15:02:49 UTC 2021"
}
- Cloud provider or hardware configuration:
- OS: /etc/os-release is not relevant; it's basically CentOS but customised
- Kernel:
root@master-1:/# uname -a
Linux master-1 4.18.0-240.10.1.ves1.el7.x86_64 #1 SMP Tue Mar 30 15:02:49 UTC 2021 x86_64 GNU/Linux
- Installation method: K8s, custom manifests, described in an older issue here: https://github.com/falcosecurity/falco/issues/1909#issuecomment-1195153675
Hey @epcim, thank you for reporting this! Of course, I don't have the right answer here, but I have some questions that could help us dig into it:
- Which engine are you using: bpf, kernel module, or modern bpf? From your config I can see that you are not using plugins, is that true?
- If I understood correctly, you are experiencing the memory leak only on physical servers, while with the same/similar setup Falco works well on AWS. Have you ever seen this kind of issue with previous Falco versions (0.33.0, for example)?
- You said that you are able to reproduce the issue with just the default ruleset. It would help us a lot if you were able to find a minimal set of rules that reproduces the issue. The ruleset is huge and involves many syscalls, so working on a minimal set would speed up the troubleshooting.
Ha, that I forgot to mention, @Andreagit97:
We experience the issue with the kernel module on the production env, and it was reproduced with both the kernel module and bpf on the test envs.
We haven't tried modern bpf.
This is the image change (upstream):
-FROM falcosecurity/falco-no-driver:0.32.2
+FROM falcosecurity/falco-no-driver:0.33.1
- I do believe the issue has been there since 0.33.0; before that it is hard to say, as it is not visible in metrics (the container could be killed early). From metrics (from the dev env), a 1y view:
The interesting part is this: memory hiked significantly.

This part corresponds to 0.32.2 -> 0.33.1, on 29.11.2022:

More detail even shows it was quite OK for 9 days and then started to hike:

Cross-verified with the prod env: the version landed on 6.12.2022, but the issue was first visible on 24.1.2023 (it appears on 1 node out of dozens), and it has the same pattern:

For comparison, the period 12/2022 - 04/2023 with all nodes on prod: it is hard to read/identify the issue just from the metrics, pods are killed early, etc. Later, in January, we increased the memory limit for the pod and the memory hike started to be recognisable in the metrics:

- Minimise the ruleset: yes, that's what I am thinking of now as the next step.
- Another question is which thread internally could live for such a long time (from my side it is visible that the issue appears on nodes that are frequently accessed over SSH, so shell sessions and started Docker containers are the syscalls I would blame first).
- At best, if you could assess what changes happened in the codebase between 0.32.2 -> 0.33.1.
Thank you, this is really interesting info that could help us in the troubleshooting! We will start to take a look; keep us updated if you find something new!
Surprisingly, there were not many changes in the rules since October 2022. The new version is on the right side, while the left is my November version (which is basically 1:1 with October).
These are the rules that changed, as far as I know (ignoring tags, lists, new binaries added to proc name lists, etc., and everything removed):

(and similar)

These are not in my current code base that has the issue:

Hey, thank you for the update! The bad news is that, since the underlying Falco code has changed, the OOM issue could also be caused by an already existing rule :/ so unfortunately we cannot restrict the investigation scope.
/milestone 0.35.0
@jasondellaluce will you try to simulate the issue on your side and collect metrics before 0.35?
@epcim This issue is hard to reproduce on our side, and I think deeper testing on this specific path will not happen before 0.35. However, we're testing the latest dev Falco also with tools like Valgrind and in the most common deployment scenarios, so my suggestion will be to try out 0.35 once it's out and see if the issue still occurs. It's hard to tell if the issue is caused by your rule setup, your workload, or by a mix of the two. The most likely thing is that this could be happening within libsinsp, and that very specific workloads force the library to grow its internal state unbounded. This will definitely require further investigation.
Same ever-increasing memory consumption with 0.35 (upgraded from 0.33), but our Falco setup is a bit different from the one described in this issue. Deployed as a systemd unit on a VM (own hosts, so no cloud stuff), syscalls disabled (--disable-source syscall), only the k8saudit/json plugins enabled, default syscall rules removed, grpc/http/syslog output enabled.
Can't make a memory dump, because falco claims 132G virt (VM has 6GB RAM and 30GB disk.... no idea why it needs this much virt) and it seems a memory dump is trying to write 132G to disk, which obviously fails on a 30GB disk.
@sboschman do you also reproduce this kind of memory usage when running Falco for syscalls collection, without plugins?
@jasondellaluce we do not run falco with syscalls collection enabled at all, so not a use-case I can comment on.
/milestone 0.36.0
@epcim would you be in a position to re-run some tests with eBPF and the libbpf stats kernel setting enabled, using Falco's new experimental native metrics? I'm asking because I would be curious to see if spikes in memory correlate with surges in event rates (both at the tracepoints, aka the libbpf stats, and in userspace, which obviously depends on the syscalls you enable). Please feel free to anonymize logs and/or share an anonymized version on Slack in a DM. What we unfortunately don't yet have in the metrics feature are the detailed syscall counters and some other internal-state-related stats we aim to add for the next Falco release.
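For reference, here is a minimal falco.yaml sketch of what such a test run could look like. The key names are my assumption based on the 0.35 experimental metrics feature, so please double-check them against the falco.yaml shipped with your release; note also that the libbpf stats additionally require the kernel.bpf_stats_enabled sysctl to be set to 1 on the host:

metrics:
  enabled: true
  interval: 1h                         # emission interval of the internal metrics snapshot (illustrative value)
  output_rule: true                    # emit the snapshot through the standard Falco outputs
  resource_utilization_enabled: true   # CPU / memory usage of the Falco process
  kernel_event_counters_enabled: true  # kernel-side event and drop counters
  libbpf_stats_enabled: true           # per-tracepoint run counts/time, eBPF drivers only
  convert_memory_to_mb: true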
In addition @epcim could we get more information around the cgroups version on these machines? Memory counting in the kernel can in many cases be just wrong. For example see https://github.com/kubernetes-sigs/kind/issues/421 and I have also heard rumors about cgroups leaking memory. cgroups v2 has superior memory management, hence would be curious to know which cgroups version you are dealing with?
Plus, you also have an on-host deployment; would you mind getting me up to speed on the exact memory metrics you base the OOM on for those cases (aka the non container_memory_working_set_bytes cases)? Apologies if you posted that already above and I just couldn't read everything.
Thanks in advance!
I'm seeing the same behavior using Falco 0.35.1. Running on AKS, a single Node using Ubuntu 22.04. Falco deployed using Helm with the default rule set. Pod memory is constantly increasing, the cluster is practically idle (no other workloads running). falco.yaml:
falco:
  log_level: debug
  syscall_event_drops:
    rate: 1000000000
    max_burst: 1000000000
  json_output: true
  json_include_output_property: true
  file_output:
    keep_alive: false
    enabled: false
  grpc_output:
    enabled: true
  grpc:
    enabled: true
driver:
  enabled: true
  kind: modern-bpf
Just out of curiosity, is this particular host running cgroups v1 or cgroups v2? Thank you!
We will further investigate the cgroups-related memory metrics the OOM killer uses, and also @sboschman's use case where the binary is only used for k8saudit log filtering, meaning that in that scenario most of the libs code is not used (no kernel driver, no sinsp state, no container engine, basically no allocations, etc.).
Edit:
And maybe also show the RSS memory metric over time.
@incertum the host is running cgroups v2:
# stat -fc %T /sys/fs/cgroup/
cgroup2fs
I am experimenting with the effect of the rules configuration on this. It seems that disabling all rules doesn't reproduce the issue, so I'm trying to understand if I can isolate it to a specific rule or rules.
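In case it helps with the bisecting, here is a sketch of how a single upstream rule can be switched off from a locally loaded override file; the rule name is just an example, and the exact override syntax may differ between Falco versions, so treat this as an assumption to verify against your release:

# falco_rules.local.yaml, loaded after falco_rules.yaml via rules_file
- rule: Read sensitive file untrusted
  enabled: false

If I recall correctly, Falco's -D <substring> command-line option can achieve a similar effect without editing files.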
Hi @emilgelman, thanks, it's great news that you have cgroups v2. By the way, we now also have the base_syscalls config in falco.yaml for radical control over syscalls monitoring; check it out.
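For anyone following along, here is a minimal sketch of that option; the syscall list is purely illustrative, and the exact semantics (especially of repair) should be checked against the falco.yaml of your version:

base_syscalls:
  custom_set: [clone, clone3, fork, vfork, execve, execveat, close]  # explicitly chosen syscalls (illustrative)
  repair: false  # when true, Falco tries to compute the minimal state-relevant syscall set itself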
However, I think here we need to investigate in different places more drastically (meaning going back to the drawing board) as it has also been reported for plugins only. In that case we merely do event filtering in libsinsp, so most of the libsinsp complexity does not apply which kind of narrows down the search space.
I am going to prioritize looking 👀 into it; it will likely take some time.
In addition, in case you are curious to learn more about the underlying libs and kernel drivers with respect to memory:
- Yes we do build up a process cache table in libsinsp, but we also hook into the scheduler process exit tracepoint to purge items from the table again, else the memory would skyrocket in no time.
- The same applies for the container engine, therefore I suspect it must be something much more subtle while still being event driven.
- Then there is the discussion around absolute memory usage regardless of drifts over time. For example, we learned the hard way that the new eBPF ring buffer wrongly accounts memory twice; check out our conversation on the kernel mailing list. Adjusting parameters such as syscall_buf_size_preset and modern_bpf.cpus_for_each_syscall_buffer can help. Again, this is just some more insight, a bit unrelated to the fact that we are investigating subtle drifts over time in this issue. I am also still hoping to one day meet someone who knows all the answers regarding Linux kernel memory management and accounting; often it's not even clear what the right metric is and whether the metric is accounting memory in a meaningful way.
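As an illustration of the two knobs mentioned above (both already appear in the config at the top of this issue); the values here are purely illustrative and need to be sized against available memory and CPU count:

syscall_buf_size_preset: 4          # preset index selecting the per-buffer size; a larger preset means a larger buffer
modern_bpf:
  cpus_for_each_syscall_buffer: 2   # modern eBPF only: number of CPUs sharing one ring buffer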
Simulated a noisy Falco config on my developer Linux box. Enabling most supported syscalls was sufficient to simulate memory issues:
- rule: test
  desc: test
  condition: evt.type!=close
  enabled: true
  output: '%evt.type %evt.num %proc.aname[5] %proc.name %proc.tty %proc.exepath %fd.name'
  priority: NOTICE
Using valgrind massif heap profiler:
sudo insmod driver/falco.ko
sudo valgrind --tool=massif \
userspace/falco/falco -c ../../falco.yaml -r ../../falco_rules_test.yaml > /tmp/out
massif-visualizer massif.out.$PID
Reading the tbb API docs https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Concurrent_Queue_Classes.html, we use the following variant: "... By default, a concurrent_bounded_queue is unbounded. It may hold any number of values, until memory runs out. ..." Currently we do not set a safety capacity, or better, expose it as a parameter.
Here is a staging branch to correct this: https://github.com/incertum/falco/tree/queue-capacity-outputs, what do you all think?
However, the root cause is rather that the entire event flow is too slow: basically, we don't get to pop from the queue in time in these extreme cases, because we are seeing timeouts and also noticed heavy kernel-side drops. The pipe is just not holding up when trying to monitor so many syscalls, even on a more or less idle laptop. I would suggest we re-audit the entire Falco processing and outputs engine and look for areas of improvement, because when I did the same profiling with the libs sinsp-example binary, memory and output logs were pretty stable over time ...
Reading the tbb API docs https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Concurrent_Queue_Classes.html, we use the following variant
"... By default, a concurrent_bounded_queue is unbounded. It may hold any number of values, until memory runs out. ..." and currently we do not set a safety capacity, or better, expose it as a parameter.
The rationale for an unbounded queue was that the output consumer must be responsive enough to accept all the alerts produced by Falco. When the output consumer is too slow, a dedicated watchdog emits an error message on stderr to notify the user that the configured output consumer is too slow or blocked. By design, this is the only case in which the queue can grow indefinitely.
Otherwise, if the memory is growing but the queue is not, there might be just an implementation bug. Have you checked that? :thinking:
Thanks @leogr, all of the above is true. And for everyone reading this: unbounded queues can be a good choice, and are more efficient anyway, if you have other controls in place beforehand.
The queue filling up is one very likely cause for memory growth in real-life production. At the same time there can always be more bugs in other places. Using the heap profiler on my laptop added enough overhead / slowness to show these symptoms when having that one noisy Falco rule. Have yet to get deeper into profiling.
My current recommendations:
Here I would expose a queue capacity to the end user and add a default value. Mark it as "experimental" so we can remove it again should we find much better ways of handling heavy event pipes in future Falco releases.
We still need to discuss the recovery strategy:
- exit, aka a self-imposed OOM kill?
- swap the queue, aka drop, hoping we recover from some outlier bursts? I suspect that if we get here we are likely already dropping heavily on the kernel side because of a full kernel->userspace buffer (meaning Falco is already not working), and there are existing notifications, as @leogr highlighted. I would try this option.
Sadly, none of this is a solution to get Falco to work on such heavier production servers or workload types. I am opening a new ticket to discuss a re-audit of the Falco-specific outputs handling: https://github.com/falcosecurity/falco/issues/2691. The pragmatic expected outcome is that perhaps we can improve things; however, I doubt all problems will magically disappear, because we can't scale horizontally (throw more compute at the problem, as is typically done in, for example, big data stream processing). In fact, folks want a security tool that consumes almost no CPU and memory, yet never drops events.
Considering Falco's primary use of alerting on abnormal behavior, I expect that smarter, more advanced anomaly detection approaches could be more promising, so that we avoid having to deal with bursty outputs in the first place, but maybe I am biased 🙃.
Meanwhile, adopters can re-audit the syscalls they monitor (using the new base_syscalls option) and consider tuning Falco rules more. It may help with the problems described here.
I opened the PR to expose the configs to set a custom capacity.
I was busy the last few weeks, but I plan to reconfigure and test all the findings from this thread over the next few weeks.
Perfect. Yes, I would suggest first trying the option of setting a queue capacity; after test deployments we shall see if there are still other issues in terms of memory actually leaking / increasing radically over time beyond expected limits. At least the simulation above shows that this is something that can currently happen, whereas with the capacity set I didn't observe memory leaking in the simulation.
At the same time, a reminder that this is not fixing the root cause, see https://github.com/falcosecurity/falco/issues/2495#issuecomment-1646037752
In addition, we may need to experiment with the best default values across the various settings that control the outputs ...
@epcim Can you try your initial config (the one pasted in the opening post), but disabling http_output?
This is a wild guess, but it's worth a try!
Thank you (also, we are working hard to understand and reproduce the issue :) )
Update: We merged the outputs_queue PR that exposes the queue capacity (available for testing in the upcoming Falco 0.36 release), and I started a new issue in libs for now, https://github.com/falcosecurity/libs/issues/1347, to track metrics that may still be missing to get to the bottom of current and possible future issues.
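For anyone who wants to test it, here is a sketch of how the new setting is expected to look in falco.yaml; the key name follows the merged PR, but please verify it against the 0.36 config file, and the capacity value is only an example:

outputs_queue:
  capacity: 500000  # maximum number of queued output items; 0 (the default) keeps the queue unbounded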
Hey, did anyone upgrade to Falco 0.36.0 and still notice the issue (even when playing with the new queue capacity options)?
Yes, upgraded to 0.36.0 last week. The falco container is still getting OOMkilled by kubernetes/cgroups (Last state: Terminated with 137: OOMKilled) with the default queue capacity config.
Unfortunately I have a hard time exposing the stats/metrics to our TSDB.
@sboschman I recall you use plugins only; there should be a regression because of a libscap refactor, and metrics should be partially broken for you now :/
See https://github.com/falcosecurity/falco/issues/2821; plus we need a patch release for metrics: https://github.com/falcosecurity/falco/issues/2850
@FedeDP I am on it, will post updates once I have them.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale