falco
Falco creates a lot of processes
Describe the bug
Hi! I have about 100 nodes with Falco installed, and some of them have started having problems. The problem occurs only on nodes that belong to clusters (Vault, Kafka). I am not sure I have correctly identified the trigger, but it appears to be a leader change in the cluster. When the Vault cluster changes leader, Falco starts creating a lot of new processes (9000+) and my monitoring starts sending alerts about it. I don't think this is normal Falco behavior, because I haven't seen that many Falco processes on any of the other nodes.
In addition, Falco appears to stop working entirely once the number of processes spikes: no new logs are written to /var/log/falko.log. At the same time, the systemd service looks healthy. If I restart the service, Falco starts successfully and runs until the next incident.
UPD: I start Falco as a systemd service. Only Falco v0.35.0 has this issue; Falco v0.32.0 did not.
How to reproduce it
Start Falco as a service on a cluster node (Kafka, Vault) (?)
Screenshots
Data from journalctl -u falco
Environment
- Falco version: 0.35.0
- OS: Ubuntu 20
- Kernel: 5.4.0-152-generic
- Installation method: apt package
Additional context
Today I tried to run Falco without systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).
Just to provide some more context from my conversations with Konstantin. It looks like the falco service keeps trying to relaunch a falco instance. That instance ultimately never gets to run as it crashes right away with an error:
An error occurred in an event source, forcing termination...
Hey! Thank you for this very detailed issue! We will look into it asap!
Today I tried to run Falco without systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).
Nice info, thank you!
Perhaps pstree is counting all the times that systemd restarted the unit for us, since it is forever failing (with the An error occurred in an event source, forcing termination... error)?
Do you also see resources being used by these 9k Falco instances?
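One way to check the restart hypothesis is to ask systemd how many times it has restarted the unit. Below is a minimal sketch that reads the NRestarts counter (available in systemd 235 and later) together with the unit state via systemctl show. The unit name falco.service is an assumption based on the journalctl -u falco command used in this report; adjust it (for example to a falco-bpf unit) to match the actual installation. The helper names are illustrative only.

```python
import subprocess


def parse_show_output(text):
    """Parse `systemctl show` output, which is one KEY=VALUE pair per line."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key] = value
    return props


def unit_restart_info(unit="falco.service"):
    """Return restart-related properties of a unit as a dict of strings.

    The default unit name is an assumption, not taken from the Falco package.
    """
    result = subprocess.run(
        ["systemctl", "show", unit,
         "--property=NRestarts,ActiveState,SubState,MainPID"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


# Usage on an affected node (requires systemd):
#   info = unit_restart_info()
#   print(info.get("NRestarts"), info.get("ActiveState"), info.get("SubState"))
```

If NRestarts keeps growing while the unit is reported as active or activating, that would be consistent with systemd relaunching a Falco instance that exits right away.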
From the pstree man page:
Child threads of a process are found under the parent process and are shown with the process name in curly braces, e.g.
icecast2---13*[{icecast2}]
So, it seems we are seeing 9462 threads under the Falco process. Perhaps systemd misses some cleanup? Which systemd version do you have?
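To tell threads apart from separate processes without relying on pstree formatting, the /proc filesystem can be read directly: every process has its own /proc/<pid> directory, and every thread of that process has an entry under /proc/<pid>/task. The following is a rough sketch under that assumption; the function names are hypothetical, and the comm value is matched against the name falco only as an example (the kernel truncates comm to 15 characters).

```python
import os


def thread_count(pid, proc_root="/proc"):
    """Number of threads of `pid`: one directory per thread in /proc/<pid>/task."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "task")))


def processes_named(name, proc_root="/proc"):
    """Sorted PIDs whose /proc/<pid>/comm equals `name` (one PID per process)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return sorted(pids)


# Usage on an affected node:
#   for pid in processes_named("falco"):
#       print(pid, thread_count(pid))
```

A single PID with thousands of task entries would point to threads accumulating inside one Falco process, while thousands of distinct PIDs would point to separate processes.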
Also there is no real difference between Falco 0.32 unit (that was kmod only though) and Falco 0.35 falco-bpf unit:
Hi! We use systemd 245 (245.4-4ubuntu3.22)
I don't see any real resources being used by these Falco instances.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
/remove-lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle rotten