falco Falco creates a lot of processes

Describe the bug

Hi! I have about 100 nodes with Falco installed, and some of them have started to have problems. The problem occurs only on nodes that belong to clusters (Vault, Kafka). I am not sure I have correctly identified the trigger, but it seems to be a leader change in the cluster. When the Vault cluster changes leader, Falco starts to create a lot of new processes (9000+) and my monitoring starts sending alerts about it. I don't think this is normal Falco behavior, because I haven't seen that many Falco processes on any other node.

In addition, Falco appears to stop working entirely once the number of processes spikes: no new logs are written to /var/log/falko.log. At the same time, the systemd service looks healthy. If I restart the service, Falco starts successfully and runs until the next incident.

UPD: I start Falco as a systemd service. Only Falco v0.35.0 has this issue; Falco v0.32.0 did not.

How to reproduce it

Start Falco as a service on a cluster node (Kafka, Vault) (?)

Screenshots

[Screenshot: image_2023_07_05T17_45_55_979Z]

Data from journalctl -u falco: [Screenshot: Selection_031]

Environment

Falco version: 0.35.0
OS: Ubuntu 20
Kernel: 5.4.0-152-generic
Installation method: apt package

Additional context

Today I tried to run Falco without systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).

Jul 07 '23 16:07 konstantin-921

Just to provide some more context from my conversations with Konstantin. It looks like the falco service keeps trying to relaunch a falco instance. That instance ultimately never gets to run as it crashes right away with an error:

An error occurred in an event source, forcing termination...
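
One way to confirm a relaunch loop like this is to read the unit's restart counter. The following is a minimal sketch, assuming the unit name is passed in by the caller (for example `falco` or `falco-bpf`) and that the installed systemd exposes the `NRestarts` property through `systemctl show`. The helper names are hypothetical and not part of Falco.

```python
import subprocess


def parse_nrestarts(show_output: str) -> int:
    """Extract the NRestarts counter from `systemctl show` output."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "NRestarts":
            return int(value)
    raise ValueError("NRestarts not found in output")


def unit_restart_count(unit: str) -> int:
    """Ask systemd how many times it has automatically restarted `unit`."""
    # Assumption: the unit name is supplied by the caller and systemctl
    # is available on PATH.
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nrestarts(result.stdout)
```

Calling `unit_restart_count("falco")` on an affected node and watching the value grow between calls would suggest that systemd is repeatedly restarting the unit.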

Jul 07 '23 17:07 terylt

Hey! Thank you for this very detailed issue! We will look into it asap!

> Today I tried to run Falco without systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).

Nice info, thank you! Perhaps pstree is counting all the times that systemd restarted the unit for us, since it is forever failing (with the "An error occurred in an event source, forcing termination..." error)?

Do you also see resources being used by these 9k Falco instances?
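
To answer the resource question with numbers, the thread count and resident memory of a process can be read from `/proc/<pid>/status`. This is a rough sketch, assuming a Linux host with procfs mounted; the function names are hypothetical, and the PID has to be looked up separately.

```python
from pathlib import Path


def parse_proc_status(text: str) -> dict:
    """Turn the `Key:<tab>value` lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text: str) -> tuple:
    """Return (thread_count, resident_memory_kb) from status text."""
    fields = parse_proc_status(text)
    threads = int(fields["Threads"])
    # VmRSS is absent for kernel threads, so fall back to zero.
    rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
    return threads, rss_kb


def summarize_pid(pid: int) -> tuple:
    """Read and summarize the status file of a live process."""
    return summarize(Path(f"/proc/{pid}/status").read_text())
```

A high `Threads` value together with a small `VmRSS` would be consistent with many idle threads rather than thousands of full Falco instances.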

Jul 10 '23 07:07 FedeDP

From pstree man pages:

Child threads of a process are found under the parent process and are shown with the process name in curly braces, e.g.
      icecast2---13*[{icecast2}]

So, it seems we are seeing 9462 threads under the Falco process. Perhaps systemd misses some cleanup? Which systemd version do you have?
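
The curly-brace notation can also be counted mechanically. Below is a small sketch that sums the thread entries in a single line of pstree output; it assumes the compact `N*[{name}]` form shown in the man page excerpt and does not handle nested groups such as `2*[bash---{bash}]`. The function name is hypothetical.

```python
import re

# Matches either `N*[{name}` (N identical threads) or a bare `{name}`.
_THREAD_GROUP = re.compile(r"(?:(\d+)\*\[)?\{[^{}]+\}")


def count_threads(pstree_line: str) -> int:
    """Count thread entries (names in curly braces) in one pstree line."""
    total = 0
    for match in _THREAD_GROUP.finditer(pstree_line):
        # A missing multiplier means the thread is listed once.
        total += int(match.group(1)) if match.group(1) else 1
    return total
```

For the man page example, `count_threads("icecast2---13*[{icecast2}]")` gives 13 threads and no child processes.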

Jul 10 '23 07:07 FedeDP

Also there is no real difference between Falco 0.32 unit (that was kmod only though) and Falco 0.35 falco-bpf unit:

Jul 10 '23 08:07 FedeDP

> From pstree man pages:
> Child threads of a process are found under the parent process and are shown with the process name in curly braces, e.g.
>       icecast2---13*[{icecast2}]
> So, it seems we are seeing 9462 threads under the Falco process. Perhaps systemd misses some cleanup? Which systemd version do you have?

Hi! We use systemd 245 (245.4-4ubuntu3.22)

Jul 10 '23 12:07 konstantin-921

> Hey! Thank you for this very detailed issue! We will look into it asap!

> > Today I tried to run Falco without systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).

> Nice info, thank you! Perhaps pstree is counting all the times that systemd restarted the unit for us, since it is forever failing (with An error occurred in an event source, forcing termination... error)?

> Do you also see resources being used by these 9k Falco instances?

I don't see any real resources being used by these Falco instances.

Jul 10 '23 12:07 konstantin-921

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Nov 29 '23 21:11 poiana

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

Dec 29 '23 21:12 poiana

/remove-lifecycle rotten

Jan 03 '24 13:01 Andreagit97

/lifecycle stale

Apr 02 '24 15:04 poiana

/remove-lifecycle stale

Apr 02 '24 16:04 Andreagit97

/lifecycle stale

Jul 01 '24 21:07 poiana

/lifecycle rotten

Jul 31 '24 22:07 poiana

/remove-lifecycle rotten

Aug 01 '24 07:08 Andreagit97
