Falco creates a lot of processes

Open konstantin-921 opened this issue 1 year ago • 14 comments

Describe the bug

Hi! I have about 100 nodes where Falco is installed. Some of them are starting to have problems. The problem exists only on nodes that belong to clusters (Vault, Kafka). I am not sure I have correctly identified the point at which the problem occurs, but it seems to be a change of leader in the cluster. When the Vault cluster changes leader, Falco starts to create a lot of new processes (9000+) and my monitoring starts to send alerts about it. I don't think this is normal Falco behavior, because I haven't seen that many Falco processes on any of the other nodes.

In addition, I see that Falco stops working entirely once the number of processes spikes: there are no new logs in /var/log/falko.log. At the same time, the systemd service looks healthy. If I restart the service, Falco starts successfully and runs until the next incident.
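
Since the systemd unit reports healthy while the log goes quiet, one way to catch this state is to alert on the log file's age instead of the unit status. A minimal sketch, assuming the log path from this report and an arbitrary threshold; `is_log_stale` and the 10-minute value are hypothetical, not part of Falco:

```python
import os
import sys
import time


def is_log_stale(path, max_age_seconds, now=None):
    """Return True if `path` was last modified more than `max_age_seconds` ago."""
    if now is None:
        now = time.time()
    # getmtime raises OSError if the file does not exist; let that surface.
    return (now - os.path.getmtime(path)) > max_age_seconds


if __name__ == "__main__":
    # The path is the one quoted in this report; the threshold is an assumption.
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/falko.log"
    if is_log_stale(log_path, max_age_seconds=600):
        print(f"{log_path} has not been written to for over 10 minutes")
```

A quiet log can also just mean no rule fired, so the threshold would need tuning per node.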

UPD: I start Falco as a systemd service. Only Falco v0.35.0 has this issue; Falco v0.32.0 did not.

How to reproduce it

Start Falco as a service on a cluster node (Kafka, Vault) (?)

Screenshots

image_2023_07_05T17_45_55_979Z

Data from journalctl -u falco: [screenshot: Selection_031]

Environment

  • Falco version: 0.35.0
  • OS: Ubuntu 20
  • Kernel: 5.4.0-152-generic
  • Installation method: apt package

Additional context

Today I tried to run Falco without the systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).

konstantin-921 avatar Jul 07 '23 16:07 konstantin-921

Just to provide some more context from my conversations with Konstantin. It looks like the falco service keeps trying to relaunch a falco instance. That instance ultimately never gets to run as it crashes right away with an error:

An error occurred in an event source, forcing termination...
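
To check how often the instance actually died this way, the journal for the unit can be scanned for that message. A rough sketch, assuming the unit is named `falco` (as in the `journalctl -u falco` call in the report); the helper names are made up for illustration:

```python
import subprocess

# Message text copied from the error quoted above.
TERMINATION_MSG = "An error occurred in an event source, forcing termination..."


def count_termination_errors(journal_text):
    """Count journal lines that contain the forced-termination message."""
    return sum(1 for line in journal_text.splitlines() if TERMINATION_MSG in line)


def read_unit_journal(unit="falco"):
    """Return the full journal for `unit` as text (unit name is an assumption)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(count_termination_errors(read_unit_journal()))
```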

terylt avatar Jul 07 '23 17:07 terylt

Hey! Thank you for this very detailed issue! We will look into it asap!

> Today I tried to run Falco without the systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).

Nice info, thank you! Perhaps pstree is counting all the times that systemd restarted the unit for us, since it is forever failing (with An error occurred in an event source, forcing termination... error)?

Do you also see resources being used by these 9k Falco instances?
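
The restart theory can be checked directly: systemd keeps a per-unit restart counter, exposed as the `NRestarts` property of `systemctl show`. A small sketch, assuming the unit is called `falco` (it may be `falco-bpf` or similar depending on the driver); the function names are illustrative only:

```python
import subprocess


def parse_nrestarts(show_output):
    """Extract the NRestarts value from `systemctl show` output, or None if absent."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NRestarts":
            return int(value.strip())
    return None


def unit_restart_count(unit="falco"):
    """Ask systemd how many times it has automatically restarted `unit`."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nrestarts(result.stdout)


if __name__ == "__main__":
    print(unit_restart_count())
```

A value in the thousands would support the restart-loop explanation; a value near zero would point back at threads inside a single instance.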

FedeDP avatar Jul 10 '23 07:07 FedeDP

From pstree man pages:

Child threads of a process are found under the parent process and are shown with the process name in curly braces, e.g.

      icecast2---13*[{icecast2}]

So, it seems we are seeing 9462 threads under the Falco process. Perhaps systemd misses some cleanup? Which systemd version do you have?
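
For reference, the thread count can be read straight from a pstree line in that compact notation. This is only a sketch: `count_threads` is a hypothetical helper and it handles the simple `N*[{name}]` and `{name}` forms, not grouped subtrees such as `2*[foo---{foo}]`:

```python
import re

# Matches "{name}" optionally preceded by a repeat count such as "13*[".
_THREAD_GROUP = re.compile(r"(?:(\d+)\*\[)?\{[^{}]+\}")


def count_threads(pstree_line):
    """Sum the child threads shown in curly braces on one pstree output line."""
    total = 0
    for match in _THREAD_GROUP.finditer(pstree_line):
        # A missing repeat count means a single thread entry.
        total += int(match.group(1)) if match.group(1) else 1
    return total


if __name__ == "__main__":
    print(count_threads("icecast2---13*[{icecast2}]"))
```

On Linux, the same number is also available as the `Threads:` field of `/proc/<pid>/status`, which avoids parsing pstree output at all.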

FedeDP avatar Jul 10 '23 07:07 FedeDP

Also, there is no real difference between the Falco 0.32 unit (which was kmod only, though) and the Falco 0.35 falco-bpf unit:

FedeDP avatar Jul 10 '23 08:07 FedeDP

> From pstree man pages:
>
> Child threads of a process are found under the parent process and are shown with the process name in curly braces, e.g.
>
>       icecast2---13*[{icecast2}]
>
> So, it seems we are seeing 9462 threads under the Falco process. Perhaps systemd misses some cleanup? Which systemd version do you have?

Hi! We use systemd 245 (245.4-4ubuntu3.22)

konstantin-921 avatar Jul 10 '23 12:07 konstantin-921

> Hey! Thank you for this very detailed issue! We will look into it asap!
>
> Today I tried to run Falco without the systemd service on Vault nodes and so far the problem has not reproduced (it has been several hours).
>
> Nice info, thank you! Perhaps pstree is counting all the times that systemd restarted the unit for us, since it is forever failing (with An error occurred in an event source, forcing termination... error)?
>
> Do you also see resources being used by these 9k Falco instances?

I don't see any real resources being used by these Falco instances.

image

konstantin-921 avatar Jul 10 '23 12:07 konstantin-921

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Nov 29 '23 21:11 poiana

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana avatar Dec 29 '23 21:12 poiana

/remove-lifecycle rotten

Andreagit97 avatar Jan 03 '24 13:01 Andreagit97

/lifecycle stale

poiana avatar Apr 02 '24 15:04 poiana

/remove-lifecycle stale

Andreagit97 avatar Apr 02 '24 16:04 Andreagit97

/lifecycle stale

poiana avatar Jul 01 '24 21:07 poiana

/lifecycle rotten

poiana avatar Jul 31 '24 22:07 poiana

/remove-lifecycle rotten

Andreagit97 avatar Aug 01 '24 07:08 Andreagit97