[UMBRELLA] Dropped events

Open leogr opened this issue 4 years ago • 38 comments

Why this issue?

When Falco is running, a producer (a.k.a. the driver) continuously forwards events to a consumer (the Falco userspace program) through a buffer that sits in the middle. When, for any reason, the consumer is not able to consume the incoming events, an event drop occurs.

Starting from v0.15.0, Falco introduced a mechanism to detect dropped events and take action, as explained in the official documentation. However, event drops are still an issue, as reported by many users.

Since the problem depends on many factors and can be hard to analyze and understand, this issue aims to give users an overview of it and to collect the knowledge acquired so far.

Please note that this document does not try to solve this problem directly, and consider that some assumptions might be wrong (feel free to refute them!).

N.B. At the time of writing, the outcome of this issue was not clear, so I just chose to label it as "documentation".

Event drop alerts

Currently, when dropped events are detected, Falco prints out statistics that can give users information about the kinds of drops that happened.

An example of an alert regarding dropped events:

22:01:56.865201866: Debug Falco internal: syscall event drop. 10334471 system calls dropped in last second. (ebpf_enabled=0 n_drops=10334471 n_drops_buffer_clone_fork_enter=0 n_drops_buffer_clone_fork_exit=0 n_drops_buffer_connect_enter=0 n_drops_buffer_connect_exit=0 n_drops_buffer_dir_file_enter=0 n_drops_buffer_dir_file_exit=0 n_drops_buffer_execve_enter=0 n_drops_buffer_execve_exit=0 n_drops_buffer_open_enter=0 n_drops_buffer_open_exit=0 n_drops_buffer_other_interest_enter=0 n_drops_buffer_other_interest_exit=0 n_drops_buffer_total=10334471 n_drops_bug=0 n_drops_page_faults=0 n_drops_scratch_map=0 n_evts=21356397)

Note that the statistics reported above are relative to the timeframe specified (the last second) and not cumulative. Furthermore, note that an event represents just a single syscall received by the driver, regardless of whether a rule is triggered or not.

So, what do those values mean?

  • ebpf_enabled indicates whether the driver is the eBPF probe (=1) or the kernel module (=0)
  • n_drops is the sum of the other n_drops_* fields (see the section below) and represents the total number of dropped events
  • n_evts is the number of events that the driver should have sent according to its configuration. It includes n_drops, since dropped events are events that the driver should have sent to userspace but was not able to, for various reasons.
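
For example, in the alert above, n_drops=10334471 out of n_evts=21356397 means that roughly 48% of the events seen by the driver in that second were dropped (10334471 / 21356397 ≈ 0.48).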

Also note that, in extreme cases, drop alerts may be rate-limited, so consider increasing these values in the configuration file, for example:

syscall_event_drops:
  actions:
    - alert
  rate: 100
  max_burst: 1000

Kinds of drops

As you may notice, not all drops are the same. Below is an explanation of each kind, ordered from the least frequent to the most frequent.

  • n_drops_bug is the number of dropped events caused by an invalid condition in the kernel instrumentation (basically, something went wrong). AFAIK, only the eBPF probe can generate this kind of drop, and luckily there are no reports of this problem.
  • n_drops_pf (where pf stands for page fault) is the number of dropped events caused by invalid memory access; it happens whenever the memory page (referenced by the syscall) disappears before the driver is able to collect information about the event. We noticed that it rarely happens; it has occasionally been seen on GKE, related to a process that is continuously crashing (see #1309).
  • n_drops_buffer is the number of dropped events caused by a full buffer (the buffer that sits between the producer and the consumer). It's the most frequent kind, and it's related to performance. There are also different categories of buffer drops that indicate which syscall triggered them (e.g., n_drops_buffer_clone_fork_exit, n_drops_buffer_connect_enter, ...)

Those fields are defined by the driver in this struct.

Performance-related drops (n_drops_buffer)

We experience this kind of event dropping when the consumer is blocked for a while (note that the task that consumes events is single-threaded). It is strictly related to performance and can happen for several reasons. We also added a benchmark command to the event-generator to experiment with this problem (see https://github.com/falcosecurity/event-generator/pull/36 for more details).

Possible causes:

Limited CPU resource

The consumer hits the maximum CPU resources allocated to it and gets blocked for a while. For example, the official Helm chart comes with a 200m CPU hard limit that may cause this problem.
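
As a concrete illustration, here is a minimal sketch of a Helm values override that raises the CPU limit; the exact key layout may vary between chart versions, so treat it as an assumption to verify against your chart:

# values.yaml override for the Falco Helm chart (sketch; keys may vary by chart version)
resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 1000m      # raised from the 200m default; omit the limit entirely to remove the cap
    memory: 1024Mi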

Large/complex ruleset (high CPU usage)

The larger and more complex the ruleset, the more CPU will be needed. At some point, either with or without resource limitation, high CPU usage can produce event dropping.

Fetching metadata from external components (I/O blocking)

In some cases, fetching metadata (e.g., container information, k8s metadata) from an external component can be a blocking operation.

For example, the help text of the --disable-cri-async flag is quite self-explanatory:

--disable-cri-async           Disable asynchronous CRI metadata fetching.
                               This is useful to let the input event wait for the container metadata fetch
                               to finish before moving forward. Async fetching, in some environments leads
                               to empty fields for container metadata when the fetch is not fast enough to be
                               completed asynchronously. This can have a performance penalty on your environment
                               depending on the number of containers and the frequency at which they are created/started/stopped

Another option that might cause problems is:

-A                            Monitor all events, including those with EF_DROP_SIMPLE_CONS flag.

Slow responses from the Kubernetes API server could cause this problem too.

Please note: k8s.ns.name and k8s.pod.* (i.e., k8s.pod.name, k8s.pod.id, k8s.pod.labels, and k8s.pod.label.*) are populated with data fetched from the container runtime, so you don't need to enable the K8s enrichment if you only need these fields.

Blocking output (I/O blocking)

Falco's output mechanism can also have an impact and might block event processing for a while, producing drops.
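
Two falco.yaml settings worth checking here are output buffering and the notification rate limiter; a minimal sketch, assuming option names from recent Falco versions:

# falco.yaml (sketch)
# Whether output channels are buffered; unbuffered output flushes every
# alert immediately, which increases the cost of slow outputs.
buffered_outputs: true

# Token-bucket rate limiter for notifications, to reduce output pressure.
outputs:
  rate: 1
  max_burst: 1000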

The buffer size

If you are not able to solve your drop issues, you can always increase the syscall buffer size (the shared buffer between userspace and the kernel that contains all collected data). You can find more info on how to change its size in the Falco config file.
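
For example, in recent Falco versions (0.32+) the buffer dimension is controlled by a preset value in falco.yaml; a minimal sketch, assuming the current option name:

# falco.yaml (sketch; available in recent Falco versions)
# Each preset maps to a buffer dimension; 4 corresponds to the 8 MB
# default, and larger values allocate a bigger syscall buffer.
syscall_buf_size_preset: 6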

Debugging

When debugging, the first thing to consider is that multiple causes may occur simultaneously. It is worth excluding each possible cause, one by one.

Once the n_drops_bug and n_drops_pf cases are excluded, a handy checklist for performance-related drops (i.e., n_drops_buffer) is:

  • [ ] Make sure you're using the latest Falco version (and the latest Helm chart version, if using this installation method) and that the rule files are updated to match that version
  • [ ] Make sure docker (and containerd, if enabled) options are appropriately configured and that Falco can access the socket (the official documentation and the k8s deployment resource example can help with that)
  • [ ] Increase the syscall_event_drops values to avoid drop alerts being rate-limited (see the sketch after this list)
  • [ ] Remove custom rules, if any
  • [ ] Remove any CPU resource limitation (for example this)
  • [ ] Remove the -A option, if any
  • [ ] Remove the --disable-cri-async option, if any
  • [ ] Remove the -U option, if any
  • [ ] Make sure you are using -k https://$(KUBERNETES_SERVICE_HOST) (instead of -k https://kubernetes.default, see this comment)
  • [ ] Disable K8s Audit Events by setting webserver.enabled to false in the config file and removing any other related configuration
  • [ ] Completely disable K8s support (by removing -K, -k, and -pk options)
  • [ ] Disable any other integration, if any
  • [ ] Disable all outputs one by one, including stdout_output (event drop alerts will still show up)
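
As a starting point, below is a minimal falco.yaml sketch for a debugging session that applies some of the checklist items above (drop alerts not rate-limited, webserver disabled, a single output enabled); option names are from recent Falco versions, so verify them against your release:

# falco.yaml sketch for a debugging session
syscall_event_drops:
  actions:
    - alert
  rate: 100        # raised so drop alerts are not rate-limited
  max_burst: 1000

webserver:
  enabled: false   # disables the K8s Audit Events endpoint

# Keep a single output enabled while testing, e.g. stdout only:
stdout_output:
  enabled: true
file_output:
  enabled: false
program_output:
  enabled: false
http_output:
  enabled: false
grpc_output:
  enabled: false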

Finally, some useful links that could help with debugging:

  • Interesting issues about drops https://github.com/falcosecurity/falco/issues/669, https://github.com/falcosecurity/falco/issues/669#issuecomment-570824776, https://github.com/falcosecurity/falco/issues/669#issuecomment-642000182, https://github.com/falcosecurity/falco/issues/961, https://github.com/falcosecurity/falco/issues/1382, https://github.com/falcosecurity/falco/issues/615, https://github.com/falcosecurity/falco/issues/1231, https://github.com/falcosecurity/falco/issues/558

  • Issues related to n_drops_pf https://github.com/falcosecurity/falco/issues/917, https://github.com/falcosecurity/falco/issues/770, https://github.com/falcosecurity/falco/issues/669#issuecomment-641466585

  • Some threads on our Slack channel https://kubernetes.slack.com/archives/CMWH3EH32/p1592904527372800, https://kubernetes.slack.com/archives/CMWH3EH32/p1599741275086000

  • Drop related to the K8s support (-K, -k, and -pk options) https://github.com/falcosecurity/falco/issues/2129

leogr avatar Sep 16 '20 09:09 leogr

Greetings!

In our environment, Falco was reporting a lot of non-actionable weird events. For example:

"rule": "Non sudo setuid",
"output_fields": {
  "proc.pname": null,
  "proc.cmdline": "<NA>",
  "user.uid": 4294967295,
  "k8s.pod.name": null,
  "evt.time": 1600013430060232000,
  "evt.arg.uid": "root",
  "container.id": "host",
  "k8s.ns.name": null,
  "user.name": null,
  "container.image.repository": null
}

This event was not telling us anything useful.

We started an investigation with @leogr and we rapidly made the connection with dropped system calls. To really see the problem, we adjusted the rate and max_burst of syscall_event_drops events. Falco was dropping thousands of system calls:

"rule": "Falco internal: syscall event drop",
"output_fields": {
  "n_evts": "1113206",
  "n_drops_bug": "0",
  "n_drops_pf": "0",
  "ebpf_enabled": "0",
  "n_drops": "621240",
  "n_drops_buffer": "621240"
}

Our average of dropped system calls was over 1,000,000 every minute. [screenshot: dropped syscalls graph]

Our first experiment was to get rid of all the Falco rules that are in the rules file but disabled. It did not fix our problem.

Our second experiment was to get rid of the Kubernetes integration (by removing the -K, -k, and -pk options). Falco's performance improved, but only temporarily. After a few hours, Falco was again dropping thousands of system calls.

Then we removed a program_output, and Falco almost completely stopped dropping system calls. We still see small drops, but nothing like before. [screenshot: drop rate after removing program_output]

Our program output was a simple bash script calling aws sns publish to publish events to SNS. The program_output blocks the main Falco thread that processes system calls. Since the aws sns publish call is slow, Falco falls behind, the buffer between the kernel probe and userspace gets full, and Falco drops system calls. Dropping system calls generates a Falco internal: syscall event drop event, which calls aws sns publish, which slows down Falco even more. We were creating a self-feeding loop of Falco internal: syscall event drop events.

We went back to the drawing board and replaced our program_output with an http_output to the Falco Sidekick service. The http_output still "blocks" the main Falco thread, but since Sidekick replies very quickly, the impact is not noticeable and Falco is able to process system calls without having to drop them.
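
In falco.yaml terms, the change looked roughly like this (a sketch: the script path and service URL below are illustrative, not our exact values):

# Before: a blocking program output invoking a slow script
program_output:
  enabled: false                           # was: true
  program: "/opt/scripts/sns-publish.sh"   # hypothetical script calling `aws sns publish`

# After: forward events to Falcosidekick over HTTP
http_output:
  enabled: true
  url: "http://falcosidekick:2801/"        # example in-cluster service URL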

A permanent fix for these issues is to make all Falco outputs non-blocking. Falco should not block itself while trying to report security events to us 😄

As for the drops we still have, they might be caused by increases in the number of system calls to process due to activity in our environment. We might need to dig deeper: gather statistics on the number of system calls processed by Falco and on peaks, and pinpoint what is generating those peaks.

At least, weird non-actionable events are gone! 🎉

That's all folks!

JPLachance avatar Sep 24 '20 14:09 JPLachance

Hey @JPLachance

Thanks a lot for reporting this! :partying_face:

leogr avatar Sep 24 '20 15:09 leogr

Really nice report. For people who might be interested, AWS SNS support is coming in the next falcosidekick release.

Issif avatar Sep 24 '20 16:09 Issif

Hi everyone, I'm implementing security for GKE at my company. I ran some tests with Falco (only got n_drops_pf, with 1 alert / 1 hour / 1 pod). [screenshot: n_drops_pf alerts] After reading more about n_drops_pf, I decided to deploy Falco to production. The first cluster is the one with the least traffic, and I got many n_drops_buffer alerts, so I will give your team some debugging information.

My cluster uses GKE v1.16.13-gke.401, OS: Ubuntu 18.04.5 LTS and Container-Optimized OS. Resource limitation: 1 CPU and 2G RAM (in fact, the limit is never reached). Falco v0.25.0. Parameters: [screenshot: Falco parameters]

I only use /etc/falco/falco_rules.local.yaml (with all rules removed and no new rules added), webserver.enabled = false, and I tested the output types one by one (syslog, file, stdout, http):

  • Only syslog output: [screenshot]
  • Only file output: [screenshot]
  • Only stdout output: [screenshot]
  • Only http output: [screenshot]

=> The n_drops_buffer alerts have the same characteristics: 18 syscalls dropped per minute (divided into 1, 2, or 3 alerts), continuously. I think it is related to GKE; traffic can't drop so steadily. I will deploy one more cluster to get more information and check this regularity. I hope this report is useful to you; for now, I will remove the alert for dropped syscall events. Have a nice day!

Update: I deployed another cluster and got the same result. Additional information: the number of dropped syscalls per minute depends on the GKE version.

minhvo3 avatar Sep 28 '20 05:09 minhvo3

In EKS 1.18, n_drops_pf syscall drops appear every 20 seconds, with a count of 3-4.

[pod/falco-mgbdt/falco] {"output":"Falco internal: syscall event drop. 7 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"7","n_drops_buffer":"0","n_drops_bug":"0","n_drops_pf":"7","n_evts":"5617"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-11-09T05:32:41.177260257Z"}
[pod/falco-mgbdt/falco] {"output":"Falco internal: syscall event drop. 3 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"3","n_drops_buffer":"0","n_drops_bug":"0","n_drops_pf":"3","n_evts":"4911"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-11-09T05:33:10.252872900Z"}
[pod/falco-mgbdt/falco] Mon Nov  9 05:33:10 2020: Falco internal: syscall event drop. 3 system calls dropped in last second.

I figured out from the system log that kubelet-eks generates a log entry every 5 seconds.

{"level":"info","ts":"2020-11-12T07:34:39.754Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
{"level":"info","ts":"2020-11-12T07:34:44.765Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
{"level":"info","ts":"2020-11-12T07:34:49.776Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}
{"level":"info","ts":"2020-11-12T07:34:54.787Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.5 ..."}

I suspected this was associated with the 3-4 drops every 20 seconds.

I was able to find this log generated by aws-vpc-cni-plugin: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/routed-eni-cni-plugin/cni.go#L335

I can't find any misconfiguration of the aws-vpc-cni-plugin or EKS.

Here is my environment:

Falco version: 0.26.1
Driver version: 2aa88dcf6243982697811df4c1b484bcbe9488a2
running on EKS 1.18
Linux ip-10-2-60-122 5.4.0-1028-aws #29-Ubuntu SMP Mon Oct 5 15:30:10 UTC 2020 x86_64 GNU/Linux

And the worker nodes start up with ami-03814f00478507685: https://docs.aws.amazon.com/eks/latest/userguide/eks-partner-amis.html

So I decided to turn off notifications for syscall drops.

DingGGu avatar Nov 12 '20 07:11 DingGGu

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Mar 01 '21 15:03 poiana

/remove-lifecycle stale

leogr avatar Mar 11 '21 08:03 leogr

Update: It seems that 0.31.0 mitigates syscall drops better than previous versions (see https://github.com/falcosecurity/falco/issues/1870#issuecomment-1046730599)

Has anyone experienced any improvement? If yes, please share :)

leogr avatar Feb 21 '22 13:02 leogr

Hi folks :vulcan_salute: I just wanted to bring you some interesting news about the drops here. Please note that what I am about to say concerns only the bpf probe and not the kernel module.

Doing some analysis on the bpf code, I found that the drops due to a full perf buffer were counted twice; this is the reason for the large number of drops recorded with bpf, despite its lower speed compared to the kernel module. After this patch, the bpf probe should always record lower throughput than the kernel module, but being slower, it should also produce a lower number of drops. So today the bpf probe could become a solution in all those cases where the drops registered with the kernel module are excessive.

Anyone who wants to test the bpf probe after this patch would give us a big hand in understanding where to focus to further improve our probe.

Last but not least, in this patch I also added a new drop category, PPM_FAILURE_FRAME_SCRATCH_MAP_FULL, to clearly separate drops due to a full buffer from those caused by events that are too large:

  • PPM_FAILURE_FRAME_SCRATCH_MAP_FULL: drops due to an event that is too large to be saved into the frame_scratch_map before being copied into the perf buffer.
  • PPM_FAILURE_BUFFER_FULL: drops due to a full perf buffer.

Prior to this patch, both types of drops fell under the same category PPM_FAILURE_BUFFER_FULL, which was slightly misleading.

Some references:

Patch in libs: https://github.com/falcosecurity/libs/pull/224
Falco adapted with the new drop category: https://github.com/falcosecurity/falco/pull/1916

Andreagit97 avatar Feb 24 '22 15:02 Andreagit97

@leogr, what do you think about adding https://github.com/falcosecurity/falco/issues/2129 to the list of interesting linked issues?

jasondellaluce avatar Jul 14 '22 12:07 jasondellaluce

I too am running into the -k issue. Here's the output from me running for several minutes with the event generator running. I would expect to see a ton more events, as I do when I leave out the -k/-K bits (-pk doesn't affect it). I also don't ever see any of the additional context I'm supposed to see with -k, so I am guessing that I'm only actually seeing events where container.info isn't in the output and it's not trying to get additional context from the k8s API.

There are more events that come in after this, but they are more of the same: no additional k8s context other than pod and namespace, a fraction of what I'd expect to see, and lots of buffer drops.

This is on AWS EKS, a brand new cluster, AL2, K8s 1.24, with the event generator running. I use the CLI to get a token into a file (aws eks get-token), but it would be better if Falco were able to do this under the hood on AWS nodes, as those tokens don't live very long.

[root@ip-10-30-3-248 ~]# FALCO_BPF_PROBE="/.falco/falco-bpf.o" falco -pk -k https://xxxxxxxxxxxx.sk1.us-east-1.eks.amazonaws.com -K token
Fri Dec 30 16:50:18 2022: Falco version: 0.33.1 (x86_64)
Fri Dec 30 16:50:18 2022: Falco initialized with configuration file: /etc/falco/falco.yaml
Fri Dec 30 16:50:18 2022: Loading rules from file /etc/falco/falco_rules.yaml
Fri Dec 30 16:50:18 2022: Loading rules from file /etc/falco/falco_rules.local.yaml
Fri Dec 30 16:50:18 2022: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Fri Dec 30 16:50:18 2022: Starting health webserver with threadiness 4, listening on port 8765
Fri Dec 30 16:50:18 2022: Enabled event sources: syscall
Fri Dec 30 16:50:18 2022: Opening capture with BPF probe. BPF probe path: /.falco/falco-bpf.o
{"hostname":"ip-10-30-3-248.ec2.internal","output":"16:50:22.707321322: Warning Sensitive file opened for reading by non-trusted program (user=<NA> user_loginuid=-1 program=httpd command=httpd --loglevel info run ^syscall.ReadSensitiveFileUntrusted$ --sleep 6s pid=14891 file=/etc/shadow parent=event-generator gparent=containerd-shim ggparent=systemd gggparent=<NA> container_id=df43a6a0a650 image=docker.io/falcosecurity/event-generator) k8s.ns=event-generator k8s.pod=event-generator-6c7cbf956f-4z8z5 container=df43a6a0a650","priority":"Warning","rule":"Read sensitive file untrusted","source":"syscall","tags":["filesystem","mitre_credential_access","mitre_discovery"],"time":"2022-12-30T16:50:22.707321322Z", "output_fields": {"container.id":"df43a6a0a650","container.image.repository":"docker.io/falcosecurity/event-generator","evt.time":1672419022707321322,"fd.name":"/etc/shadow","k8s.ns.name":"event-generator","k8s.pod.name":"event-generator-6c7cbf956f-4z8z5","proc.aname[2]":"containerd-shim","proc.aname[3]":"systemd","proc.aname[4]":null,"proc.cmdline":"httpd --loglevel info run ^syscall.ReadSensitiveFileUntrusted$ --sleep 6s","proc.name":"httpd","proc.pid":14891,"proc.pname":"event-generator","user.loginuid":-1,"user.name":"<NA>"}}
{"hostname":"ip-10-30-3-248.ec2.internal","output":"16:50:22.810666221: Warning Sensitive file opened for reading by non-trusted program (user=<NA> user_loginuid=-1 program=event-generator command=event-generator run --loop pid=1267 file=/etc/shadow parent=containerd-shim gparent=systemd ggparent=<NA> gggparent=<NA> container_id=df43a6a0a650 image=docker.io/falcosecurity/event-generator) k8s.ns=event-generator k8s.pod=event-generator-6c7cbf956f-4z8z5 container=df43a6a0a650","priority":"Warning","rule":"Read sensitive file untrusted","source":"syscall","tags":["filesystem","mitre_credential_access","mitre_discovery"],"time":"2022-12-30T16:50:22.810666221Z", "output_fields": {"container.id":"df43a6a0a650","container.image.repository":"docker.io/falcosecurity/event-generator","evt.time":1672419022810666221,"fd.name":"/etc/shadow","k8s.ns.name":"event-generator","k8s.pod.name":"event-generator-6c7cbf956f-4z8z5","proc.aname[2]":"systemd","proc.aname[3]":null,"proc.aname[4]":null,"proc.cmdline":"event-generator run --loop","proc.name":"event-generator","proc.pid":1267,"proc.pname":"containerd-shim","user.loginuid":-1,"user.name":"<NA>"}}
{"hostname":"ip-10-30-3-248.ec2.internal","output":"Falco internal: syscall event drop. 105359 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"105359","n_drops_buffer_clone_fork_enter":"1116","n_drops_buffer_clone_fork_exit":"2208","n_drops_buffer_connect_enter":"559","n_drops_buffer_connect_exit":"558","n_drops_buffer_dir_file_enter":"2045","n_drops_buffer_dir_file_exit":"2045","n_drops_buffer_execve_enter":"398","n_drops_buffer_execve_exit":"408","n_drops_buffer_open_enter":"17359","n_drops_buffer_open_exit":"17363","n_drops_buffer_other_interest_enter":"168","n_drops_buffer_other_interest_exit":"167","n_drops_buffer_total":"105359","n_drops_bug":"0","n_drops_page_faults":"0","n_drops_scratch_map":"0","n_evts":"266927"},"priority":"Debug","rule":"Falco internal: syscall event drop","time":"2022-12-30T16:50:22.911085073Z"}

maynardflies avatar Dec 30 '22 17:12 maynardflies

@maynardflies thanks for jumping in. To your points:

  • The -k flag and the k8s cluster metadata collection add plenty of workload to Falco instances, which is known to increase the number of dropped events in large clusters. This is due to Falco using a non-blocking event-consuming model for performance reasons. This is probably the reason why you see fewer total events and more drop-related alerts when using the -k flag
  • By default, -pk appends the following string to each rule's output: k8s.ns=%k8s.ns.name k8s.pod=%k8s.pod.name container=%container.id. You're not getting additional k8s context info fields because they are probably not part of the output. You can append custom strings by using the -p flag with your desired appended output such as -p "k8s.ns=%k8s.ns.name, k8s.deployment=%k8s.deployment.name, ..."
  • Basic k8s context info, such as the k8s.ns.name and k8s.pod.name included by default with -pk, doesn't strictly require communication with the k8s API server, to my knowledge. That info is optimized to be collected from the local container runtime running the pod, and so should be available even when -k is not enabled. This allows many users to have "some" k8s context info while also avoiding the extra performance stress of the API server communication
  • Good point about the automatic token retrieval from Falco, I think this was not proposed yet

jasondellaluce avatar Jan 04 '23 11:01 jasondellaluce

ebpf_enabled indicates whenever the driver is the eBPF probe (=1) or a kernel module (=0)

This no longer seems to hold true when running Falco with the --modern-bpf flag, as it reports ebpf_enabled:0.

mimor avatar Aug 31 '23 13:08 mimor

that's a good point, thanks!

Andreagit97 avatar Aug 31 '23 14:08 Andreagit97
