Falco 0.37.1 modern_ebpf crashes server
Describe the bug
After upgrading Falco from 0.36.2 to 0.37.1 and switching the driver from ebpf to modern_ebpf, physical servers under higher load started to crash.
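For reference, the driver switch corresponds to this part of the Helm values (a minimal sketch; the complete values are listed under Additional context below):

driver:
  # kind: ebpf        # driver used with Falco 0.36.2
  kind: modern_ebpf   # driver used after the upgrade to 0.37.1
  modernEbpf:
    bufSizePreset: 8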
How to reproduce it
The crash occurs randomly over time on the more heavily loaded physical servers.
Environment
- Falco version: 0.37.1
- System info:
{
  "machine": "x86_64",
  "nodename": "falcosecurity-falco-<...>",
  "release": "6.1.42-1.el8.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP PREEMPT_DYNAMIC Tue Aug 1 07:24:16 UTC 2023"
}
- Cloud provider or hardware configuration:
- OS: Rocky Linux 8.8
- CPU: AMD EPYC 7742 64-Core Processor 128 cores
- Kernel: Linux 6.1.42-1.el8.x86_64 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
- Installation method: plain OSS Kubernetes
Additional context
Crashdump:
[17284898.905756] IPv6: ADDRCONF(NETDEV_CHANGE): cali841dc279d4d: link becomes ready
[17285388.370981] IPv6: ADDRCONF(NETDEV_CHANGE): cali6a7f0dad2a8: link becomes ready
[17285491.259227] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[17285491.259283] IPv6: ADDRCONF(NETDEV_CHANGE): cali5d758ecb513: link becomes ready
[17285552.983963] BUG: unable to handle page fault for address: ffffffffff6000c7
[17285552.987818] #PF: supervisor read access in kernel mode
[17285552.991552] #PF: error_code(0x0000) - not-present page
[17285552.995304] PGD 6a0e067 P4D 6a0e067 PUD 6a10067 PMD 6a12067 PTE 0
[17285552.999051] Oops: 0000 [#1] PREEMPT SMP NOPTI
[17285553.002776] CPU: 31 PID: 95831 Comm: kube-proxy Kdump: loaded Not tainted 6.1.42-1.el8.x86_64 #1
[17285553.006737] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.11.4 03/22/2023
[17285553.010774] RIP: 0010:copy_from_kernel_nofault+0x6d/0x120
[17285553.014852] Code: f8 4c 89 e7 4b 8d 14 2c 31 f6 48 c1 e8 03 4d 8d 44 c4 08 eb 13 48 83 c7 08 48 89 d1 48 83 c3 08 48 29 f9 4c 39 c7 74 34 89 f1 <48> 8b 03 48 89 07 85 c9 74 e1 65 48 8b 04 25 c0 bb 01 00 83 a8 18
[17285553.023657] RSP: 0018:ffffc90003be7d80 EFLAGS: 00010256
[17285553.028208] RAX: 0000000000000000 RBX: ffffffffff6000c7 RCX: 0000000000000000
[17285553.033957] RDX: ffffc90003be7e18 RSI: 0000000000000000 RDI: ffffc90003be7e10
[17285553.038745] RBP: ffffc90003be7d98 R08: ffffc90003be7e18 R09: 0000000000000000
[17285553.043381] R10: 0000000000000001 R11: ffff88826a519990 R12: ffffc90003be7e10
[17285553.048067] R13: 0000000000000008 R14: 0000000000000000 R15: ffffc90003be7e98
[17285553.052769] FS: 000000c000d90890(0000) GS:ffff88fe7d9c0000(0000) knlGS:0000000000000000
[17285553.057962] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17285553.062947] CR2: ffffffffff6000c7 CR3: 000000153ab3c000 CR4: 0000000000350ee0
[17285553.068504] Call Trace:
[17285553.074274] <TASK>
[17285553.079218] ? show_regs.cold.14+0x1a/0x1f
[17285553.084320] ? __die_body+0x1f/0x70
[17285553.089309] ? __die+0x2a/0x35
[17285553.094284] ? _end+0x7b5da0c7/0x0
[17285553.099340] ? page_fault_oops+0xaf/0x270
[17285553.104379] ? bpf_probe_read_kernel+0x1d/0x50
[17285553.109575] ? bpf_ringbuf_submit+0x10/0x20
[17285553.115044] ? bpf_prog_182d4293644cc965_pf_kernel+0x549/0x558
[17285553.121418] ? _end+0x7b5da0c7/0x0
[17285553.127468] ? do_user_addr_fault+0x30b/0x590
[17285553.132943] ? _end+0x7b5da0c7/0x0
[17285553.138381] ? exc_page_fault+0x6f/0x160
[17285553.143782] ? asm_exc_page_fault+0x27/0x30
[17285553.149265] ? _end+0x7b5da0c7/0x0
[17285553.154742] ? copy_from_kernel_nofault+0x6d/0x120
[17285553.160220] bpf_probe_read_kernel+0x1d/0x50
[17285553.166254] bpf_prog_3a9838b3cf5001f5_accept4_x+0x2e6/0x1589
[17285553.172566] ? bpf_probe_read_kernel+0x1d/0x50
[17285553.178263] ? bpf_prog_c5b1b737d5cb01c5_sys_exit+0x28f/0x50c
[17285553.184115] bpf_trace_run2+0x54/0xd0
[17285553.189977] __bpf_trace_sys_exit+0x9/0x10
[17285553.195917] syscall_exit_to_user_mode_prepare+0x171/0x1d0
[17285553.202015] syscall_exit_to_user_mode+0xd/0x40
[17285553.207926] do_syscall_64+0x46/0x90
[17285553.214281] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[17285553.221453] RIP: 0033:0x42130e
[17285553.228105] Code: 20 4c 89 44 24 38 e8 31 3d ff ff 48 85 f6 0f 84 97 00 00 00 48 8b 54 24 78 49 89 f1 48 8b 74 24 48 4d 89 c8 49 29 d0 4d 8b 09 <4d> 85 c9 74 b1 4d 89 ca 49 29 d1 4c 39 ce 77 a6 4c 89 44 24 70 48
[17285553.240964] RSP: 002b:000000c000e51e88 EFLAGS: 00000206
[17285553.247460] RAX: 000000c003f36f70 RBX: 00000000000000d0 RCX: 000000000002aaa0
[17285553.254109] RDX: 000000c003f36f70 RSI: 00000000000000d0 RDI: 0000000000000012
[17285553.260788] RBP: 000000c000e51f08 R08: 0000000000000018 R09: 0000000000000000
[17285553.267735] R10: 000000000002aaaa R11: 0000000000000002 R12: 000000c000e51f08
[17285553.274514] R13: 000000000000000e R14: 000000c0005c6ea0 R15: 0000000002f14f80
[17285553.280551] </TASK>
[17285553.286380] Modules linked in: xt_CT xt_multiport ipt_rpfilter ip_set_hash_net veth ip6t_REJECT nf_reject_ipv6 nf_conntrack_netlink ipt_REJECT nf_reject_ipv4 xt_addrtype xt_set ip_set_hash_ipportnet ip_set_hash_ipport ip_set_hash_ipportip ip_set_hash_ip ip_set_bitmap_port dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr xt_MASQUERADE xt_mark nft_chain_nat nf_nat xt_conntrack xt_comment nft_compat overlay ip_vs_sed ip_vs_lc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tcp_diag inet_diag amd64_edac edac_mce_amd kvm_amd kvm irqbypass wmi_bmof pcspkr rapl nf_tables sp5100_tco acpi_ipmi i2c_piix4 k10temp nfnetlink ipmi_si acpi_power_meter vfat fat sch_fq_codel ipmi_devintf ipmi_msghandler xfs libcrc32c dm_crypt sd_mod t10_pi crc64_rocksoft crc64 crct10dif_pclmul crc32_pclmul crc32c_intel sg ghash_clmulni_intel sha512_ssse3 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci i2c_algo_bit aesni_intel drm_shmem_helper crypto_simd libahci cryptd tg3 i40e drm ptp libata ccp pps_core
[17285553.286445] megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[17285553.350752] CR2: ffffffffff6000c7
Installation was done using the official Helm chart version 0.4.2 with the following values:
services:
  - name: k8saudit-webhook
    type: ClusterIP
    ports:
      - port: 9765
        protocol: TCP
# -- Tolerations to allow Falco to run on Kubernetes masters.
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
driver:
  kind: modern_ebpf
  modernEbpf:
    bufSizePreset: 8
  loader:
    initContainer:
      resources:
        requests:
          cpu: 10m
          memory: 1Gi
        limits:
          cpu: 1000m
          memory: 1Gi
falcoctl:
  config:
    indexes:
      - name: falcosecurity
        url: https://falcosecurity.github.io/falcoctl/index.yaml
    artifact:
      allowedTypes:
        - rulesfile
        - plugin
      install:
        refs: [k8saudit-rules:0.7]
      follow:
        # -- List of artifacts to be followed by the falcoctl sidecar container.
        refs: [k8saudit-rules:0.7]
        # -- How often the tool checks for new versions of the followed artifacts.
        every: 1h
falco:
  rules_file:
    - /etc/falco/falco_rules.local.yaml
    - /etc/falco/rules.d
  json_output: true
  json_include_output_property: true
  json_include_tags_property: true
  http_output:
    enabled: true
    url: "http://falcosecurity-falcosidekick:80/"
  grpc:
    enabled: true
    bind_address: "unix:///run/falco/falco.sock"
    threadiness: 0 # 0 means "auto"
  grpc_output:
    enabled: true
  plugins:
    - name: k8saudit
      library_path: libk8saudit.so
      init_config:
        maxEventSize: "125829120"
        webhookMaxBatchSize: "125829120"
      open_params: "http://:9765/k8s-audit"
    - name: json
      library_path: libjson.so
      init_config: ""
  buffered_outputs: true
  load_plugins: [k8saudit, json]
  syscall_event_drops:
    actions:
      - ignore
    rate: "0.03333"
    max_burst: 10
  log_level: notice
resources:
  requests:
    cpu: 1
    memory: 12Gi
  limits:
    cpu: 2
    memory: 16Gi
# Collectors for data enrichment (scenario requirement)
collectors:
  docker:
    enabled: false
  crio:
    enabled: false
  kubernetes:
    enabled: false
Hey @apsega, thank you for reporting! We will take a look ASAP!
This PR, https://github.com/falcosecurity/libs/pull/1858, should fix the cause of the failure! We will probably release it with Falco 0.38.0 by the end of the month!
Just a question, do you see this page fault sporadically or is this something that always happens?
@Andreagit97 occasionally, it probably depends on the server load.
ok got it thank you!
I've seen that you have the page-fault eBPF programs enabled (bpf_prog_182d4293644cc965_pf_kernel+ in the stack trace). Do you use page_fault events in your rules, i.e. something like evt.type=page_fault?
I ask because it is unusual to see the page-fault programs enabled, and they probably generate a lot of events...
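For illustration, a rule that would attach those programs might look like the sketch below; this is a hypothetical example (the rule name, output, and priority are made up), not one of the shipped Falco rules:

- rule: Hypothetical page fault watcher
  desc: Illustrative rule whose condition matches page_fault events
  condition: evt.type = page_fault and container.id != host
  output: Page fault observed (process=%proc.name container=%container.id)
  priority: DEBUG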
Sorry for the delay. Apparently we don't have any rules containing page_fault. I'm wondering whether it's a misconfiguration issue on my end.
This should be solved in Falco 0.38.0!