
watchdog: BUG: soft lockup - CPU#4 stuck for 27s! (in Azure)

Open · jcchavezs opened this issue 9 months ago · 30 comments

Describe the bug

Incident causing kernel bug:

[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.204594] watchdog: BUG: soft lockup - CPU#4 stuck for 27s! [falco:20765]
[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.208746] Modules linked in: xt_set ip_set_hash_ipport xt_recent xt_statistic xt_nfacct ipt_REJECT nf_reject_ipv4 xt_multiport xt_nat
nfnetlink_acct ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ip_set wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 libchacha poly1305_x86_64 ip6_udp_tunnel udp_tunne
l xt_MASQUERADE xt_addrtype nf_tables ip6table_filter ip6table_nat ip6table_mangle ip6_tables xt_comment iptable_nat nf_nat iptable_mangle ebtable_nat ebtables br_netfilter xt_owner iptable_security xt_conntrack
iptable_filter ip_tables overlay cfg80211 xsk_diag nf_conntrack_netlink nfnetlink mlx5_ib ib_uverbs ib_core mlx5_core mlxfw intel_rapl_msr intel_rapl_common skx_edac_common nfit kvm_intel kvm irqbypass hyperv_fb
evdev mousedev sch_fq_codel fuse configfs hid_generic hv_storvsc crc32c_intel scsi_transport_fc hid_hyperv hid hv_netvsc hv_utils hyperv_keyboard hv_balloon nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 autofs4
[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.208797] CPU: 4 PID: 20765 Comm: falco Kdump: loaded Not tainted 5.15.182.1-1.cm2 #1

How to reproduce it

I did not manage to reproduce the error.

Expected behaviour

No kernel bug

Environment

  • Falco version: 0.39.2
  • System info:
  • Cloud provider or hardware configuration: Azure
  • OS:
  • Kernel: 5.15.182.1-1.cm2
  • Installation method: kubernetes

Additional context

Falco config:

append_output: []
base_syscalls:
  custom_set: []
  repair: false
buffered_outputs: false
config_files:
- /etc/falco/config.d
engine:
  kind: modern_ebpf
  modern_ebpf:
    buf_size_preset: 7
    cpus_for_each_buffer: 4
    drop_failed_exit: false
falco_libs:
  thread_table_size: 262144
file_output:
  enabled: false
  filename: ./events.txt
  keep_alive: false
grpc:
  bind_address: unix:///run/falco/falco.sock
  enabled: false
  threadiness: 0
grpc_output:
  enabled: false
http_output:
  ca_bundle: ""
  ca_cert: ""
  ca_path: /etc/falco/certs/
  client_cert: /etc/falco/certs/client/client.crt
  client_key: /etc/falco/certs/client/client.key
  compress_uploads: false
  echo: false
  enabled: false
  insecure: false
  keep_alive: false
  mtls: false
  url: ""
  user_agent: falcosecurity/falco
json_include_message_property: false
json_include_output_property: true
json_include_tags_property: true
json_output: true
libs_logger:
  enabled: false
  severity: error
load_plugins: []
log_level: error
log_stderr: true
log_syslog: false
metrics:
  convert_memory_to_mb: true
  enabled: false
  include_empty_values: false
  interval: 1h
  kernel_event_counters_enabled: true
  kernel_event_counters_per_cpu_enabled: false
  libbpf_stats_enabled: true
  output_rule: true
  resource_utilization_enabled: true
  rules_counters_enabled: true
  state_counters_enabled: true
output_timeout: 2000
outputs_queue:
  capacity: 0
plugins:
- init_config: null
  library_path: libk8saudit.so
  name: k8saudit
  open_params: http://:9765/k8s-audit
- library_path: libcloudtrail.so
  name: cloudtrail
- init_config: ""
  library_path: libjson.so
  name: json
priority: debug
program_output:
  enabled: false
  keep_alive: false
  program: 'jq ''{text: .output}'' | curl -d @- -X POST https://hooks.slack.com/services/XXX'
rule_matching: first
rules_files:
- /etc/falco/falco_rules.yaml
- /etc/falco/falco-incubating_rules.yaml
- /etc/falco/falco-sandbox_rules.yaml
stdout_output:
  enabled: true
syscall_event_drops:
  actions:
  - log
  - alert
  max_burst: 1
  rate: 0.03333
  simulate_drops: false
  threshold: 0.1
syscall_event_timeouts:
  max_consecutives: 1000
syslog_output:
  enabled: false
time_format_iso_8601: true
watch_config_files: true
webserver:
  enabled: true
  k8s_healthz_endpoint: /healthz
  listen_port: 8765
  prometheus_metrics_enabled: false
  ssl_certificate: /etc/falco/falco.pem
  ssl_enabled: false
  threadiness: 0

jcchavezs · Jul 09 '25 10:07

FYI: we upgraded to 0.40 and still have the same issue.

jcchavezs · Jul 10 '25 10:07

We ran into the same problem in our AKS cluster.

Cluster details:

  • Kubernetes: 1.30.12
  • Node OS: AKSAzureLinux
  • Node OS version: CBL-Mariner 2.0.20250602
  • Kernel: 5.15.182.1-1.cm2

Falco details:

  • Falco chart: falco-4.21.3
  • Falco version: 0.40.0

Error:

Jul 17 04:39:35 redacted-vm-name kernel: watchdog: BUG: soft lockup - CPU#16 stuck for 26s! [falco:3774060]
Jul 17 04:39:35 redacted-vm-name kernel: Modules linked in: nft_chain_nat xfrm_user xfrm_algo cmac cifs cifs_arc4 cifs_md4 dns_resolver xt_REDIRE
CT xt_multiport ip_set_hash_ipport xt_set ip_set_hash_net ip_set_list_set xt_recent xt_statistic xt_MASQUERADE ipt_REJECT nf_reject_ipv4 ip_set xt_nat xt_add
rtype nft_counter nft_compat nf_tables ip6table_filter ip6table_nat ip6table_mangle ip6_tables xt_comment iptable_nat nf_nat ebtable_nat ebtables iptable_man
gle br_netfilter xt_owner iptable_security xt_conntrack mlx5_ib ib_uverbs ib_core mlx5_core mlxfw xsk_diag nf_conntrack_netlink nfnetlink iptable_filter ip_t
ables overlay cfg80211 intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass hyperv_fb evdev mousedev sch_fq_codel fuse configfs hid_generic hid_hyperv hv
_storvsc hid crc32c_intel hv_netvsc scsi_transport_fc hv_utils hv_balloon hyperv_keyboard nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 autofs4
Jul 17 04:39:35 redacted-vm-name kernel: CPU: 16 PID: 3774060 Comm: falco Kdump: loaded Not tainted 5.15.182.1-1.cm2 #1
Jul 17 04:39:35 redacted-vm-name kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/
23/2024
Jul 17 04:39:35 redacted-vm-name kernel: RIP: 0010:do_task_stat+0xc27/0xe20
Jul 17 04:39:35 redacted-vm-name kernel: Code: 10 02 00 00 4c 8b ab c0 01 00 00 48 89 d0 4c 03 a0 c0 0a 00 00 4c 03 b8 c8 0a 00 00 4c 03 a8 80 0a
 00 00 48 8b 80 30 0a 00 00 <48> 2d 30 0a 00 00 48 39 c2 75 d9 89 8d f0 fe ff ff 4c 89 85 68 ff
Jul 17 04:39:35 redacted-vm-name kernel: RSP: 0018:ffff9ddc8619b8c8 EFLAGS: 00000246
Jul 17 04:39:35 redacted-vm-name kernel: RAX: ffff8b88fe268a30 RBX: ffff8b88dae26a40 RCX: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: RDX: ffff8b8a52584000 RSI: 0000000000000001 RDI: ffff8b8901535280
Jul 17 04:39:35 redacted-vm-name kernel: RBP: ffff9ddc8619b9d8 R08: ffff8b88c1e31c00 R09: ffff8b88c14ed258
Jul 17 04:39:35 redacted-vm-name kernel: R10: 0000000000000096 R11: 2000000181005ccf R12: 0000a1d55601f4e3
Jul 17 04:39:35 redacted-vm-name kernel: R13: 0000000000000000 R14: 0000000000002e78 R15: 00000021f5b5befd
Jul 17 04:39:35 redacted-vm-name kernel: FS:  0000788795e452c0(0000) GS:ffff8ba83fa00000(0000) knlGS:0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 17 04:39:35 redacted-vm-name kernel: CR2: 00007142d84c3f40 CR3: 0000000de0812005 CR4: 0000000000370ee0
Jul 17 04:39:35 redacted-vm-name kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 17 04:39:35 redacted-vm-name kernel: Call Trace:
Jul 17 04:39:35 redacted-vm-name kernel:  <TASK>
Jul 17 04:39:35 redacted-vm-name kernel:  proc_tgid_stat+0x14/0x20
Jul 17 04:39:35 redacted-vm-name kernel:  proc_single_show+0x52/0xc0
Jul 17 04:39:35 redacted-vm-name kernel:  seq_read_iter+0x124/0x4b0
Jul 17 04:39:35 redacted-vm-name kernel:  seq_read+0xf2/0x140
Jul 17 04:39:35 redacted-vm-name kernel:  ? 0xffffffffb2000000
Jul 17 04:39:35 redacted-vm-name kernel:  vfs_read+0x9f/0x1a0
Jul 17 04:39:35 redacted-vm-name kernel:  ksys_read+0x67/0xf0
Jul 17 04:39:35 redacted-vm-name kernel:  __x64_sys_read+0x19/0x20
Jul 17 04:39:35 redacted-vm-name kernel:  x64_sys_call+0x1dba/0x1fa0
Jul 17 04:39:35 redacted-vm-name kernel:  do_syscall_64+0x56/0x80
Jul 17 04:39:35 redacted-vm-name kernel:  ? putname+0x59/0x70
Jul 17 04:39:35 redacted-vm-name kernel:  ? kmem_cache_free+0x243/0x290
Jul 17 04:39:35 redacted-vm-name kernel:  ? _copy_to_user+0x20/0x30
Jul 17 04:39:35 redacted-vm-name kernel:  ? cp_new_stat+0x152/0x180
Jul 17 04:39:35 redacted-vm-name kernel:  ? __do_sys_newfstatat+0x49/0x70
Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_prog_6ad5a3e64cf82799_sys_exit+0x65/0x8ec
Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_trace_run2+0x3d/0xa0
Jul 17 04:39:35 redacted-vm-name kernel:  ? __bpf_trace_sys_exit+0x9/0x10
Jul 17 04:39:35 redacted-vm-name kernel:  ? syscall_exit_to_user_mode+0x2c/0x50
Jul 17 04:39:35 redacted-vm-name kernel:  ? do_syscall_64+0x63/0x80
Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_ringbuf_output+0x4a/0xf0
Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_prog_9cee710a6b25781f_openat_x+0xc47/0x1f0c
Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_trace_run2+0x3d/0xa0
Jul 17 04:39:35 redacted-vm-name kernel:  ? __bpf_trace_sys_exit+0x9/0x10
Jul 17 04:39:35 redacted-vm-name kernel:  ? syscall_exit_to_user_mode+0x2c/0x50
Jul 17 04:39:35 redacted-vm-name kernel:  ? do_syscall_64+0x63/0x80
Jul 17 04:39:35 redacted-vm-name kernel:  ? irqentry_exit_to_user_mode+0xe/0x20
Jul 17 04:39:35 redacted-vm-name kernel:  ? irqentry_exit+0x1d/0x30
Jul 17 04:39:35 redacted-vm-name kernel:  ? sysvec_hyperv_stimer0+0x4e/0x90
Jul 17 04:39:35 redacted-vm-name kernel:  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
Jul 17 04:39:35 redacted-vm-name kernel: RIP: 0033:0x788795f4b25c
Jul 17 04:39:35 redacted-vm-name kernel: Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 d9 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89
 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 2f d6 f8 ff 48
Jul 17 04:39:35 redacted-vm-name kernel: RSP: 002b:00007ffc419e29a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: RAX: ffffffffffffffda RBX: 00007887945a5000 RCX: 0000788795f4b25c
Jul 17 04:39:35 redacted-vm-name kernel: RDX: 0000000000000400 RSI: 000078876b877800 RDI: 0000000000000178
Jul 17 04:39:35 redacted-vm-name kernel: RBP: 00007887960225e0 R08: 0000000000000000 R09: 0000788795e40740
Jul 17 04:39:35 redacted-vm-name kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007ffc419e2b60
Jul 17 04:39:35 redacted-vm-name kernel: R13: 0000000000000d68 R14: 00007887960219e0 R15: 0000000000000d68
Jul 17 04:39:35 redacted-vm-name kernel:  </TASK>

This leads to node restarts.
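For context on the trace above: the stuck code path (proc_tgid_stat → do_task_stat) is the kernel serving an ordinary read of /proc/<pid>/stat, which Falco's userspace typically performs when scanning processes for its thread table. A minimal illustration of that read from a shell (not a reproducer, just the same proc file):

```shell
# Reading /proc/<pid>/stat goes through proc_single_show ->
# proc_tgid_stat -> do_task_stat, the path shown in the lockup trace.
# Here we read this shell's own entry and print the pid and comm fields.
read -r pid comm rest < "/proc/$$/stat"
echo "$pid $comm"   # e.g. "12345 (bash)"
```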

peterD4N · Jul 16 '25 14:07

Hey, thanks for letting us know about the issue! Is everyone using modern ebpf driver here?

FedeDP · Jul 25 '25 09:07

If that's the case, the latest Falco (0.41.3) ships with https://github.com/falcosecurity/libs/pull/2397, which might fix the issue.

FedeDP · Jul 25 '25 09:07

From the shared dmesg, it actually seems the issue lies in the openat BPF program:

Jul 17 04:39:35 redacted-vm-name kernel:  ? bpf_prog_9cee710a6b25781f_openat_x+0xc47/0x1f0c

That is really weird. I'll give it a look! Thanks again for all the info!

FedeDP · Jul 25 '25 09:07

@FedeDP yeah we are using modern_ebpf

jcchavezs · Jul 25 '25 09:07

Is the latest Falco giving the same issue?

FedeDP · Jul 25 '25 12:07

Anyway, since both reports come from the exact same kernel, it might be a kernel bug too. Did you ever experience the same issue on older kernels?

FedeDP · Jul 25 '25 13:07

It all started in 0.39

jcchavezs · Jul 25 '25 13:07

Our problem started on the 8th of July 2025, apparently right after a kernel patch, so the problem might be kernel-related. According to the logs, the kernel version changed.

Before: "containerd://1.6.26","kernelVersion":"5.15.180.1-1.cm2","kubeProxyVersion":"v1.30.12"
After: "containerd://1.6.26","kernelVersion":"5.15.182.1-1.cm2","kubeProxyVersion":"v1.30.12"

@FedeDP we run the latest version (0.41.3) of Falco which includes https://github.com/falcosecurity/libs/pull/2397.

peterD4N · Jul 28 '25 12:07

Thanks for the info! Are you able to roll back the kernel version, by chance?

FedeDP · Jul 28 '25 12:07

In AKS, a rollback is not supported. It is possible to spin up a new node pool with an older kernel version and then migrate workloads, but that would introduce other internal issues (IP, HW capacity...). We have these problems only in the prod cluster, where testing is not easy.

peterD4N · Jul 28 '25 14:07

I see; it was worth a try :) I am really not sure, but if @jcchavezs confirms that the timing of the issue matches the kernel upgrade, I think kernel-bug stocks rise :)

FedeDP · Jul 28 '25 14:07

Kernel crashes happened on nodes with kernel version 5.15.182.1-1.cm2; for the previous version (namely 5.15.180.1-1.cm2) we don't have any reports of crashes. The issue for us started around July 2nd.

jcchavezs · Jul 28 '25 14:07

for the previous version (namely 5.15.180.1-1.cm2) we don't have any reports of crashes.

Thanks! Checking https://packages.microsoft.com/yumrepos/cbl-mariner-2.0-prod-base-srpms/Packages/k/ (is it the correct one?), I see

kernel-5.15.184.1-1.cm2.src.rpm 09-Jul-2025 00:59 196.1 MB

I see that there is already a new release out there that contains the .184 kernel: https://github.com/microsoft/azurelinux/releases/tag/2.0.20250701-2.0; hopefully it will be rolled out soon and fixes the issue. 🙏

FedeDP · Jul 28 '25 14:07

Today we got a kernel update. We're now on version 5.15.184.1-1.cm2. Let's see if that helps 🤞

peterD4N · Jul 29 '25 08:07

Unfortunately, even after applying the kernel patch, the issue persists. We experienced node restarts both yesterday and today.

peterD4N · Jul 30 '25 11:07

Do I understand correctly that this is more of a kernel issue than a Falco issue?

michaelSchmidMaloon · Aug 07 '25 07:08

It should be; I mean, considering that only people on Azure Linux are experiencing the issue, starting from the 5.15.182.1-1.cm2 kernel.

FedeDP · Aug 12 '25 09:08

We had a case open with Microsoft on this issue (crashing Linux that takes nodes down because of watchdog: BUG: soft lockup - CPU#), but after MS found that everything works fine with Falco disabled, they asked us to talk to the Falco project directly to resolve it and closed the case (I work together with peterD4N at the same company).

We also used to communicate with MS through https://github.com/Azure/AKS/issues on previous issues related to Azure k8s images. It should be possible to open an issue there, but I am not sure we have enough knowledge of this topic to open and describe the issue properly.

akorp · Aug 12 '25 18:08

@akorp question: are you able to try a different driver, e.g. kmod or the legacy eBPF probe?
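For anyone trying this: the driver is selected by `engine.kind` in falco.yaml. A minimal fragment (a sketch; per Falco's configuration, valid kinds include kmod, ebpf, and modern_ebpf):

```yaml
# Minimal falco.yaml fragment: switch from the modern eBPF driver to the
# kernel module, to check whether the soft lockup follows the driver.
engine:
  kind: kmod
```

If you deploy through the Helm chart, check your chart version's values for the equivalent driver setting rather than editing falco.yaml directly.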

FedeDP · Aug 13 '25 07:08

I mean, considering that only people on Azure Linux are experiencing the issue, starting from the 5.15.182.1-1.cm2 kernel.

We're also experiencing this issue on rke2 clusters running on bare metal with kernel 5.15.0-144-generic.

Details:

  • Kubernetes version: v1.31.7+rke2r1
  • Falco version: docker.io/falcosecurity/falco:0.41.3
  • Kernel version: 5.15.0-144-generic
  • Node OS / version: Ubuntu 22.04.5

I tested older Falco versions, which showed the same issue.

However, Ubuntu 24.04.3 with kernel 6.8.0-64-generic does not show this issue.

michaelSchmidMaloon · Aug 13 '25 08:08

Thanks for the hint @michaelSchmidMaloon; I think we can then rule out the "kernel issue" option. Can you share the dmesg error?
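To capture the trace being asked for here, the kernel ring buffer (via journalctl -k or dmesg) is the usual source. A small sketch that pulls the CPU identifier out of a lockup line; the sample line is copied from the report earlier in this thread, and the commented commands assume journald/util-linux are available:

```shell
# On a live node, one of these usually shows the full trace:
#   journalctl -k | grep -B2 -A40 "soft lockup"
#   dmesg -T | grep -B2 -A40 "soft lockup"
# The same extraction against a sample line from this thread:
line='kernel: watchdog: BUG: soft lockup - CPU#16 stuck for 26s! [falco:3774060]'
printf '%s\n' "$line" | grep -o 'CPU#[0-9]*'   # prints CPU#16
```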

FedeDP · Aug 13 '25 14:08

I think this is a kernel issue; I saw this bug on the Ubuntu bug tracker as well: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2118407.

It seems the issue was caused by this kernel change: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?h=Ubuntu-5.15.0-144.154&id=8274eb7192d7bf36c18a026f89164fa99a5dcb99 and fixed by this commit in the kernel: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.15.y&id=32caa50275cc52a382ca77b53ce3d6204d509e14.

IMO we can check whether Mariner also made a similar change on their side and didn't add the backported fix. If that is the case, we can ask them to add the backport.
@FedeDP @michaelSchmidMaloon
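One way to check this hypothesis without waiting for a crash is to search the distro kernel package's changelog for the offending change or its fix. A hedged sketch: the rpm invocation assumes an Azure Linux node with the kernel RPM installed, and the changelog line below is invented purely for illustration:

```shell
# On an Azure Linux / Mariner node (package name is an assumption):
#   rpm -q --changelog kernel | grep -i "do_task_stat"
# The same grep against an invented stand-in changelog excerpt:
excerpt='- fs/proc: do_task_stat: fix soft lockup (illustrative entry)'
printf '%s\n' "$excerpt" | grep -ci "do_task_stat"   # prints 1
```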

gargayush18 · Aug 14 '25 09:08

Great to hear, @gargayush18! Thank you very much for the info!

FedeDP · Aug 14 '25 09:08

Yeah @FedeDP, we still need to confirm whether this is the case with the current Mariner version; I just took it as an assumption.

gargayush18 · Aug 14 '25 09:08

@akorp question: are you able to try a different driver, e.g. kmod or the legacy eBPF probe?

Sorry for the delayed reply. Independently of this issue, we are migrating our affected cluster to a new image (Cilium), so I hope the issue does not affect that image, since it uses a different kernel.

akorp · Aug 18 '25 16:08

Update: we started using kernel 6.6.92.2-2.azl3 (osImage: Microsoft Azure Linux 3.0), and so far no issues have been observed. It's only been two days, but things are looking promising.

peterD4N · Aug 26 '25 14:08

We observed this issue on our own AWS EC2 instances recently.

  • Kernel: 5.15.0-1089-aws
  • Falco image: docker.io/falcosecurity/falco:0.41.3
  • Helm chart version: falco-6.2.5

mischmi2 · Sep 15 '25 15:09

Tentatively for /milestone 0.42.0

But expect this to be postponed to 0.43 or an intermediate patch release.

leogr · Sep 24 '25 08:09