watchdog: BUG: soft lockup - CPU#4 stuck for 27s! (in Azure)
Describe the bug
Incident causing the kernel bug:
[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.204594] watchdog: BUG: soft lockup - CPU#4 stuck for 27s! [falco:20765]
[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.208746] Modules linked in: xt_set ip_set_hash_ipport xt_recent xt_statistic xt_nfacct ipt_REJECT nf_reject_ipv4 xt_multiport xt_nat nfnetlink_acct ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ip_set wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 libchacha poly1305_x86_64 ip6_udp_tunnel udp_tunnel xt_MASQUERADE xt_addrtype nf_tables ip6table_filter ip6table_nat ip6table_mangle ip6_tables xt_comment iptable_nat nf_nat iptable_mangle ebtable_nat ebtables br_netfilter xt_owner iptable_security xt_conntrack iptable_filter ip_tables overlay cfg80211 xsk_diag nf_conntrack_netlink nfnetlink mlx5_ib ib_uverbs ib_core mlx5_core mlxfw intel_rapl_msr intel_rapl_common skx_edac_common nfit kvm_intel kvm irqbypass hyperv_fb evdev mousedev sch_fq_codel fuse configfs hid_generic hv_storvsc crc32c_intel scsi_transport_fc hid_hyperv hid hv_netvsc hv_utils hyperv_keyboard hv_balloon nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 autofs4
[REDACTED DATE] [REDACTED_MACHINE] kernel: [54707.208797] CPU: 4 PID: 20765 Comm: falco Kdump: loaded Not tainted 5.15.182.1-1.cm2 #1
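Note: the full lockup report can be pulled from the kernel log on an affected node; a minimal sketch, assuming journalctl is available (plain dmesg works too):
# print the soft-lockup line plus the stack trace that follows it
journalctl -k --no-pager | grep -B 2 -A 40 'soft lockup'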
How to reproduce it
I did not manage to reproduce the error.
Expected behaviour
No kernel bug.
Environment
- Falco version: 0.39.2
- System info:
- Cloud provider or hardware configuration: Azure
- OS:
- Kernel: 5.15.182.1-1.cm2
- Installation method: kubernetes
Additional context
Falco config:
append_output: []
base_syscalls:
  custom_set: []
  repair: false
buffered_outputs: false
config_files:
  - /etc/falco/config.d
engine:
  kind: modern_ebpf
  modern_ebpf:
    buf_size_preset: 7
    cpus_for_each_buffer: 4
    drop_failed_exit: false
falco_libs:
  thread_table_size: 262144
file_output:
  enabled: false
  filename: ./events.txt
  keep_alive: false
grpc:
  bind_address: unix:///run/falco/falco.sock
  enabled: false
  threadiness: 0
grpc_output:
  enabled: false
http_output:
  ca_bundle: ""
  ca_cert: ""
  ca_path: /etc/falco/certs/
  client_cert: /etc/falco/certs/client/client.crt
  client_key: /etc/falco/certs/client/client.key
  compress_uploads: false
  echo: false
  enabled: false
  insecure: false
  keep_alive: false
  mtls: false
  url: ""
  user_agent: falcosecurity/falco
json_include_message_property: false
json_include_output_property: true
json_include_tags_property: true
json_output: true
libs_logger:
  enabled: false
  severity: error
load_plugins: []
log_level: error
log_stderr: true
log_syslog: false
metrics:
  convert_memory_to_mb: true
  enabled: false
  include_empty_values: false
  interval: 1h
  kernel_event_counters_enabled: true
  kernel_event_counters_per_cpu_enabled: false
  libbpf_stats_enabled: true
  output_rule: true
  resource_utilization_enabled: true
  rules_counters_enabled: true
  state_counters_enabled: true
output_timeout: 2000
outputs_queue:
  capacity: 0
plugins:
  - init_config: null
    library_path: libk8saudit.so
    name: k8saudit
    open_params: http://:9765/k8s-audit
  - library_path: libcloudtrail.so
    name: cloudtrail
  - init_config: ""
    library_path: libjson.so
    name: json
priority: debug
program_output:
  enabled: false
  keep_alive: false
  program: 'jq ''{text: .output}'' | curl -d @- -X POST https://hooks.slack.com/services/XXX'
rule_matching: first
rules_files:
  - /etc/falco/falco_rules.yaml
  - /etc/falco/falco-incubating_rules.yaml
  - /etc/falco/falco-sandbox_rules.yaml
stdout_output:
  enabled: true
syscall_event_drops:
  actions:
    - log
    - alert
  max_burst: 1
  rate: 0.03333
  simulate_drops: false
  threshold: 0.1
syscall_event_timeouts:
  max_consecutives: 1000
syslog_output:
  enabled: false
time_format_iso_8601: true
watch_config_files: true
webserver:
  enabled: true
  k8s_healthz_endpoint: /healthz
  listen_port: 8765
  prometheus_metrics_enabled: false
  ssl_certificate: /etc/falco/falco.pem
  ssl_enabled: false
  threadiness: 0
FYI: we upgraded to 0.40 and we have the same issue.
We ran into the same problem in our AKS cluster.
Cluster details:
Kubernetes: 1.30.12
Node OS: AKSAzureLinux
Node OS version: CBL-Mariner 2.0.20250602
Kernel: 5.15.182.1-1.cm2
Falco details:
Falco chart: falco-4.21.3
Falco version: 0.40.0
Error:
Jul 17 04:39:35 redacted-vm-name kernel: watchdog: BUG: soft lockup - CPU#16 stuck for 26s! [falco:3774060]
Jul 17 04:39:35 redacted-vm-name kernel: Modules linked in: nft_chain_nat xfrm_user xfrm_algo cmac cifs cifs_arc4 cifs_md4 dns_resolver xt_REDIRECT xt_multiport ip_set_hash_ipport xt_set ip_set_hash_net ip_set_list_set xt_recent xt_statistic xt_MASQUERADE ipt_REJECT nf_reject_ipv4 ip_set xt_nat xt_addrtype nft_counter nft_compat nf_tables ip6table_filter ip6table_nat ip6table_mangle ip6_tables xt_comment iptable_nat nf_nat ebtable_nat ebtables iptable_mangle br_netfilter xt_owner iptable_security xt_conntrack mlx5_ib ib_uverbs ib_core mlx5_core mlxfw xsk_diag nf_conntrack_netlink nfnetlink iptable_filter ip_tables overlay cfg80211 intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass hyperv_fb evdev mousedev sch_fq_codel fuse configfs hid_generic hid_hyperv hv_storvsc hid crc32c_intel hv_netvsc scsi_transport_fc hv_utils hv_balloon hyperv_keyboard nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 autofs4
Jul 17 04:39:35 redacted-vm-name kernel: CPU: 16 PID: 3774060 Comm: falco Kdump: loaded Not tainted 5.15.182.1-1.cm2 #1
Jul 17 04:39:35 redacted-vm-name kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
Jul 17 04:39:35 redacted-vm-name kernel: RIP: 0010:do_task_stat+0xc27/0xe20
Jul 17 04:39:35 redacted-vm-name kernel: Code: 10 02 00 00 4c 8b ab c0 01 00 00 48 89 d0 4c 03 a0 c0 0a 00 00 4c 03 b8 c8 0a 00 00 4c 03 a8 80 0a 00 00 48 8b 80 30 0a 00 00 <48> 2d 30 0a 00 00 48 39 c2 75 d9 89 8d f0 fe ff ff 4c 89 85 68 ff
Jul 17 04:39:35 redacted-vm-name kernel: RSP: 0018:ffff9ddc8619b8c8 EFLAGS: 00000246
Jul 17 04:39:35 redacted-vm-name kernel: RAX: ffff8b88fe268a30 RBX: ffff8b88dae26a40 RCX: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: RDX: ffff8b8a52584000 RSI: 0000000000000001 RDI: ffff8b8901535280
Jul 17 04:39:35 redacted-vm-name kernel: RBP: ffff9ddc8619b9d8 R08: ffff8b88c1e31c00 R09: ffff8b88c14ed258
Jul 17 04:39:35 redacted-vm-name kernel: R10: 0000000000000096 R11: 2000000181005ccf R12: 0000a1d55601f4e3
Jul 17 04:39:35 redacted-vm-name kernel: R13: 0000000000000000 R14: 0000000000002e78 R15: 00000021f5b5befd
Jul 17 04:39:35 redacted-vm-name kernel: FS: 0000788795e452c0(0000) GS:ffff8ba83fa00000(0000) knlGS:0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 17 04:39:35 redacted-vm-name kernel: CR2: 00007142d84c3f40 CR3: 0000000de0812005 CR4: 0000000000370ee0
Jul 17 04:39:35 redacted-vm-name kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 17 04:39:35 redacted-vm-name kernel: Call Trace:
Jul 17 04:39:35 redacted-vm-name kernel: <TASK>
Jul 17 04:39:35 redacted-vm-name kernel: proc_tgid_stat+0x14/0x20
Jul 17 04:39:35 redacted-vm-name kernel: proc_single_show+0x52/0xc0
Jul 17 04:39:35 redacted-vm-name kernel: seq_read_iter+0x124/0x4b0
Jul 17 04:39:35 redacted-vm-name kernel: seq_read+0xf2/0x140
Jul 17 04:39:35 redacted-vm-name kernel: ? 0xffffffffb2000000
Jul 17 04:39:35 redacted-vm-name kernel: vfs_read+0x9f/0x1a0
Jul 17 04:39:35 redacted-vm-name kernel: ksys_read+0x67/0xf0
Jul 17 04:39:35 redacted-vm-name kernel: __x64_sys_read+0x19/0x20
Jul 17 04:39:35 redacted-vm-name kernel: x64_sys_call+0x1dba/0x1fa0
Jul 17 04:39:35 redacted-vm-name kernel: do_syscall_64+0x56/0x80
Jul 17 04:39:35 redacted-vm-name kernel: ? putname+0x59/0x70
Jul 17 04:39:35 redacted-vm-name kernel: ? kmem_cache_free+0x243/0x290
Jul 17 04:39:35 redacted-vm-name kernel: ? _copy_to_user+0x20/0x30
Jul 17 04:39:35 redacted-vm-name kernel: ? cp_new_stat+0x152/0x180
Jul 17 04:39:35 redacted-vm-name kernel: ? __do_sys_newfstatat+0x49/0x70
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_prog_6ad5a3e64cf82799_sys_exit+0x65/0x8ec
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_trace_run2+0x3d/0xa0
Jul 17 04:39:35 redacted-vm-name kernel: ? __bpf_trace_sys_exit+0x9/0x10
Jul 17 04:39:35 redacted-vm-name kernel: ? syscall_exit_to_user_mode+0x2c/0x50
Jul 17 04:39:35 redacted-vm-name kernel: ? do_syscall_64+0x63/0x80
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_ringbuf_output+0x4a/0xf0
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_prog_9cee710a6b25781f_openat_x+0xc47/0x1f0c
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_trace_run2+0x3d/0xa0
Jul 17 04:39:35 redacted-vm-name kernel: ? __bpf_trace_sys_exit+0x9/0x10
Jul 17 04:39:35 redacted-vm-name kernel: ? syscall_exit_to_user_mode+0x2c/0x50
Jul 17 04:39:35 redacted-vm-name kernel: ? do_syscall_64+0x63/0x80
Jul 17 04:39:35 redacted-vm-name kernel: ? irqentry_exit_to_user_mode+0xe/0x20
Jul 17 04:39:35 redacted-vm-name kernel: ? irqentry_exit+0x1d/0x30
Jul 17 04:39:35 redacted-vm-name kernel: ? sysvec_hyperv_stimer0+0x4e/0x90
Jul 17 04:39:35 redacted-vm-name kernel: entry_SYSCALL_64_after_hwframe+0x6c/0xd6
Jul 17 04:39:35 redacted-vm-name kernel: RIP: 0033:0x788795f4b25c
Jul 17 04:39:35 redacted-vm-name kernel: Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 d9 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 2f d6 f8 ff 48
Jul 17 04:39:35 redacted-vm-name kernel: RSP: 002b:00007ffc419e29a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Jul 17 04:39:35 redacted-vm-name kernel: RAX: ffffffffffffffda RBX: 00007887945a5000 RCX: 0000788795f4b25c
Jul 17 04:39:35 redacted-vm-name kernel: RDX: 0000000000000400 RSI: 000078876b877800 RDI: 0000000000000178
Jul 17 04:39:35 redacted-vm-name kernel: RBP: 00007887960225e0 R08: 0000000000000000 R09: 0000788795e40740
Jul 17 04:39:35 redacted-vm-name kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007ffc419e2b60
Jul 17 04:39:35 redacted-vm-name kernel: R13: 0000000000000d68 R14: 00007887960219e0 R15: 0000000000000d68
Jul 17 04:39:35 redacted-vm-name kernel: </TASK>
This leads to node restarts.
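A node reboot after a soft lockup usually means the watchdog is configured to panic (the traces above show Kdump: loaded, and a panic-triggered kdump ends in a reboot). A quick way to confirm on a node, as a diagnostic sketch only:
# 1 = kernel panics (and, with kdump, reboots) when a soft lockup is detected
sysctl kernel.softlockup_panic
# a soft lockup is reported after roughly twice this threshold, in seconds
sysctl kernel.watchdog_thresh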
Hey, thanks for letting us know about the issue! Is everyone here using the modern eBPF driver?
If that's the case, the latest Falco (0.41.3) ships with https://github.com/falcosecurity/libs/pull/2397, which might fix the issue.
From the shared dmesg, it actually seems the issue lies in the openat BPF program:
Jul 17 04:39:35 redacted-vm-name kernel: ? bpf_prog_9cee710a6b25781f_openat_x+0xc47/0x1f0c
That is really weird. I'll give it a look! Thanks again for all the info!
@FedeDP yeah we are using modern_ebpf
Is the latest Falco giving you the same issue?
Anyway, since both reports come from the exact same kernel, it might be a kernel bug too. Did you ever experience the same issue on older kernels?
It all started in 0.39
Our problem started on 8 July 2025, apparently after a kernel patch, so the problem might be related to the kernel. According to the logs, the kernel version changed.
Before:
"containerd://1.6.26","kernelVersion":"5.15.180.1-1.cm2","kubeProxyVersion":"v1.30.12"
After:
"containerd://1.6.26","kernelVersion":"5.15.182.1-1.cm2","kubeProxyVersion":"v1.30.12"
@FedeDP we run the latest version (0.41.3) of Falco, which includes https://github.com/falcosecurity/libs/pull/2397.
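(For completeness, the running version can be double-checked in-cluster; a small sketch, assuming a standard Helm install named falco in the falco namespace:
# print the Falco version from a running pod of the DaemonSet
kubectl -n falco exec daemonset/falco -- falco --version)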
Thanks for the info! Are you able to roll back the kernel version, by chance?
In AKS, rollback is not supported. It is possible to spin up a new node pool with an older node kernel version and then migrate workloads, but that would introduce other internal issues (IP, HW capacity, ...). We have these problems only in the prod cluster, where testing is not easy.
I see; it was worth a try :)
I am really not sure; if @jcchavezs confirms that the timing of the issue matches the kernel upgrade, I think the kernel-bug stock rises :)
Kernel crashes happened on nodes with kernel version 5.15.182.1-1.cm2; for the previous version (namely 5.15.180.1-1.cm2) we don't have any reports of crashes. The issue for us started around July 2nd.
for the previous version (namely 5.15.180.1-1.cm2) we don't have any reports of crashes.
Thanks! Checking https://packages.microsoft.com/yumrepos/cbl-mariner-2.0-prod-base-srpms/Packages/k/ (is it the correct one?), I see:
kernel-5.15.184.1-1.cm2.src.rpm 09-Jul-2025 00:59 196.1 MB
I see that there is already a new version out there that contains the .184 kernel: https://github.com/microsoft/azurelinux/releases/tag/2.0.20250701-2.0. Hopefully it will be rolled out soon and fix the issue. 🙏
Today we got a kernel update. We're now on version 5.15.184.1-1.cm2. Let's see if that helps 🤞
Unfortunately even after applying the kernel patch the issue persists. We experienced node restarts both yesterday and today.
Do I understand correctly that this is more of a kernel issue than a Falco issue?
It should be; I mean, considering that only people on Azure Linux are experiencing the issue, starting from the 5.15.182.1-1.cm2 kernel.
We initially had a case open with Microsoft on this issue (a crashing Linux kernel that takes nodes down because of watchdog: BUG: soft lockup - CPU#), but after MS found out that everything works fine with Falco disabled, they asked us to talk to Falco directly to resolve it and closed the case (I work together with peterD4N at the same company).
We also used to communicate with MS through https://github.com/Azure/AKS/issues on previous issues related to Azure k8s images. It should be possible to open an issue there, but we are not sure we have enough knowledge on this topic to describe the issue properly.
@akorp question: are you able to try a different driver, e.g. kmod or the legacy eBPF probe?
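With the official Helm chart that switch is a values change; a minimal sketch, assuming the falcosecurity/falco chart (values keys may vary across chart versions):
# switch from the modern eBPF driver to the kernel module ...
helm upgrade falco falcosecurity/falco --namespace falco --reuse-values --set driver.kind=kmod
# ... or to the legacy eBPF probe
helm upgrade falco falcosecurity/falco --namespace falco --reuse-values --set driver.kind=ebpf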
I mean, considering that only people on Azure Linux are experiencing the issue, starting from the 5.15.182.1-1.cm2 kernel.
We're also experiencing this issue on rke2 clusters running on bare metal with kernel 5.15.0-144-generic.
Details:
Kubernetes Version: v1.31.7+rke2r1
Falco Version: docker.io/falcosecurity/falco:0.41.3
Kernel Version: 5.15.0-144-generic
Node OS / Version: Ubuntu 22.04.5
I tested older Falco versions, which caused the same issue.
However, Ubuntu 24.04.3 with kernel 6.8.0-64-generic does not show this issue.
Thanks for the hint @michaelSchmidMaloon; I think we can then rule out the "kernel issue" option. Can you share the dmesg error?
I think this is a kernel issue; I saw this bug on the Ubuntu bug tracker as well: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2118407.
It seems the issue was caused by this kernel change https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?h=Ubuntu-5.15.0-144.154&id=8274eb7192d7bf36c18a026f89164fa99a5dcb99 and fixed by this commit in the kernel: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.15.y&id=32caa50275cc52a382ca77b53ce3d6204d509e14.
IMO we can check whether Mariner also made a similar change on their side and didn't add the backported fix; one way to check is sketched below. If that is the case, we can ask them to add the backport.
@FedeDP @michaelSchmidMaloon
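One way to check on an affected Azure Linux node is to grep the kernel package changelog for the upstream fix; a rough sketch, assuming the changelog mentions the affected function or commit (it may not):
# look for the do_task_stat change / backport in the installed kernel package
rpm -q --changelog kernel | grep -i -B 1 -A 2 'do_task_stat'
# the corresponding source rpms are published at
# https://packages.microsoft.com/yumrepos/cbl-mariner-2.0-prod-base-srpms/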
Great to hear, @gargayush18! Thank you very much for the info!
Yeah @FedeDP, we still need to confirm whether this is the case with the current Mariner version; I just took it as an assumption.
@akorp question: are you able to try a different driver, e.g. kmod or the legacy eBPF probe?
Sorry for the delayed reply. Independently of this issue, we are migrating our affected cluster to a new image (Cilium), so I hope the issue does not affect that image, since it uses a different kernel.
Update: we started to use kernel 6.6.92.2-2.azl3 (osImage: Microsoft Azure Linux 3.0), and so far no issues have been observed. It's only been two days, but things are looking promising.
We observed this issue on our own AWS EC2 instances recently.
Kernel: 5.15.0-1089-aws
Falco Image: docker.io/falcosecurity/falco:0.41.3
Helm chart version: falco-6.2.5
Tentatively for /milestone 0.42.0
But expect this to be postponed to 0.43 or an intermediate patch release.