Talos v1.7.0 `mlx5_core` kernel panic
Bug Report
I'm getting a kernel panic on my nodes that use Mellanox cards.
Description
Talos boots, but once it starts initializing the network, the kernel panics. If I disable the Mellanox PCIe card in the BIOS, Talos boots fine.
Logs
<TASK>
? __die+0x23/0x70
? page_fault_oops+0x171/0x4c0
? exc_page_fault+0x171/0x130
? asm_exc_page_fault+0x26/0x30
? esw_port_metadata_get+0x19/0x30 [mlx5_core]
? __alloc_skb+0x8c/0x1b0
devlink_param_notify.constprop.0+0x72/0xd0
devl_params_register+0x130/0x2d0
esw_offloads_init+0x165/0x180 [mlx5_core]
mlx5_eswitch_init+0x3b2/0x650 [mlx5_core]
mlx5_init_one_devl_locked+0x16d/0x670 [mlx5_core]
probe_one+0x325/0x4a0 [mlx5_core]
local_pci_probe+0x42/0xa0
work_for_cpu_fn+0x17/0x30
process_one_work+0x176/0x310
? __pfx_worker_thread+0x10/0x10
kthread+0xcd/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Modules linked in: wdat_wdt mlx5_core(+) ahci watchdog i2c_i801 lpc_ich mlxfw libahci mfd_core i2c_smbus
---[ end trace 0000000000000000 ]---
RIP: 0010:esw_port_metadata_get+0x19/0x30 [mlx5_core]
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x28c0000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Environment
- Talos version: v1.7.0
- Kubernetes version: N/A
- Platform: Metal
Are you using any system extensions?
Yes, intel-ucode, nonfree-kmod-nvidia and nvidia-container-toolkit.
This may be relevant as well: https://lore.kernel.org/netdev/[email protected]/T/
Yep, this might be fixed in a future Linux 6.6 release.
Seems to be reported upstream already: https://lore.kernel.org/lkml/[email protected]/
Thanks for keeping up on this, really appreciate it.
Looks like 6.6.29 got mlx5 updates: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.29
Great news, thanks @smira!
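For anyone checking whether a node already runs a kernel at or past the 6.6.29 release mentioned above, here is a minimal Python sketch. It assumes the release string looks like the output of `uname -r`. The helper names and the example strings are hypothetical. Whether 6.6.29 actually resolves this panic is not confirmed in this thread, and the comparison only makes sense within the 6.6.y series.

```python
import re

# Assumption: 6.6.29 is the release whose changelog lists mlx5 updates;
# this thread does not confirm that it fixes the panic.
MLX5_UPDATES_IN = (6, 6, 29)


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '6.6.28-talos'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '6.6') is treated as patch 0.
    return int(major), int(minor), int(patch or 0)


def has_mlx5_updates(release: str) -> bool:
    """Return True if the release is at or past 6.6.29 (6.6.y series only)."""
    return parse_kernel_release(release) >= MLX5_UPDATES_IN


if __name__ == "__main__":
    # Hypothetical release strings, for illustration only.
    for release in ("6.6.28-talos", "6.6.29-talos"):
        print(release, has_mlx5_updates(release))
```

Comparing tuples of integers avoids the usual pitfall of string comparison, where "6.6.9" would sort after "6.6.29".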