Talos v1.7.0 `mlx5_core` kernel panic

Open buroa opened this issue 1 year ago • 6 comments

Bug Report

I'm getting a kernel panic on my nodes that use Mellanox cards.

Description

Talos boots, but as soon as it starts initializing the network, the kernel panics. If I disable the Mellanox PCIe card in the BIOS, Talos boots fine.

Logs

  <TASK>
  ? __die+0x23/0x70
  ? page_fault_oops+0x171/0x4c0
  ? exc_page_fault+0x171/0x130
  ? asm_exc_page_fault+0x26/0x30
  ? esw_port_metadata_get+0x19/0x30 [mlx5_core]
  ? __alloc_skb+0x8c/0x1b0
  devlink_param_notify.constprop.0+0x72/0xd0
  devl_params_register+0x130/0x2d0
  esw_offloads_init+0x165/0x180 [mlx5_core]
  mlx5_eswitch_init+0x3b2/0x650 [mlx5_core]
  mlx5_init_one_devl_locked+0x16d/0x670 [mlx5_core]
  probe_one+0x325/0x4a0 [mlx5_core]
  local_pci_probe+0x42/0xa0
  work_for_cpu_fn+0x17/0x30
  process_one_work+0x176/0x310
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xcd/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>
Modules linked in: wdat_wdt mlx5_core(+) ahci watchdog i2c_i801 lpc_ich mlxfw libahci mfd_core i2c_smbus
---[end trace 0000000000000000 ]---
RIP: 0010:esw_port_metadata_get+0x19/0x30 [mlx5_core]
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x28c0000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Environment

  • Talos version: v1.7.0
  • Kubernetes version: N/A
  • Platform: Metal

buroa, Apr 19 '24 18:04

Are you using any system extensions?

smira, Apr 19 '24 18:04

Yes, intel-ucode, nonfree-kmod-nvidia and nvidia-container-toolkit.

This may be relevant as well: https://lore.kernel.org/netdev/[email protected]/T/

buroa, Apr 19 '24 18:04

Yep, this might be fixed in a future Linux 6.6 release.

smira, Apr 19 '24 18:04

Seems to be reported upstream already: https://lore.kernel.org/lkml/[email protected]/

smira, Apr 22 '24 10:04

Thanks for keeping up on this, really appreciate it.

buroa, Apr 22 '24 11:04

Looks like 6.6.29 got mlx5 updates: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.29

smira, Apr 29 '24 11:04

Great news, thanks @smira!

buroa, May 01 '24 16:05