PANIC in rd_fill_getroute_reply

Open andrenth opened this issue 2 years ago • 3 comments

This message has shown up twice in the logs during the current testing period of version 1.1:

PANIC in rd_fill_getroute_reply():
Invalid FIB action (6) in FIB while being processed by CPS block in rd_fill_getroute_reply

The following kernel logs have appeared at the same time in kern.log:

Jul 19 05:01:25 gtk1 kernel: [395312.386002] show_signal: 1 callbacks suppressed
Jul 19 05:01:25 gtk1 kernel: [395312.386004] traps: lcore-worker-8[14109] general protection fault ip:7fcd8f7f4c50 sp:7fcd8c3e7050 error:0 in libgcc_s.so.1[7fcd8f7e8000+12000]
Jul 19 05:01:28 gtk1 kernel: [395315.776939] BUG: kernel NULL pointer dereference, address: 0000000000000010
Jul 19 05:01:28 gtk1 kernel: [395315.777020] #PF: supervisor read access in kernel mode
Jul 19 05:01:28 gtk1 kernel: [395315.777064] #PF: error_code(0x0000) - not-present page
Jul 19 05:01:28 gtk1 kernel: [395315.777108] PGD 0 P4D 0 
Jul 19 05:01:28 gtk1 kernel: [395315.777138] Oops: 0000 [#1] SMP PTI
Jul 19 05:01:28 gtk1 kernel: [395315.777175] CPU: 30 PID: 14131 Comm: lcore-worker-30 Tainted: G           OE     5.4.0-117-generic #132-Ubuntu
Jul 19 05:01:28 gtk1 kernel: [395315.777254] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.06.E006.013120181511 01/31/2018
Jul 19 05:01:28 gtk1 kernel: [395315.777343] RIP: 0010:vmacache_find+0x29/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.777383] Code: 00 66 66 66 66 90 55 45 31 c0 65 48 8b 0c 25 c0 bb 01 00 48 89 e5 48 3b b9 10 08 00 00 74 05 4c 89 c0 5d c3 f6 41 26 20 75 f5 <48> 8b 47 10 48 3b 81 20 08 00 00 75 44 48 89 f0 ba 04 00 00 00 45
Jul 19 05:01:28 gtk1 kernel: [395315.777523] RSP: 0000:ffffb10c8ebafa00 EFLAGS: 00010246
Jul 19 05:01:28 gtk1 kernel: [395315.777568] RAX: ffff8999b1ac1740 RBX: 0000000000000000 RCX: ffff8999b1ac1740
Jul 19 05:01:28 gtk1 kernel: [395315.777626] RDX: 0000000000000000 RSI: 000000350ba54000 RDI: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777684] RBP: ffffb10c8ebafa00 R08: 0000000000000000 R09: ffffb10c8ebafbc0
Jul 19 05:01:28 gtk1 kernel: [395315.777742] R10: ffff89999f05ea80 R11: ffff89999f05ea80 R12: 000000350ba54000
Jul 19 05:01:28 gtk1 kernel: [395315.777799] R13: 000000350ba54000 R14: 0000000000000000 R15: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777858] FS:  00007fcd76fed400(0000) GS:ffff89b9beb80000(0000) knlGS:0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 19 05:01:28 gtk1 kernel: [395315.777970] CR2: 0000000000000010 CR3: 0000003fb8e0a003 CR4: 00000000000606e0
Jul 19 05:01:28 gtk1 kernel: [395315.778027] Call Trace:
Jul 19 05:01:28 gtk1 kernel: [395315.778060]  find_vma+0x1b/0x70
Jul 19 05:01:28 gtk1 kernel: [395315.778096]  ? __switch_to_asm+0x34/0x70
Jul 19 05:01:28 gtk1 kernel: [395315.778135]  find_extend_vma+0x22/0x90
Jul 19 05:01:28 gtk1 kernel: [395315.778171]  __get_user_pages+0xc3/0x7d0
Jul 19 05:01:28 gtk1 kernel: [395315.778209]  get_user_pages_remote+0x146/0x230
Jul 19 05:01:28 gtk1 kernel: [395315.778257]  kni_fifo_trans_pa2va+0x1d1/0x2c0 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778305]  kni_net_release_fifo_phy+0x36/0x40 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778352]  kni_dev_remove+0x33/0x40 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778394]  kni_release+0xab/0x160 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778435]  __fput+0xcc/0x260
Jul 19 05:01:28 gtk1 kernel: [395315.778466]  ____fput+0xe/0x10
Jul 19 05:01:28 gtk1 kernel: [395315.778497]  task_work_run+0x8f/0xb0
Jul 19 05:01:28 gtk1 kernel: [395315.778533]  do_exit+0x36e/0xaf0
Jul 19 05:01:28 gtk1 kernel: [395315.778567]  ? _cond_resched+0x19/0x30
Jul 19 05:01:28 gtk1 kernel: [395315.778602]  ? mutex_lock+0x13/0x40
Jul 19 05:01:28 gtk1 kernel: [395315.778636]  ? pipe_wait+0xaf/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.778670]  do_group_exit+0x47/0xb0
Jul 19 05:01:28 gtk1 kernel: [395315.778706]  get_signal+0x169/0x890
Jul 19 05:01:28 gtk1 kernel: [395315.778742]  do_signal+0x34/0x6c0
Jul 19 05:01:28 gtk1 kernel: [395315.778775]  ? __vfs_read+0x29/0x40
Jul 19 05:01:28 gtk1 kernel: [395315.778809]  ? vfs_read+0xab/0x160
Jul 19 05:01:28 gtk1 kernel: [395315.778845]  exit_to_usermode_loop+0xbf/0x160
Jul 19 05:01:28 gtk1 kernel: [395315.778885]  do_syscall_64+0x163/0x190
Jul 19 05:01:28 gtk1 kernel: [395315.778922]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 19 05:01:28 gtk1 kernel: [395315.778967] RIP: 0033:0x7fcd8ff553cc
Jul 19 05:01:28 gtk1 kernel: [395315.780417] Code: Bad RIP value.
Jul 19 05:01:28 gtk1 kernel: [395315.781845] RSP: 002b:00007fcd76fea2a0 EFLAGS: 00003246 ORIG_RAX: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.783334] RAX: fffffffffffffe00 RBX: 000055d3be8756e0 RCX: 00007fcd8ff553cc
Jul 19 05:01:28 gtk1 kernel: [395315.784807] RDX: 0000000000000001 RSI: 00007fcd76fea2ef RDI: 0000000000000084
Jul 19 05:01:28 gtk1 kernel: [395315.786258] RBP: 00007fcd76fea2ef R08: 0000000000000000 R09: 00007fcd76fea2f0
Jul 19 05:01:28 gtk1 kernel: [395315.787715] R10: 000055d3be083500 R11: 0000000000003246 R12: 000055d3be874060
Jul 19 05:01:28 gtk1 kernel: [395315.789172] R13: 0000000000001680 R14: 000000000000001e R15: 0000000000000087
Jul 19 05:01:28 gtk1 kernel: [395315.790639] Modules linked in: rte_kni(OE) binfmt_misc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl joydev input_leds intel_cstate ipmi_si ipmi_devintf ipmi_msghandler mgag200 drm_vram_helper ttm drm_kms_helper fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me mei ioatdma mac_hid sch_fq_codel uio_pci_generic uio drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 isci hid_generic ixgbe igb usbhid xfrm_algo ahci libsas i2c_algo_bit lpc_ich scsi_transport_sas libahci i2c_i801 crc32_pclmul hid dca mdio wmi [last unloaded: rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.801477] CR2: 0000000000000010
Jul 19 05:01:28 gtk1 kernel: [395315.803041] ---[ end trace 6f8b3699caf10e87 ]---
Jul 19 05:01:28 gtk1 kernel: [395315.851009] RIP: 0010:vmacache_find+0x29/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.852584] Code: 00 66 66 66 66 90 55 45 31 c0 65 48 8b 0c 25 c0 bb 01 00 48 89 e5 48 3b b9 10 08 00 00 74 05 4c 89 c0 5d c3 f6 41 26 20 75 f5 <48> 8b 47 10 48 3b 81 20 08 00 00 75 44 48 89 f0 ba 04 00 00 00 45
Jul 19 05:01:28 gtk1 kernel: [395315.855858] RSP: 0000:ffffb10c8ebafa00 EFLAGS: 00010246
Jul 19 05:01:28 gtk1 kernel: [395315.857487] RAX: ffff8999b1ac1740 RBX: 0000000000000000 RCX: ffff8999b1ac1740
Jul 19 05:01:28 gtk1 kernel: [395315.859132] RDX: 0000000000000000 RSI: 000000350ba54000 RDI: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.860763] RBP: ffffb10c8ebafa00 R08: 0000000000000000 R09: ffffb10c8ebafbc0
Jul 19 05:01:28 gtk1 kernel: [395315.862411] R10: ffff89999f05ea80 R11: ffff89999f05ea80 R12: 000000350ba54000
Jul 19 05:01:28 gtk1 kernel: [395315.864076] R13: 000000350ba54000 R14: 0000000000000000 R15: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.865754] FS:  00007fcd76fed400(0000) GS:ffff89b9beb80000(0000) knlGS:0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.867461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 19 05:01:28 gtk1 kernel: [395315.869171] CR2: 00007fcd8ff553a2 CR3: 0000003fb8e0a003 CR4: 00000000000606e0
Jul 19 05:01:28 gtk1 kernel: [395315.870917] Fixing recursive fault but reboot is needed!

An attempt to restart Gatekeeper causes an immediate reboot.

andrenth avatar Jul 19 '22 20:07 andrenth

Pull request #593 logs more information about the memory corruption and allows Gatekeeper to keep running. It is not a fix, only a palliative while we work on a final solution.
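
For readers unfamiliar with this kind of mitigation, here is a minimal sketch of the general pattern: log the corrupted entry and skip it instead of panicking. It is not the code from pull request #593. The enum values, `struct fib_entry`, and `fill_reply_entry()` are hypothetical names made up for illustration.

```c
#include <stdio.h>

/* Hypothetical action codes; Gatekeeper's real enum differs. */
enum fib_action {
	FIB_ACT_FORWARD = 0,
	FIB_ACT_DROP = 1,
	FIB_ACT_COUNT /* Any value >= FIB_ACT_COUNT is invalid. */
};

/* Hypothetical, reduced FIB entry. */
struct fib_entry {
	unsigned int action;
};

/*
 * Returns 0 if the entry is valid and would be added to the reply,
 * or -1 if the entry looks corrupted and was skipped.
 */
static int
fill_reply_entry(const struct fib_entry *fib, unsigned int index)
{
	switch (fib->action) {
	case FIB_ACT_FORWARD:
	case FIB_ACT_DROP:
		/* A real implementation would copy the route into the reply. */
		return 0;
	default:
		/* Log enough context to investigate later, then keep running. */
		fprintf(stderr,
			"%s(): invalid FIB action (%u) at index %u; skipping entry\n",
			__func__, fib->action, index);
		return -1;
	}
}
```

The trade-off is that the process stays up while the underlying corruption remains, so the log line has to carry enough detail to track down the root cause.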

AltraMayor avatar Jul 30 '22 18:07 AltraMayor

This issue is the combination of two problems: 1. a bug in the LPM iterator, and 2. a bug in the KNI driver that seems to be triggered when Gatekeeper terminates without releasing the resources associated with the kernel module of the KNI. Pull request #594 addresses the first problem.
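
To illustrate the second problem, the sketch below shows one common way for a process to release its resources before it exits on a signal. It is an assumption-laden illustration, not Gatekeeper's shutdown code: `release_kni_resources()` is a hypothetical stand-in for whatever releases the KNI interfaces, and `run_until_signaled()` and `demo_work()` are made-up names. Note that it does not cover an abort caused by a panic, which is the path seen in this report.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Set by the signal handler; polled by the main loop. */
static volatile sig_atomic_t exiting = 0;

/* Records whether the (hypothetical) cleanup ran. */
static bool kni_released = false;

static void
on_signal(int signum)
{
	(void)signum;
	exiting = 1; /* Only set a flag; do the cleanup outside the handler. */
}

/* Hypothetical stand-in for releasing the KNI interfaces. */
static void
release_kni_resources(void)
{
	kni_released = true;
}

/* Example unit of work that asks the process to stop. */
static void
demo_work(void)
{
	raise(SIGTERM);
}

/*
 * Runs work() until SIGTERM arrives, then releases resources.
 * Returns 0 on a clean shutdown, or -1 if the handler cannot be installed.
 */
static int
run_until_signaled(void (*work)(void))
{
	if (signal(SIGTERM, on_signal) == SIG_ERR)
		return -1;

	while (!exiting) {
		if (work != NULL)
			work();
	}

	/* Release before exiting so the kernel module has nothing left to tear down. */
	release_kni_resources();
	return 0;
}
```

Calling `run_until_signaled(demo_work)` returns after one iteration with the cleanup already done.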

AltraMayor avatar Sep 06 '22 19:09 AltraMayor

Given that this issue is no longer reproducible in production, I'm moving the investigation of the KNI kernel module to release 1.2.

AltraMayor avatar Oct 06 '22 19:10 AltraMayor

Pull request #678 dropped the KNI library, so the last problem no longer exists.

AltraMayor avatar Mar 06 '24 15:03 AltraMayor