ROCK-Kernel-Driver
NULL pointer dereference in kfd_dbgmgr_wave_control
Calling hsaKmtDbgWavefrontControl triggers a kernel bug (a NULL pointer dereference, see the dmesg output below). Afterwards ROCm seems to be "blocked" and the system cannot be soft-rebooted, so probably a mutex that was locked was never unlocked.
main.cpp:
#include <hc.hpp>
#include <hsa.h>
#include <hsakmt.h>

int main()
{
    // Get the HSA agent behind the default accelerator view.
    hc::accelerator_view view = hc::accelerator().get_default_view();
    hsa_agent_t agent = *static_cast<hsa_agent_t*>(view.get_hsa_agent());

    // Look up the KFD node id of that agent.
    unsigned int node;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NODE, &node);

    // This call triggers the kernel oops shown below.
    HsaDbgWaveMessage msg = {0};
    hsaKmtDbgWavefrontControl(node, HSA_DBG_WAVEOP_TRAP, HSA_DBG_WAVEMODE_SINGLE, 2, &msg);
    return 0;
}
Run:
hcc -hc -lhsa-runtime64 -lhsakmt main.cpp
./a.out
dmesg:
[ 279.910283] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 279.910345] IP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[ 279.910347] PGD 7e8155067 P4D 7e8155067 PUD 81419b067 PMD 0
[ 279.910352] Oops: 0000 [#1] SMP NOPTI
[ 279.910422] CPU: 17 PID: 7520 Comm: a.out Tainted: G OE 4.15.0-45-generic #48-Ubuntu
[ 279.910424] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[ 279.910477] RIP: 0010:kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[ 279.910478] RSP: 0018:ffff9c339056fd28 EFLAGS: 00010246
[ 279.910481] RAX: ffff8dee7ce4b800 RBX: ffff9c339056fdb0 RCX: 0000000000000000
[ 279.910482] RDX: 000000000000800b RSI: ffff9c339056fd38 RDI: 0000000000000000
[ 279.910484] RBP: ffff9c339056fd28 R08: ffff9c3390570000 R09: 0000000000000020
[ 279.910485] R10: 0000000000000020 R11: 0000000000000fa0 R12: ffff8deebcf27800
[ 279.910486] R13: ffff8dee760cb440 R14: ffff8dee7ce4b800 R15: ffff8dee82a73200
[ 279.910489] FS: 00007f284a99ec00(0000) GS:ffff8deedcc40000(0000) knlGS:0000000000000000
[ 279.910490] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 279.910492] CR2: 0000000000000000 CR3: 000000084e608000 CR4: 00000000003406e0
[ 279.910493] Call Trace:
[ 279.910544] kfd_ioctl_dbg_wave_control+0x120/0x1a0 [amdgpu]
[ 279.910593] kfd_ioctl+0x271/0x450 [amdgpu]
[ 279.910640] ? kfd_ioctl_destroy_queue+0x70/0x70 [amdgpu]
[ 279.910645] ? __handle_mm_fault+0x478/0x5c0
[ 279.910650] do_vfs_ioctl+0xa8/0x630
[ 279.910652] ? handle_mm_fault+0xb1/0x1f0
[ 279.910655] ? __do_page_fault+0x270/0x4d0
[ 279.910658] SyS_ioctl+0x79/0x90
[ 279.910662] do_syscall_64+0x73/0x130
[ 279.910666] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 279.910668] RIP: 0033:0x7f2848e1c5d7
[ 279.910670] RSP: 002b:00007ffd97cd0f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 279.910672] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f2848e1c5d7
[ 279.910673] RDX: 00000000010b6600 RSI: 0000000040104b10 RDI: 0000000000000003
[ 279.910675] RBP: 00000000010b6600 R08: 00007ffd97cd0fd0 R09: 0000000000000000
[ 279.910676] R10: 0000000001003010 R11: 0000000000000246 R12: 0000000040104b10
[ 279.910677] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 279.910679] Code: c7 c8 bf 83 c0 e8 bf 0d 28 e5 48 c7 c0 ea ff ff ff eb d2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 8b 06 48 89 e5 8b 90 90 00 00 00 <39> 17 75 11 48 8b 7f 10 48 8b 47 38 e8 9d fe 9b e5 48 98 5d c3
[ 279.910759] RIP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu] RSP: ffff9c339056fd28
[ 279.910760] CR2: 0000000000000000
[ 279.910763] ---[ end trace 33bd6cf8014cbbaf ]---
@misos1 Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!
@misos1 Closing the ticket. Please feel free to re-open it if you still see the issue with the latest ROCm. Thanks!
Yes, I forgot about this. It seems to be resolved now, as does #71.