NULL pointer dereference in kfd_dbgmgr_wave_control

Open misos1 opened this issue 6 years ago • 1 comment

Calling hsaKmtDbgWavefrontControl triggers a kernel bug (a NULL pointer dereference, see dmesg below). Afterwards ROCm appears to be "blocked" and the system cannot be soft-rebooted, so probably a locked mutex was never unlocked.

main.cpp:

#include <hc.hpp>
#include <hsa.h>
#include <hsakmt.h>

int main()
{
	hc::accelerator_view view = hc::accelerator().get_default_view();
	hsa_agent_t agent = *static_cast<hsa_agent_t*>(view.get_hsa_agent());
	unsigned int node;
	hsa_agent_get_info(agent, HSA_AGENT_INFO_NODE, &node);

	HsaDbgWaveMessage msg = {0};
	hsaKmtDbgWavefrontControl(node, HSA_DBG_WAVEOP_TRAP, HSA_DBG_WAVEMODE_SINGLE, 2, &msg);

	return 0;
}

Run:

hcc -hc -lhsa-runtime64 -lhsakmt main.cpp
./a.out

dmesg:

[  279.910283] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  279.910345] IP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[  279.910347] PGD 7e8155067 P4D 7e8155067 PUD 81419b067 PMD 0 
[  279.910352] Oops: 0000 [#1] SMP NOPTI
[  279.910422] CPU: 17 PID: 7520 Comm: a.out Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[  279.910424] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[  279.910477] RIP: 0010:kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[  279.910478] RSP: 0018:ffff9c339056fd28 EFLAGS: 00010246
[  279.910481] RAX: ffff8dee7ce4b800 RBX: ffff9c339056fdb0 RCX: 0000000000000000
[  279.910482] RDX: 000000000000800b RSI: ffff9c339056fd38 RDI: 0000000000000000
[  279.910484] RBP: ffff9c339056fd28 R08: ffff9c3390570000 R09: 0000000000000020
[  279.910485] R10: 0000000000000020 R11: 0000000000000fa0 R12: ffff8deebcf27800
[  279.910486] R13: ffff8dee760cb440 R14: ffff8dee7ce4b800 R15: ffff8dee82a73200
[  279.910489] FS:  00007f284a99ec00(0000) GS:ffff8deedcc40000(0000) knlGS:0000000000000000
[  279.910490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  279.910492] CR2: 0000000000000000 CR3: 000000084e608000 CR4: 00000000003406e0
[  279.910493] Call Trace:
[  279.910544]  kfd_ioctl_dbg_wave_control+0x120/0x1a0 [amdgpu]
[  279.910593]  kfd_ioctl+0x271/0x450 [amdgpu]
[  279.910640]  ? kfd_ioctl_destroy_queue+0x70/0x70 [amdgpu]
[  279.910645]  ? __handle_mm_fault+0x478/0x5c0
[  279.910650]  do_vfs_ioctl+0xa8/0x630
[  279.910652]  ? handle_mm_fault+0xb1/0x1f0
[  279.910655]  ? __do_page_fault+0x270/0x4d0
[  279.910658]  SyS_ioctl+0x79/0x90
[  279.910662]  do_syscall_64+0x73/0x130
[  279.910666]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  279.910668] RIP: 0033:0x7f2848e1c5d7
[  279.910670] RSP: 002b:00007ffd97cd0f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  279.910672] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f2848e1c5d7
[  279.910673] RDX: 00000000010b6600 RSI: 0000000040104b10 RDI: 0000000000000003
[  279.910675] RBP: 00000000010b6600 R08: 00007ffd97cd0fd0 R09: 0000000000000000
[  279.910676] R10: 0000000001003010 R11: 0000000000000246 R12: 0000000040104b10
[  279.910677] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  279.910679] Code: c7 c8 bf 83 c0 e8 bf 0d 28 e5 48 c7 c0 ea ff ff ff eb d2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 8b 06 48 89 e5 8b 90 90 00 00 00 <39> 17 75 11 48 8b 7f 10 48 8b 47 38 e8 9d fe 9b e5 48 98 5d c3 
[  279.910759] RIP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu] RSP: ffff9c339056fd28
[  279.910760] CR2: 0000000000000000
[  279.910763] ---[ end trace 33bd6cf8014cbbaf ]---
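
A reading of the trace above, offered as an interpretation rather than a confirmed diagnosis: the fault is at kfd_dbgmgr_wave_control+0x12, CR2 is 0 and RDI is 0. On x86-64 Linux, RDI carries the first function argument, and the marked bytes in the Code line (<39> 17) decode to a 32-bit compare against the memory RDI points to. That suggests the function's first pointer argument was NULL and was read without a check. The sketch below models that pattern in plain user-space C++. All type and function names (MockDbgMgr, MockProcess, MockWaveControlInfo, wave_control_checked) are hypothetical stand-ins, not the amdgpu driver's real definitions, and the guard shown is only one way such a call could fail cleanly instead of oopsing.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-ins for illustration only; NOT the real amdgpu/KFD types.
struct MockProcess { std::uint32_t pasid; };
struct MockDbgMgr { std::uint32_t pasid; };
struct MockWaveControlInfo { const MockProcess* process; };

// Validates every pointer before the first field access. An unchecked
// version would read mgr->pasid right away and fault when mgr is null.
long wave_control_checked(const MockDbgMgr* mgr, const MockWaveControlInfo* info)
{
    if (mgr == nullptr || info == nullptr || info->process == nullptr)
        return -EINVAL;  // reject the request instead of dereferencing null

    if (mgr->pasid != info->process->pasid)
        return -EINVAL;  // caller does not match the manager's owner

    return 0;            // the real wave-control work would happen here
}

// Small self-check exercising the null path and the matching path.
void demo()
{
    MockProcess proc{42};
    MockWaveControlInfo info{&proc};
    MockDbgMgr mgr{42};

    assert(wave_control_checked(nullptr, &info) == -EINVAL);
    assert(wave_control_checked(&mgr, &info) == 0);
}
```

With a guard like this the caller gets an error code back, and any lock taken around the call can still be released on the normal return path, which would also avoid the "blocked" state described at the top of this report.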

misos1, Feb 08 '19 16:02

@misos1 Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd, Aug 19 '24 19:08

@misos1 Closing the ticket. Please feel free to reopen it if you still see the issue with the latest ROCm. Thanks!

ppanchad-amd, Oct 16 '24 15:10

Yes, I forgot. This seems to be resolved now, as does #71.

misos1, Oct 16 '24 15:10