rocm_sdk_builder
gfx1103 (7840U): HW Exception by GPU node-1
I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang
I've been racking my brain trying to figure out what's causing it: deleting sections of my code and trying to build a minimal crashing sample to provide. But sometimes it takes many iterations of the processing I'm doing before it crashes, and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying to narrow things down. My guess is that the crash results from the state of the GPU rather than from the specific instruction being executed, which makes things much trickier.
Maybe there's something much more obvious to you, or an easier way to track down the issue?
Some commands it has crashed on:
- torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True, return_complex=False).to(device)
- torch.zeros([*batch_dims, c, n - f, t]).to(device)
- torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
- torch.cuda.synchronize()
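GPU work is launched asynchronously, so the call that reports the hang is not necessarily the one that caused it (a crash reported at torch.cuda.synchronize() would fit that pattern). A minimal sketch of one way to localize the failing operation is below: it synchronizes after every step and flushes a log line before and after. The names run_synced and stft_roundtrip_stress, the tensor sizes, and the use of return_complex=True are assumptions for illustration and are not taken from the original code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def _log(message: str) -> None:
    # Flush immediately so the last line survives if the process is killed.
    print(message, flush=True)


def run_synced(
    label: str,
    op: Callable[[], T],
    sync: Callable[[], None],
    log: Callable[[str], None] = _log,
) -> T:
    """Run op, then wait for the device, logging before and after."""
    log(f"start {label}")
    started = time.perf_counter()
    result = op()
    sync()  # a pending GPU fault should surface here, not at a later call
    log(f"done  {label} ({time.perf_counter() - started:.4f}s)")
    return result


def stft_roundtrip_stress(
    iterations: int = 1000, n_fft: int = 2048, hop_length: int = 512
) -> None:
    """Hypothetical stress loop; shapes and sizes are placeholders."""
    import torch  # imported lazily so run_synced works without PyTorch

    device = torch.device("cuda")  # ROCm builds expose the GPU as "cuda"
    window = torch.hann_window(n_fft, device=device)
    x = torch.randn(2, 44100, device=device)
    for i in range(iterations):
        spec = run_synced(
            f"stft[{i}]",
            lambda: torch.stft(
                x, n_fft=n_fft, hop_length=hop_length, window=window,
                center=True, return_complex=True,
            ),
            torch.cuda.synchronize,
        )
        run_synced(
            f"istft[{i}]",
            lambda: torch.istft(
                spec, n_fft=n_fft, hop_length=hop_length, window=window,
                center=True,
            ),
            torch.cuda.synchronize,
        )
```

If the process dies, the last "start" line without a matching "done" line names the step that was in flight. Synchronizing after every step changes timing, so a hang that depends on overlapping work may become rarer or disappear under this instrumentation.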
Here's the kernel log with a few of these crashes:
2024-08-18T01:19:27.141093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:27.141108+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:19:27.141109+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:19:27.141110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:19:27.141111+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:19:27.141111+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:19:27.141112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:19:29.149118+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:29.149134+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:19:31.153110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:31.153120+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:19:31.155110+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:19:31.155120+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:19:31.155122+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:19:31.191082+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:19:31.191086+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:19:31.191087+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:19:31.194062+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:19:31.196063+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:19:31.202087+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:19:31.202089+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:19:31.202090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:19:31.202090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:19:31.202091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:19:31.202091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:19:31.202092+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:19:31.202092+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:19:31.202093+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:19:31.202094+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:19:31.202094+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:19:31.203062+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:19:31.203064+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:19:31.203065+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(3) succeeded!
2024-08-18T01:19:31.351136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:19:31.351156+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1000
2024-08-18T01:19:31.351158+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:19:31.351159+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:19:31.351160+00:00 minipc kernel: amdgpu: Resetting wave fronts (cpsch) on dev 000000000d034e53
2024-08-18T01:19:31.351160+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: no vmid pasid mapping supported
2024-08-18T01:19:31.352108+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:19:31.358069+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:19:31.359069+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:19:31.359072+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:19:31.395080+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:19:31.396064+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:19:31.396066+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:19:31.397084+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:19:31.400201+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:19:31.406237+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:19:31.406248+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:19:31.406250+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:19:31.406251+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:19:31.406253+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:19:31.406254+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:19:31.406255+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:19:31.406256+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:19:31.406257+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:19:31.406257+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:19:31.406258+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:19:31.406259+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:19:31.406260+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:19:31.408175+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:19:31.408185+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:19:31.408186+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(4) succeeded!
2024-08-18T01:20:57.766084+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:20:57.766102+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:20:57.766103+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:20:57.766104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:20:57.766104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:20:57.766105+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:20:57.766105+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:20:58.945078+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:20:59.773318+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:20:59.773338+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:01.778088+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:01.778107+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:01.780090+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:21:01.780097+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:21:01.780098+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:21:01.815205+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:21:01.816084+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:21:01.816091+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:21:01.818100+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:21:01.820479+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:21:01.825115+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:21:01.825125+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:21:01.825127+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:21:01.825129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:21:01.825130+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:21:01.825131+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:21:01.825132+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:21:01.825133+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:21:01.825134+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:21:01.825135+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:21:01.825135+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:21:01.825136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:21:01.825152+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:21:01.826104+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:21:01.826114+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:21:01.826116+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(5) succeeded!
2024-08-18T01:21:36.676703+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:36.676722+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
2024-08-18T01:21:36.676724+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
2024-08-18T01:21:36.676725+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict queue 1
2024-08-18T01:21:36.676726+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to evict process queues
2024-08-18T01:21:36.676728+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
2024-08-18T01:21:36.676739+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: remove_all_queues_mes: Failed to remove queue 0 for dev 45725
2024-08-18T01:21:37.851129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Failed to remove queue 0
2024-08-18T01:21:38.685097+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:38.685112+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:40.689195+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
2024-08-18T01:21:40.689207+00:00 minipc kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
2024-08-18T01:21:40.691116+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State
2024-08-18T01:21:40.691128+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: Dumping IP State Completed
2024-08-18T01:21:40.691129+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: MODE2 reset
2024-08-18T01:21:40.715712+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
2024-08-18T01:21:40.726112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset succeeded, trying to resume
2024-08-18T01:21:40.726118+00:00 minipc kernel: [drm] PCIE GART of 512M enabled (table at 0x000000807FD00000).
2024-08-18T01:21:40.726119+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resuming...
2024-08-18T01:21:40.728102+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: SMU is resumed successfully!
2024-08-18T01:21:40.730112+00:00 minipc kernel: [drm] DMUB hardware initialized: version=0x08003700
2024-08-18T01:21:40.735126+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
2024-08-18T01:21:40.735136+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2024-08-18T01:21:40.735138+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2024-08-18T01:21:40.735139+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
2024-08-18T01:21:40.735140+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
2024-08-18T01:21:40.735141+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
2024-08-18T01:21:40.735142+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
2024-08-18T01:21:40.735143+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
2024-08-18T01:21:40.735144+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
2024-08-18T01:21:40.735145+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
2024-08-18T01:21:40.735146+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
2024-08-18T01:21:40.735147+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
2024-08-18T01:21:40.735163+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
2024-08-18T01:21:40.737112+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
2024-08-18T01:21:40.737122+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
2024-08-18T01:21:40.737123+00:00 minipc kernel: amdgpu 0000:c3:00.0: amdgpu: GPU reset(6) succeeded!
2024-08-18T01:41:34.118180+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
2024-08-18T01:49:08.684127+00:00 minipc kernel: workqueue: kfd_process_wq_release [amdgpu] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
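To line up crashes in the application with resets in a log like the one above, the reset cycles can be summarized with a short script. This is a rough sketch that assumes every line starts with an ISO 8601 timestamp followed by a space, as in the excerpt; the function name summarize_resets and the returned dictionary keys are made up for illustration.

```python
import re
from datetime import datetime

RESET_BEGIN = "GPU reset begin!"
RESET_DONE = re.compile(r"GPU reset\((\d+)\) succeeded!")
MES_TIMEOUT = "MES failed to respond to msg=REMOVE_QUEUE"


def summarize_resets(lines):
    """Return one dict per completed GPU reset found in the log lines.

    Each dict holds the driver's reset counter, the time the reset began,
    its duration in seconds, and the number of MES timeouts logged since
    the previous completed reset.
    """
    resets = []
    begin = None
    timeouts = 0
    for line in lines:
        # Split "<timestamp> <rest of line>" at the first space.
        stamp, _, message = line.strip().partition(" ")
        if MES_TIMEOUT in message:
            timeouts += 1
        elif RESET_BEGIN in message:
            begin = datetime.fromisoformat(stamp)
        else:
            done = RESET_DONE.search(message)
            if done and begin is not None:
                end = datetime.fromisoformat(stamp)
                resets.append(
                    {
                        "counter": int(done.group(1)),
                        "begin": begin,
                        "seconds": (end - begin).total_seconds(),
                        "mes_timeouts": timeouts,
                    }
                )
                begin, timeouts = None, 0
    return resets
```

Passing an open file object for the saved kernel log to summarize_resets yields one entry per reset. For the first cycle in the excerpt above, that is counter 3 with three MES timeouts and a duration of roughly four seconds from "GPU reset begin!" to "GPU reset(3) succeeded!".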