Device assert broken on gfx1030 with "Bus error", or hanging after synchronize
If, for example, I add `assert(0);` to the kernel in the vectorAdd sample (https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/vectorAdd/vectoradd_hip.cpp) via
```diff
diff --git a/vectorAdd/vectoradd_hip.cpp b/vectorAdd/vectoradd_hip.cpp
index 0362c8a..a20bd2c 100644
--- a/vectorAdd/vectoradd_hip.cpp
+++ b/vectorAdd/vectoradd_hip.cpp
@@ -47,7 +47,7 @@ __global__ void
 vectoradd_float(float* __restrict__ a, const float* __restrict__ b, const float* __restrict__ c, int width, int height)
 {
-
+assert(0);
 int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
 int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
```
then on MI250X I get the expected behavior:
```
......
vectoradd_hip.cpp:50: void vectoradd_float(float *__restrict, const float *__restrict, const float *__restrict, int, int): Device-side assertion `0' failed.
vectoradd_hip.cpp:50: void vectoradd_float(float *__restrict, const float *__restrict, const float *__restrict, int, int): Device-side assertion `0' failed.
:0:rocdevice.cpp :2778: 1891319054196 us: 83888: [tid:0x7fd1f8497700] Callback: Queue 0x7fcfcce00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
make: *** [Makefile:24: test] Aborted (core dumped)
```
etc.
However on gfx1030, using Ubuntu 22.04.3 and ROCm 5.7.1 (an officially supported combination), I get:
```
System minor 3
System major 10
agent prop name AMD Radeon PRO W6800
hip Device prop succeeded
Bus error
```
i.e. the assert diagnostic is removed and replaced with just "Bus error".
I am also using ROCm 5.7.1 on Ubuntu 22.04.3 with gfx1030; the only difference is that my card is an RX 6900 XT, and I cannot reproduce the issue. I will try to get access to a Radeon PRO W6800. In the meantime, can you also confirm the kernel-mode driver version you are using:
```
~/HIP-Examples/vectorAdd$ apt show amdgpu-dkms
Package: amdgpu-dkms
Version: 1:6.2.4.50701-1664922.22.04
Priority: optional
Section: misc
Maintainer: Advanced Micro Devices (AMD) [email protected]
Installed-Size: 443 MB
Provides: rock-dkms
Depends: dkms (>= 1.95), libc-dev | libc6-dev, autoconf, automake, initramfs-tools, shim-signed, amdgpu-dkms-firmware (= 1:6.2.4.50701-1664922.22.04)
Conflicts: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Breaks: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Replaces: rock-dkms (<< 1:6.2.4.50701-1664922.22.04)
Download-Size: 10.2 MB
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy/main amd64 Packages
Description: amdgpu driver in DKMS format.
```
Thanks
```
$ apt show amdgpu-dkms
Package: amdgpu-dkms
Version: 1:6.1.5.50601-1649308.22.04
Priority: optional
Section: misc
Maintainer: Advanced Micro Devices (AMD) [email protected]
Installed-Size: 441 MB
Provides: rock-dkms
Depends: dkms (>= 1.95), libc-dev | libc6-dev, autoconf, automake, initramfs-tools, shim-signed, amdgpu-dkms-firmware (= 1:6.1.5.50601-1649308.22.04)
Conflicts: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Breaks: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Replaces: rock-dkms (<< 1:6.1.5.50601-1649308.22.04)
Download-Size: 10.1 MB
APT-Manual-Installed: yes
APT-Sources: https://repo.radeon.com/amdgpu/5.6.1/ubuntu jammy/main amd64 Packages
Description: amdgpu driver in DKMS format.
```
Thanks @JackAKirk. I think this is a slightly older version of the driver, corresponding to 5.6, so you can try upgrading it. But first, can you also check whether PCIe atomics are supported on the gfx1030 system, and whether there is any difference compared to the MI250 system in that respect? I think that device assert is one of the calls that requires PCIe atomics in order to work correctly.
I've also verified that the same error occurs if I use ROCm 5.6.1 with that 5.6.1 driver. I can't easily check the 5.7.1 driver. I think the gfx1030 system doesn't support PCIe atomics, but I need to check. That won't tell me everything I need, however. What I really want to know is:
- Does assert require PCIe atomics on all AMD cards, without exception? If so, is this for llvm.trap (SIGABRT)? And if so, which atomic instruction does it need?
- If assert doesn't require PCIe atomics on some AMD cards, then which ones?

In general, are there any AMD docs on the PCIe atomics requirements for parts of the HIP runtime?
Thanks.
@JackAKirk please see https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html. The PCIe atomics requirement for ROCm is listed under "CPU support", and it applies to all supported AMD cards.
Device assert (like printf() and device-side malloc) is implemented on top of a hostcall service, which in turn requires the system to support PCIe atomics. (Although for printf() specifically, a non-hostcall implementation was introduced in 5.7: https://rocm.docs.amd.com/en/docs-5.7.0/release.html#)
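For illustration, a minimal program exercising these hostcall-backed features might look like the sketch below (my own example, not from the thread; the comments restate the dependency described above):
```cpp
#include <hip/hip_runtime.h>
#include <cassert>
#include <cstdio>

__global__ void hostcall_features() {
  printf("printf via hostcall\n"); // hostcall-backed (until the 5.7 non-hostcall path)
  void* p = malloc(16);            // device-side malloc: hostcall-backed
  assert(p != nullptr);            // device assert: hostcall-backed
  free(p);
}

int main() {
  hostcall_features<<<1, 1>>>();
  // Host-side sync so the hostcall buffers are serviced before exit.
  static_cast<void>(hipDeviceSynchronize());
}
```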
Thanks very much for this information.
Could you confirm from the lspci output that this hardware does not support PCIe atomics: lspci_root_amdgpu-4.txt
I think the relevant part is probably this:
```
PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] (prog-if 00 [Normal decode])
	DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
		 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
		 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS-
		 LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd-
		 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
```
Thanks
Hi @JackAKirk, yes I think the relevant part is this: `AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-`. Can you also provide the output of `lspci -t`?
I can't get lspci -t on that machine. But here is the output on another machine where I have the same issue:
```
$ lspci -t -vv
-[0000:00]-+-00.0 Intel Corporation Device 4660
           +-01.0-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800]
           |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT]
           +-02.0 Intel Corporation AlderLake-S GT1
           +-04.0 Intel Corporation Alder Lake Innovation Platform Framework Processor Participant
           +-06.0-[04]----00.0 Toshiba Corporation XG6 NVMe SSD Controller
           +-08.0 Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator
           +-14.0 Intel Corporation Device 7ae0
           +-14.2 Intel Corporation Device 7aa7
           +-15.0 Intel Corporation Device 7acc
           +-16.0 Intel Corporation Device 7ae8
           +-17.0 Intel Corporation Device 7ae2
           +-1c.0-[05]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1f.0 Intel Corporation Device 7a88
           +-1f.3 Intel Corporation Device 7ad0
           +-1f.4 Intel Corporation Device 7aa3
           +-1f.5 Intel Corporation Device 7aa4
           \-1f.6 Intel Corporation Ethernet Connection (17) I219-LM
```
Hi @JackAKirk, on this new machine can you now check the atomics for 00:01.0:
```
sudo lspci -s 00:01.0 -vv | grep AtomicOpsCap
```
I expect it will show something like `AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-`, indicating that the atomics are disabled. If that is not the case, can you please attach the full verbose lspci output, as for the previous machine.
This is the output of the command:
```
$ lspci -s 00:01.0 -vv | grep AtomicOpsCap
	AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS+
```
See the full output attached: [lspci_w6800.txt](url)
I'm pretty sure this indicates that the card supports PCIe atomics, right? So in that case I don't think that can be the issue.
Do you have any unit testing set up for kernel asserts on the w6800?
Can you please try the following as well:
- Enable logging by setting AMD_LOG_LEVEL=4 and re-run the test (see the example after this list). Do the logs show any error related to missing PCIe atomics, for example `Pcie atomics not enabled, hostcall not supported`?
- Is there any atomics-related error in the dmesg output after running the test?
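For example, something like this (the binary name is illustrative):
```sh
# Capture the verbose runtime log, then scan it for atomics/hostcall messages.
AMD_LOG_LEVEL=4 ./vectoradd 2>&1 | tee amd_log.txt
grep -i -e atomic -e hostcall amd_log.txt
```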
Here is the log:
I don't see any errors relating to missing PCIe atomics.
```
$ dmesg -wH
dmesg: read kernel buffer failed: Operation not permitted
```
Can you try running dmesg with sudo? Also, can you post the output of this:
```
grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
```
Sure: dmesg.txt
```
# grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1
```
Thanks
@JackAKirk there is no indication of missing PCIe atomics in the logs, as far as I can see.
- Can you try calling printf from the kernel? Does that show the same problem?
- Can you run the test under a debugger (see the sketch after this list)? When it crashes, please get the backtrace and post it here. It won't contain much information to begin with, since the debug symbols are missing, but it can still give some pointers. As a next step we might need a debug build.
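Something along these lines should work (the binary name is illustrative; rocgdb ships with ROCm):
```sh
# Run the failing test under ROCgdb and print a backtrace after the crash.
rocgdb -ex run -ex bt --args ./vectoradd
```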
Do you have testing for printf/kernel asserts on w6800? Does it work for you?
@JackAKirk printf is part of the unit tests (https://github.com/ROCm/hip-tests/tree/develop/catch/unit/printf), and these are quite well tested on gfx1030. Although I do not know how frequently each particular card is used, and this changes over time, the w6800 is on the list of officially supported hardware (https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html), so I do believe the function has been well tested there. For me, both printf and device assert work on an RX 6900 XT (also gfx1030).
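If it helps, assuming the usual CMake/CTest setup for hip-tests, the printf tests can be selected from the build directory with something like:
```sh
ctest -R Unit_printf
```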
Thanks for the info. I've run hip-tests on the w6800. The ones with printf in their name pass:
```
        Start 1067: Unit_printf_flags
1067/1268 Test #1067: Unit_printf_flags ............................................................   Passed    0.19 sec
        Start 1068: Unit_printf_specifier
1068/1268 Test #1068: Unit_printf_specifier ........................................................   Passed    0.18 sec
```
The only failing tests are:
```
The following tests FAILED:
	1060 - Unit_hipIpcOpenMemHandle_Negative_Open_In_Two_Contexts_Same_Device (Failed)
	1061 - Unit_hipIpcGetMemHandle_Positive_Unique_Handles_Separate_Allocations (Failed)
	1062 - Unit_hipIpcGetMemHandle_Positive_Unique_Handles_Reused_Memory (Failed)
	1065 - Unit_hipIpcCloseMemHandle_Positive_Reference_Counting (Failed)
	1166 - Unit_hipIpcMemAccess_Semaphores (Timeout)
	1167 - Unit_hipIpcMemAccess_ParameterValidation (Failed)
```
and no printf tests are among those that are skipped.
Hi @JackAKirk, so to confirm: if you replace assert() with printf() in the vectoradd test, does it go through, or does it still fail with a bus error?
Yeah, if I replace `assert(0);` with `printf("hi");` I also get "Bus error".
Hi @JackAKirk, but at the same time the printf unit tests pass, which is interesting. I would recommend we start with the printf unit test as a reference. The code of the test is the following; you can try compiling it as a standalone outside the unit tests, using the same flags as the failing test.
```cpp
#include <hip/hip_runtime.h>

__global__ void test_kernel() {
  printf("%08d\n", 42);
  printf("%08i\n", -42);
  printf("%08u\n", 42);
  printf("%08g\n", 123.456);
  printf("%0+8d\n", 42);
  printf("%+d\n", -42);
  printf("%+08d\n", 42);
  printf("%-8s\n", "xyzzy");
  printf("% i\n", -42);
  printf("%-16.8d\n", 42);
  printf("%16.8d\n", 42);
}

int main() {
  test_kernel<<<1, 1>>>();
  static_cast<void>(hipDeviceSynchronize());
}
```
If it works as a standalone, you can then try to strip down the failing test to match it. For example, the unit test uses hipDeviceSynchronize after the kernel but the vectorAdd test does not; does the vectorAdd test still fail with hipDeviceSynchronize? The unit test launches 1 thread; try to do the same in vectorAdd, and so on. In this way we can likely narrow it down.
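For instance, a stripped-down variant might look like this (my own illustrative reduction, not the actual sample: single thread, minimal kernel body, explicit sync):
```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Reduced stand-in for vectoradd_float: same launch machinery, minimal body.
__global__ void vectoradd_probe(float* __restrict__ a) {
  printf("hi\n");
  a[0] = 0.0f;
}

int main() {
  float* a = nullptr;
  static_cast<void>(hipMalloc(&a, sizeof(float)));
  vectoradd_probe<<<1, 1>>>(a);
  static_cast<void>(hipDeviceSynchronize());
  static_cast<void>(hipFree(a));
}
```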
Hi @iassiour, your example with printf passes. However, if I add an assert it hangs. Can you try this to reproduce it:
```cpp
#include <hip/hip_runtime.h>
#include <cassert> // for assert() in device code

__global__ void test_kernel() {
  assert(0);
}

int main() {
  test_kernel<<<1, 1>>>();
  static_cast<void>(hipDeviceSynchronize());
}
```
However, if I comment out the `static_cast<void>(hipDeviceSynchronize());` line, it doesn't hang. So the problem seems to be calling a device sync after an assert. Do your unit tests cover this?
Hi @JackAKirk, the example with the assert works for me. I think that with the current implementation both assert and printf require a synchronization on the host side before exiting (either hipDeviceSynchronize or, implicitly, a blocking call like hipMemcpy); otherwise the program exits too soon. In other words, removing hipDeviceSynchronize() just hides the issue.
I would suggest focusing on printf testing for the time being. There we know that one test succeeds and one fails with the bus error. Can we narrow down what is different in the failing test that causes the error? This may shed some light on what happens with assert as well.
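A minimal sketch of the implicit-sync variant mentioned above (my own example, not from the unit tests):
```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void test_kernel(int* out) {
  printf("hi\n");
  *out = 1;
}

int main() {
  int host = 0;
  int* dev = nullptr;
  static_cast<void>(hipMalloc(&dev, sizeof(int)));
  test_kernel<<<1, 1>>>(dev);
  // The blocking copy synchronizes with the kernel, so the printf output
  // should be flushed even without an explicit hipDeviceSynchronize().
  static_cast<void>(hipMemcpy(&host, dev, sizeof(int), hipMemcpyDeviceToHost));
  static_cast<void>(hipFree(dev));
  return host == 1 ? 0 : 1;
}
```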
Do you think you could arrange for someone to test this on a w6800, to check whether you can reproduce the hanging issue with assert?
Hi @JackAKirk, I managed to reproduce the hanging issue with assert on a w6800 machine on Windows. I will create an internal ticket to investigate and will come back as soon as I have more details.