HIP
Exporting `HIP_VISIBLE_DEVICES=<empty>` does not disable devices
The problem
Setting `CUDA_VISIBLE_DEVICES=` (set, but to the empty string) hides all devices from the CUDA runtime. `HIP_VISIBLE_DEVICES` should behave the same way.
Repro:
// code.c
#include <hip/hip_runtime.h>
#include <stdio.h>

void show_hip_errors(const char func_name[], hipError_t ret)
{
    printf("%s() -> error: %d (%s): %s\n", func_name, (int)ret, hipGetErrorName(ret), hipGetErrorString(ret));
}

int main()
{
    int count = 0;

    show_hip_errors("hipGetDeviceCount", hipGetDeviceCount(&count));
    printf("device count: %d\n", count);
    show_hip_errors("hipInit", hipInit(0));
}
Compile and run in the following scenarios:
- Current behavior matches expected? YES✅
$ ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess
- Current behavior matches expected? YES✅
$ HIP_VISIBLE_DEVICES=0 ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess
- Current behavior matches expected? NO❌
$ HIP_VISIBLE_DEVICES= ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess
expected:
$ HIP_VISIBLE_DEVICES= ./code
hipGetDeviceCount() -> error: 38 (hipErrorNoDevice): No HIP-capable device found
device count: 0
hipInit() -> error: 8675309 (hipSomeOtherError): lorem ipsum dolor sit amet
Version info
$ hipcc --version
HIP version: 5.2.21153-02187ecf
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.3 22324 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.2.3/llvm/bin
Thanks for reporting the issue. We are checking it internally.
Any update on this?
This issue has been fixed internally. It might take a few more days to appear in the github develop branch.
Fix is present in the github develop branch. Please verify and close. Thanks.
Can @jedbrown or @jczhang07 confirm? I do not have access to a HIP machine.
I've only tested on rocm-5.3.3, which still has the issue.
On a Crusher compute node, with `export HIP_VISIBLE_DEVICES=`, I could still run PETSc GPU tests. I used rocm/5.2.0.
@Jacobfaib Hi, were you able to resolve this issue on the latest HIP? If so can we close this ticket?
Of the versions I've tested, versions up to and including 5.4.3 have incorrect behavior. It looks to be fixed in 5.5.1 and later. (I didn't test every subversion, so it was probably fixed in 5.5.0.)