Exporting `HIP_VISIBLE_DEVICES=<empty>` does not disable devices

Open Jacobfaib opened this issue 3 years ago • 9 comments

The problem

Setting `CUDA_VISIBLE_DEVICES=` (an empty value) hides all devices from the CUDA runtime. `HIP_VISIBLE_DEVICES` should behave the same way.

Repro:

// code.c
#include <hip/hip_runtime.h>
#include <stdio.h>

void show_hip_errors(const char func_name[], hipError_t ret)
{
  printf("%s() -> error: %d (%s): %s\n", func_name, (int)ret, hipGetErrorName(ret), hipGetErrorString(ret));
}

int main()
{
  int count = 0;
  show_hip_errors("hipGetDeviceCount", hipGetDeviceCount(&count));
  printf("device count: %d\n", count);
  show_hip_errors("hipInit", hipInit(0));
}

Compile and run in the following scenarios:

  1. Current behavior matches expected? YES✅
$ ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess
  2. Current behavior matches expected? YES✅
$ HIP_VISIBLE_DEVICES=0 ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess
  3. Current behavior matches expected? NO❌
$ HIP_VISIBLE_DEVICES= ./code
hipGetDeviceCount() -> error: 0 (hipSuccess): hipSuccess
device count: 1
hipInit() -> error: 0 (hipSuccess): hipSuccess

Expected:

$ HIP_VISIBLE_DEVICES= ./code
hipGetDeviceCount() -> error: 38 (hipErrorNoDevice): No HIP-capable device found
device count: 0
hipInit() -> error: 8675309 (hipSomeOtherError): lorem ipsum dolor sit amet
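
On a runtime that ignores the empty value, an application can approximate the expected behavior itself. The following is a minimal sketch, not part of HIP: `env_is_set_but_empty` and `effective_device_count` are hypothetical helper names, and the sketch assumes that a set-but-empty `HIP_VISIBLE_DEVICES` should mean zero visible devices, as with `CUDA_VISIBLE_DEVICES`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not a HIP API): returns 1 when the named
 * environment variable is set but empty, 0 otherwise. */
static int env_is_set_but_empty(const char *name)
{
  assert(name != NULL);
  const char *value = getenv(name);
  return value != NULL && value[0] == '\0';
}

/* Hypothetical helper: the device count the application should act on,
 * given the count the runtime reported. Assumes an empty
 * HIP_VISIBLE_DEVICES means "no devices". */
static int effective_device_count(int runtime_count)
{
  return env_is_set_but_empty("HIP_VISIBLE_DEVICES") ? 0 : runtime_count;
}
```

An application would pass the count written by `hipGetDeviceCount` to `effective_device_count` and skip GPU work when the result is 0.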

Version info

$ hipcc --version
HIP version: 5.2.21153-02187ecf
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.3 22324 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.2.3/llvm/bin

Jacobfaib avatar Oct 16 '22 13:10 Jacobfaib

Thanks for reporting the issue. We are checking it internally.

gargrahul avatar Oct 20 '22 04:10 gargrahul

Any update on this?

Jacobfaib avatar Nov 20 '22 15:11 Jacobfaib

This issue has been fixed internally. It might take a few more days to appear in the github develop branch.

satyanveshd avatar Nov 22 '22 18:11 satyanveshd

Fix is present in the github develop branch. Please verify and close. Thanks.

satyanveshd avatar Dec 05 '22 16:12 satyanveshd

Can @jedbrown or @jczhang07 confirm? I do not have access to a HIP machine.

Jacobfaib avatar Dec 10 '22 00:12 Jacobfaib

I've only tested on rocm-5.3.3, which still has the issue.

jedbrown avatar Dec 10 '22 02:12 jedbrown

On a Crusher compute node, with `export HIP_VISIBLE_DEVICES=`, I could still run the PETSc GPU tests. I used rocm/5.2.0.

jczhang07 avatar Dec 10 '22 04:12 jczhang07

@Jacobfaib Hi, were you able to resolve this issue on the latest HIP? If so can we close this ticket?

abhimeda avatar Feb 07 '24 19:02 abhimeda

Of the versions I've tested, versions up to and including 5.4.3 have incorrect behavior. It looks to be fixed in 5.5.1 and later. (I didn't test every subversion, so it was probably fixed in 5.5.0.)

jedbrown avatar Feb 13 '24 03:02 jedbrown
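
The version reports in this thread can be folded into a small check that decides whether an application-side workaround is still needed. This is a sketch under assumptions: `empty_visible_devices_ignored` is a hypothetical helper, the caller supplies the ROCm major and minor version, 5.5.0 is treated as fixed although it was not tested, and releases older than those tested are assumed to be affected.

```c
#include <assert.h>

/* Hypothetical helper (not a HIP API): returns 1 for ROCm releases in
 * which, per the reports in this thread, an empty HIP_VISIBLE_DEVICES
 * was ignored, and 0 otherwise.
 * Reported affected: 5.2.x, 5.3.3, and up to and including 5.4.3.
 * Reported fixed: 5.5.1 and later.
 * Assumptions: 5.5.0 is fixed; anything older than 5.2 is affected. */
static int empty_visible_devices_ignored(int major, int minor)
{
  assert(major >= 0 && minor >= 0);
  return major < 5 || (major == 5 && minor <= 4);
}
```

For example, `empty_visible_devices_ignored(5, 4)` returns 1 and `empty_visible_devices_ignored(5, 5)` returns 0.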