[SYCL][HIP][E2E] Pre-commit workflow test-suite can't find HIP device

Open Bensuo opened this issue 1 year ago • 5 comments

Pre-commit workflows for PRs are failing when running the E2E test-suite for the HIP backend as it fails to detect a HIP device when starting the tests. First seen affecting #10216 but also seen in other workflow runs on other PRs.

Some links to failed runs:

  • https://github.com/intel/llvm/actions/runs/5693890961/job/15434520480
  • https://github.com/intel/llvm/actions/runs/5693873820/job/15435570348?pr=10216

Bensuo, Jul 28 '23 17:07

We are seeing that our AMDGPU runners become unusable sometimes. When it happens I see the following:

# /opt/rocm-4.5.1/bin/rocminfo
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@intel/llvm-reviewers-cuda, @npmiller, any ideas what the reason might be and whether there are any preventive measures we can take (like GPU reset for Intel GPUs maybe)?

aelovikov-intel, Jul 28 '23 17:07

I wonder if ROCm may be upgraded to 5.x and there are two AMD GPUs for testing.

jinz2014, Jul 28 '23 18:07

> I wonder if ROCm may be upgraded to 5.x

+ @bader

> there are two AMD GPUs for testing

What do you mean by that?

aelovikov-intel, Jul 28 '23 18:07

> I wonder if ROCm may be upgraded to 5.x
>
> + @bader

@aelovikov-intel, I think @AerialMantis or @npmiller can answer this question.

> GPU reset for Intel GPUs maybe

GPU reset is not a preventive measure for issues like this. It helps recover the GPU state after something bad has happened, but it doesn't prevent the GPU driver from running out of resources.

bader, Jul 28 '23 18:07

We've had the "ROCk module is loaded" issue happen as well, but we haven't found a good preventative measure for it either. When it happens, recovery usually requires either reloading the kernel module or rebooting.

And yes, it's fine to bump to ROCm 5.x, but I believe this was done already. Sorry about the delayed reply.

npmiller, Sep 28 '23 12:09
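
Since nobody in the thread found a preventive measure, one option is to at least detect the broken state before the E2E suite starts. Below is a minimal sketch of such a pre-flight check, not something the CI is known to use. It assumes the `rocminfo` path from the log above, and the helper names `runner_looks_healthy` and `check_runner` are hypothetical. Because the exit code `rocminfo` returns in this failure mode is not shown in the thread, the check looks at both the return code and the error strings from the log.

```python
import subprocess

# Assumption: path taken from the log in this thread; adjust for the
# ROCm version actually installed on the runner.
ROCMINFO = "/opt/rocm-4.5.1/bin/rocminfo"


def runner_looks_healthy(returncode: int, output: str) -> bool:
    """Return False if rocminfo exited non-zero or printed an HSA error."""
    if returncode != 0:
        return False
    # Both strings appear in the failing output quoted in the thread.
    return ("HSA_STATUS_ERROR" not in output
            and "hsa api call failure" not in output)


def check_runner(rocminfo_path: str = ROCMINFO, timeout_s: float = 60.0) -> bool:
    """Run rocminfo once and classify the runner as healthy or not."""
    try:
        proc = subprocess.run(
            [rocminfo_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung driver: treat both as unhealthy.
        return False
    return runner_looks_healthy(proc.returncode, proc.stdout + proc.stderr)
```

A workflow step could call `check_runner()` first and skip the HIP tests, or flag the runner for a kernel module reload or reboot, when it returns `False`. This only reports the problem; it does not stop the driver from running out of resources.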