
HSA_STATUS_ERROR_OUT_OF_RESOURCES on AMD Instinct MI250X when doing allocations in a loop

Open Alexander-Barth opened this issue 2 months ago • 18 comments

Questionnaire

  1. Does ROCm work for you outside of Julia, e.g. C/C++/Python?

yes

  2. Post output of rocminfo.
output of `rocminfo`
ROCk module version 6.3.6 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:  

  3. Post output of AMDGPU.versioninfo() if possible.
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/llvm/bin/ld.lld                                             │
│     +     │ Device Libraries │ -         │ /tmp/julia-depot-FlowMatching-barthale/artifacts/b46ab46ef568406312e5f500efb677511199c2f9/amdgcn/bitcode │
│     +     │ HIP              │ 6.2.41134 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libamdhip64.so                                              │
│     +     │ rocBLAS          │ 4.2.1     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocblas.so                                               │
│     +     │ rocSOLVER        │ 3.26.0    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsolver.so                                             │
│     +     │ rocSPARSE        │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsparse.so                                             │
│     +     │ rocRAND          │ 2.10.5    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocrand.so                                               │
│     +     │ rocFFT           │ 1.0.29    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocfft.so                                                │
│     +     │ MIOpen           │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libMIOpen.so                                                │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘

Reproducing the bug

  1. Describe what's not working.

I am trying to train a neural network and run inference with it using Lux (1.22.1), AMDGPU (2.1.1) and Julia (1.12.0), but I am getting either the error HSA_STATUS_ERROR_OUT_OF_RESOURCES or Failed to successfully execute function and free resources for it. Reporting current memory usage: HIP pool used... Ref: https://discourse.julialang.org/t/gpu-memory-issue-on-amdgpu/133560

  2. Provide an MWE to reproduce it (if possible).

Here is a reproducer for the HSA_STATUS_ERROR_OUT_OF_RESOURCES error, where I allocate arrays of random sizes:

using AMDGPU
function mytest(N)
  total = 0f0
  for i = 1:N
    # allocate a Float32 GPU array with 4 random dimensions (1 to 250 each), then reduce it
    total += sum(AMDGPU.ones(Float32, ntuple(_ -> rand(1:250), 4)))
  end
  return total
end
mytest(10_000)

The error message is:

:0:rocdevice.cpp            :2982: 3734245763676 us: [pid:38494 tid:0x1498153ff700] Callback: 
Queue 0x149815000000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime 
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to 
spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB

The error above occurs without any local settings. With the following settings in LocalPreferences.toml:

[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true

I have the error:

ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:44
  [2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::HIPStream)
    @ AMDGPU.Runtime.Mem /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
  [3] alloc_or_retry!
    @ /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
  [4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::HIPStream)

Maybe the latter error is the same as https://github.com/ROCm/hip/issues/3422#issuecomment-2408574367.

The issue also persists with the current version of AMDGPU:

(examples) pkg> st AMDGPU
Status `/pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/Project.toml`
  [21141c5a] AMDGPU v2.1.2

Alexander-Barth avatar Nov 07 '25 17:11 Alexander-Barth

I can also reproduce the error with fixed-size arrays, but I have to run the reproducer with N = 20_000 (or rather N = 40_000) to trigger it reliably:

julia> using AMDGPU; function mytest(N)
                      total = 0f0; 
                      for i = 1:N
                        total += sum(AMDGPU.ones(Float32,128,128,128))
                      end
                      return total
                    end; mytest(20_000)
:0:rocdevice.cpp            :2982: 3736585454851 us: [pid:74765 tid:0x154e247ff700] Callback: 
Queue 0x154e24400000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: 
The runtime failed to allocate the necessary resources. This error may also occur when the core 
runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 
Available Free mem : 0 MB

[74765] signal 6 (-6): Aborted
in expression starting at REPL[1]:1

Alexander-Barth avatar Nov 07 '25 17:11 Alexander-Barth

I tested several versions of AMDGPU.jl with Julia 1.12.0 (with N = 40_000) and it seems that this issue occurs only in version 2.x of AMDGPU.jl.

AMDGPU version   Status or error message
2.1.2            HSA_STATUS_ERROR_OUT_OF_RESOURCES
2.1.1            HSA_STATUS_ERROR_OUT_OF_RESOURCES
2.1.0            HSA_STATUS_ERROR_OUT_OF_RESOURCES
2.0.1            HSA_STATUS_ERROR_OUT_OF_RESOURCES
2.0.0            HSA_STATUS_ERROR_OUT_OF_RESOURCES
1.3.6            success!
1.3.5            LoadError: could not load symbol "hipDeviceGet"
1.3.4            success!
1.3.3            success!
1.3.2            Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 18.1.7jl')
1.3.1            Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 18.1.7jl')
1.2.8            success!
1.2.7            success!
1.2.6            Unsatisfiable requirements detected for package GPUCompiler
1.2.5            success!

All tests use ROCm 6.2.2 and no local preferences (i.e. the default GC "eagerness").

Alexander-Barth avatar Nov 10 '25 14:11 Alexander-Barth

Thanks for reporting. This all uses the same version of ROCm?

luraess avatar Nov 10 '25 14:11 luraess

Could this be due to #806? Does this still happen if you set AMDGPU.EAGER_GC[] = true?

simeonschaub avatar Nov 10 '25 14:11 simeonschaub

@luraess yes I use the same version of ROCm (6.2.2); I added the info now.

@simeonschaub for the arrays with random sizes, I also tested with:

[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true

The error is then:

ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.

But I will also try the reproducer with fixed-size arrays.

Alexander-Barth avatar Nov 10 '25 16:11 Alexander-Barth

@simeonschaub Indeed, just using AMDGPU.EAGER_GC[] = true with AMDGPU.jl 2.1.2 does not trigger the error with fixed-size arrays in 40_000 iterations.

But I have noticed that my long-running training run still fails 3 out of 15 times (still running) with HSA_STATUS_ERROR_OUT_OF_RESOURCES with version 1.3.6 of AMDGPU.jl. I will try the reproducer with more iterations.

Alexander-Barth avatar Nov 11 '25 08:11 Alexander-Barth

But I have noticed that my long-running training run still fails 3 out of 15 times (still running) with HSA_STATUS_ERROR_OUT_OF_RESOURCES with version 1.3.6 of AMDGPU.jl

How does your training behave with using AMDGPU.EAGER_GC[] = true and AMDGPU.jl 2.1.2?

luraess avatar Nov 11 '25 08:11 luraess

@luraess, indeed my training run succeeds with AMDGPU.EAGER_GC[] = true and AMDGPU.jl 2.1.2 (all 15 out of 15)!

Also, the stress test with randomly sized or fixed-size arrays with N = 200_000 passes.

I assume that AMDGPU.EAGER_GC[] = true is the same as setting eager_gc = true in the local preferences. But I always tried the eager GC in combination with hard_memory_limit = "80 %", which, as far as I know, triggers the issue https://github.com/ROCm/hip/issues/3422#issuecomment-2408574367
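
For concreteness, this is the LocalPreferences.toml variant with only the eager GC and without hard_memory_limit that I have in mind (assuming the eager_gc preference indeed maps to the same flag as AMDGPU.EAGER_GC[]):

[AMDGPU]
eager_gc = true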

Maybe there should be a warning in the docs concerning the hard_memory_limit option? Let me know if a PR for the docs would be helpful.

I am wondering if the eager GC should become the default. Are there performance penalties to be expected?

Alexander-Barth avatar Nov 14 '25 07:11 Alexander-Barth

Maybe there should be a warning in the docs concerning the hard_memory_limit option? Let me know if a PR for the docs would be helpful.

That would be welcome.

I am wondering if the eager GC should become the default. Are there performance penalties to be expected?

Should be related to https://github.com/JuliaGPU/AMDGPU.jl/pull/806 so I would not change this for now.

luraess avatar Nov 17 '25 10:11 luraess

Actually, even with AMDGPU.EAGER_GC[] = true I get this error when using 8 GPUs in parallel, after 200 epochs, for my main use case (training a neural network):

:0:rocdevice.cpp            :2982: 13499100493533 us: [pid:97767 tid:0x150bfc7ff700] Callback: Queue 0x150bf3a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008
Available Free mem : 0 MB

[97767] signal 6 (-6): Aborted
in expression starting at /pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/flow_matching_vel.jl:242

I am wondering if AMDGPU.EAGER_GC[] = true merely reduces the probability of occurrence rather than eliminating the problem.

I am not using more RAM or vRAM in the parallel case than in the serial case.

Alexander-Barth avatar Nov 24 '25 07:11 Alexander-Barth

What do you mean by parallel case? How does this differ from the single GPU (or serial) case?

luraess avatar Nov 24 '25 12:11 luraess

I just clarified that this error was triggered in my main use case (training a neural network), not with the reproducer.

The difference in the parallel case is that the MPI processes need to wait on each other to combine the computed gradients (the approach is called Distributed Data Parallel). The MPI implementation is ROCm-aware. I assume that no additional copy of the data is made by the MPI layer, but I don't know whether there are strictly no allocations at all.
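
For reference, here is a minimal sketch of the kind of gradient reduction involved; the buffer size, the MPI.Allreduce! call and the averaging are only illustrative, the real code combines the Lux gradients:

using MPI, AMDGPU
MPI.Init()
comm = MPI.COMM_WORLD
# illustrative gradient buffer living on the GPU (in the real code this comes from Lux/Zygote)
grad = AMDGPU.rand(Float32, 1024)
# with a ROCm-aware MPI, the ROCArray is passed directly: every rank contributes its
# local gradient and receives the element-wise sum in place
MPI.Allreduce!(grad, +, comm)
# average over the ranks, as in Distributed Data Parallel
grad ./= MPI.Comm_size(comm)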

Alexander-Barth avatar Nov 24 '25 13:11 Alexander-Barth

Are you using one or two GCDs per module (GPU) in the parallel case? From previous experience on LUMI and MI250X, I recall that some resources may actually, and surprisingly, be shared per module amongst GCDs (which may increase the memory pressure?).

luraess avatar Nov 24 '25 23:11 luraess

Yes, in my tests I used 8 GCDs of a single LUMI-G node, so indeed I am using both GCDs of each of the 4 GPU modules.

For what it is worth, I made some simple benchmarks here (just a single convolution, a bias add and a ReLU):

https://github.com/Alexander-Barth/lumi-lux-mem-issues/tree/main/benchmark

The runtime can jump from 41 ms to 71 ms (for a tensor of size 256 x 256 x 64 x 128) if I do not reserve both GCDs, which is consistent with the resource sharing between GCDs that you mention.

Using the eager GC did not seem to affect the runtime for this benchmark.
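
For reference, a minimal sketch of that kind of benchmark (the 3x3 kernel, the padding and the NNlib conv path are my assumptions here; the exact code is in the linked repository):

using AMDGPU, NNlib, BenchmarkTools
# tensor size as above: 256 x 256 spatial, 64 channels, batch of 128
x = AMDGPU.rand(Float32, 256, 256, 64, 128)
w = AMDGPU.rand(Float32, 3, 3, 64, 64)   # assumed 3x3 kernel, 64 -> 64 channels
b = AMDGPU.rand(Float32, 1, 1, 64, 1)    # per-channel bias
# single convolution, bias add and ReLU
conv_bias_relu(x, w, b) = relu.(conv(x, w; pad=1) .+ b)
@btime (conv_bias_relu($x, $w, $b); AMDGPU.synchronize())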

Alexander-Barth avatar Nov 25 '25 09:11 Alexander-Barth

Interesting, thanks for sharing. Could you check whether your main use case would also error when running in parallel but on only one GCD per module (but still reserving all GCDs - i.e. on 2 nodes for 8 ranks)?

luraess avatar Nov 25 '25 11:11 luraess

Yes, I was wondering the same thing :-) I tried with the --exclusive option:

sbatch --exclusive  --partition=small-g --time=48:00:00   --mem=120G --cpus-per-task=1  --ntasks=4 --nodes=1 --gpus=4 ...

I still have the same error, but I am not sure how to be certain that the 4 GCDs are "spread out" and not crammed into 2 GPU modules. Maybe I need to allocate 8 GPUs and set ROCR_VISIBLE_DEVICES=0,2,4,6?
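
One way I could check from within Julia which GCDs a rank actually sees (assuming ROCR_VISIBLE_DEVICES is exported before Julia starts):

using AMDGPU
# with ROCR_VISIBLE_DEVICES=0,2,4,6 this should list four devices, one GCD per MI250X module
@show AMDGPU.devices()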

Alexander-Barth avatar Nov 25 '25 13:11 Alexander-Barth

Maybe I need to allocate 8 GPUs and set ROCR_VISIBLE_DEVICES=0,2,4,6?

I guess this way should work, yes

luraess avatar Nov 25 '25 13:11 luraess

Sadly, the error persists with ROCR_VISIBLE_DEVICES=0,2,4,6 (after ~250 epochs of training this time).

Alexander-Barth avatar Nov 26 '25 15:11 Alexander-Barth