HSA_STATUS_ERROR_OUT_OF_RESOURCES on AMD Instinct MI250X when doing allocations in a loop
Questionnaire
- Does ROCm work for you outside of Julia, e.g. C/C++/Python?
Yes.
- Post output of `rocminfo`.
Output of `rocminfo`:
ROCk module version 6.3.6 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size:
- Post output of `AMDGPU.versioninfo()` if possible.
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name │ Version │ Path │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ + │ LLD │ - │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/llvm/bin/ld.lld │
│ + │ Device Libraries │ - │ /tmp/julia-depot-FlowMatching-barthale/artifacts/b46ab46ef568406312e5f500efb677511199c2f9/amdgcn/bitcode │
│ + │ HIP │ 6.2.41134 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libamdhip64.so │
│ + │ rocBLAS │ 4.2.1 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocblas.so │
│ + │ rocSOLVER │ 3.26.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsolver.so │
│ + │ rocSPARSE │ 3.2.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsparse.so │
│ + │ rocRAND │ 2.10.5 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocrand.so │
│ + │ rocFFT │ 1.0.29 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocfft.so │
│ + │ MIOpen │ 3.2.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libMIOpen.so │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │ Name │ GCN arch │ Wavefront │ Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│ 1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64 │ 63.984 GiB │ 64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
Reproducing the bug
- Describe what's not working.
I am trying to train a neural network and run inference with it using Lux (1.22.1), AMDGPU (2.1.1) and Julia (1.12.0), but I am getting either the error HSA_STATUS_ERROR_OUT_OF_RESOURCES or Failed to successfully execute function and free resources for it. Reporting current memory usage: HIP pool used...
Ref: https://discourse.julialang.org/t/gpu-memory-issue-on-amdgpu/133560
- Provide MWE to reproduce it (if possible).
Here is a reproducer for the error HSA_STATUS_ERROR_OUT_OF_RESOURCES where I allocate arrays of random size:
using AMDGPU

function mytest(N)
    total = 0f0
    for i = 1:N
        total += sum(AMDGPU.ones(Float32, ntuple(i -> rand(1:250), 4)))
    end
    return total
end

mytest(10_000)
The error message is:
:0:rocdevice.cpp :2982: 3734245763676 us: [pid:38494 tid:0x1498153ff700] Callback:
Queue 0x149815000000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to
spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
The above is without any local settings. With the following settings in LocalPreferences.toml:
[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true
I get the error:
ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:44
[2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::HIPStream)
@ AMDGPU.Runtime.Mem /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
[3] alloc_or_retry!
@ /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
[4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::HIPStream)
Maybe the latter error is the same as https://github.com/ROCm/hip/issues/3422#issuecomment-2408574367.
The issue also persists with the current version of AMDGPU:
(examples) pkg> st AMDGPU
Status `/pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/Project.toml`
[21141c5a] AMDGPU v2.1.2
I can also reproduce the error with fixed-size arrays, but I have to run the reproducer with N = 20_000 (or rather N = 40_000) to trigger it reliably:
julia> using AMDGPU; function mytest(N)
           total = 0f0
           for i = 1:N
               total += sum(AMDGPU.ones(Float32, 128, 128, 128))
           end
           return total
       end; mytest(20_000)
:0:rocdevice.cpp :2982: 3736585454851 us: [pid:74765 tid:0x154e247ff700] Callback:
Queue 0x154e24400000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES:
The runtime failed to allocate the necessary resources. This error may also occur when the core
runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008
Available Free mem : 0 MB
[74765] signal 6 (-6): Aborted
in expression starting at REPL[1]:1
I tested several versions of AMDGPU.jl with Julia 1.12.0 (with N = 40_000) and it seems that this issue occurs only in version 2.x of AMDGPU.jl.
| AMDGPU version | Status or error message |
|---|---|
| 2.1.2 | HSA_STATUS_ERROR_OUT_OF_RESOURCES |
| 2.1.1 | HSA_STATUS_ERROR_OUT_OF_RESOURCES |
| 2.1.0 | HSA_STATUS_ERROR_OUT_OF_RESOURCES |
| 2.0.1 | HSA_STATUS_ERROR_OUT_OF_RESOURCES |
| 2.0.0 | HSA_STATUS_ERROR_OUT_OF_RESOURCES |
| 1.3.6 | success! |
| 1.3.5 | LoadError: could not load symbol "hipDeviceGet" |
| 1.3.4 | success! |
| 1.3.3 | success! |
| 1.3.2 | Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 18.1.7jl') |
| 1.3.1 | Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 18.1.7jl') |
| 1.2.8 | success! |
| 1.2.7 | success! |
| 1.2.6 | Unsatisfiable requirements detected for package GPUCompiler |
| 1.2.5 | success! |
All tests use ROCm 6.2.2 and no local settings (i.e. the default GC "eagerness").
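For reference, a sketch of one way to pin a specific version for such a test (not necessarily the exact commands I used):

```julia
using Pkg
Pkg.add(name = "AMDGPU", version = "1.3.6")  # install the version under test (1.3.6 as an example)
Pkg.status("AMDGPU")                         # confirm which version was resolved
```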
Thanks for reporting. This all uses the same version of ROCm?
Could this be due to #806? Does this still happen if you set AMDGPU.EAGER_GC[] = true?
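For example, something like this (a minimal sketch reusing the `mytest` reproducer from above):

```julia
using AMDGPU
AMDGPU.EAGER_GC[] = true   # switch to eager GC at runtime for this session
mytest(40_000)
```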
@luraess yes, I use the same version of ROCm (6.2.2); I have added the info now.
@simeonschaub for the arrays with random sizes I also tested with:
[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true
The error is then:
ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.
But I will also try the reproducer with fixed-size arrays.
@simeonschaub Indeed, just using AMDGPU.EAGER_GC[] = true and AMDGPU.jl 2.1.2 does not trigger the error with fixed-size arrays in 40_000 iterations.
But I have noticed that my long-running training run still fails 3 out of 15 times (still running) with HSA_STATUS_ERROR_OUT_OF_RESOURCES with version 1.3.6 of AMDGPU.jl. I will try the reproducer with more iterations.
> But I have noticed that my long-running training run still fails 3 out of 15 times (still running) with HSA_STATUS_ERROR_OUT_OF_RESOURCES with version 1.3.6 of AMDGPU.jl
How does your training behave with using AMDGPU.EAGER_GC[] = true and AMDGPU.jl 2.1.2?
@luraess, indeed my training run succeeds with AMDGPU.EAGER_GC[] = true and AMDGPU.jl 2.1.2 (all 15 out of 15)!
The stress tests with randomly sized or fixed-size arrays and N = 200_000 also pass.
I assume that AMDGPU.EAGER_GC[] = true is the same as setting eager_gc = true in the local preferences.
But I have always tried the eager GC in combination with hard_memory_limit = "80 %", which, as far as I know, triggers the issue https://github.com/ROCm/hip/issues/3422#issuecomment-2408574367
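For completeness, an eager-GC-only configuration (which I have not yet tested in isolation) would look like:

```toml
[AMDGPU]
# eager GC only, without hard_memory_limit (which seems to trigger the linked HIP issue)
eager_gc = true
```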
Maybe there should be a warning in the docs concerning the hard_memory_limit option? Let me know if a PR for the docs would be helpful.
I am wondering if the eager GC should become the default. Are there performance penalties to be expected?
> Maybe there should be a warning in the docs concerning the hard_memory_limit option? Let me know if a PR for the docs would be helpful.
That would be welcome.
> I am wondering if the eager GC should become the default. Are there performance penalties to be expected?
Should be related to https://github.com/JuliaGPU/AMDGPU.jl/pull/806 so I would not change this for now.
Actually, even with AMDGPU.EAGER_GC[] = true I get this error when using 8 GPUs in parallel after 200 epochs in my main use case (training a neural network):
:0:rocdevice.cpp :2982: 13499100493533 us: [pid:97767 tid:0x150bfc7ff700] Callback: Queue 0x150bf3a00000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
[97767] signal 6 (-6): Aborted
in expression starting at /pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/flow_matching_vel.jl:242
I am wondering if AMDGPU.EAGER_GC[] = true merely reduces the probability of occurrence.
I am not using more RAM or vRAM in the parallel case than in the serial case.
What do you mean by parallel case? How does this differ from the single GPU (or serial) case?
I just clarified that this error was triggered in my main use case (training a neural network), not with the reproducer.
The difference in the parallel case is that the MPI processes need to wait on each other to combine the computed gradients (the approach is called Distributed Data Parallel). The MPI implementation is ROCm-aware. I assume that no additional copy of the data is made by the MPI layer, but I don't know whether there are strictly no allocations at all.
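Roughly, the gradient exchange per optimization step looks like this (a simplified sketch, not my actual training code; the array size and the averaging step are just placeholders):

```julia
using MPI, AMDGPU

MPI.Init()
comm = MPI.COMM_WORLD

# Stand-in for a gradient array that lives on the GCD; with ROCm-aware MPI the
# device buffer is passed to MPI directly (no explicit host copy in user code).
grad = AMDGPU.rand(Float32, 1024)

MPI.Allreduce!(grad, +, comm)     # sum the gradients of all ranks, in place
grad ./= MPI.Comm_size(comm)      # average them

MPI.Finalize()
```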
Are you using one or two GCDs per module (GPU) in the parallel case? From previous experience on LUMI and MI250X, I recall that some resources may, surprisingly, be shared per module amongst GCDs (which may increase the memory pressure?)
Yes, in my tests I used 8 GCDs of a single LUMI-G node. So indeed I am using both GCDs of each of the 4 GPU modules.
For what it is worth, I made some simple benchmarks here (just a single convolution, a bias add and a ReLU; a sketch of the benchmarked operation is included below):
https://github.com/Alexander-Barth/lumi-lux-mem-issues/tree/main/benchmark
The runtime can jump from 41 ms to 71 ms (for a tensor of size 256 x 256 x 64 x 128) if I do not reserve both GCDs of a module, which is consistent with the resource sharing between GCDs that you mention.
Using the eager GC did not seem to affect the runtime for this benchmark.
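The benchmarked operation is roughly the following (a simplified sketch; kernel size, channel count and padding are assumptions, see the linked repository for the actual code):

```julia
using AMDGPU, NNlib

x = AMDGPU.rand(Float32, 256, 256, 64, 128)   # W × H × C × N input tensor
w = AMDGPU.rand(Float32, 3, 3, 64, 64)        # 3×3 kernel, 64 -> 64 channels (assumed)
b = AMDGPU.rand(Float32, 1, 1, 64, 1)         # bias, broadcast over W, H and N

y = NNlib.relu.(NNlib.conv(x, w; pad = 1) .+ b)  # convolution, bias add and ReLU
AMDGPU.synchronize()                             # wait for the GPU to finish
```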
Interesting, thanks for sharing. Could you check whether your main use case would also error when running in parallel but on only one GCD per module (but still reserving all GCDs - i.e. on 2 nodes for 8 ranks)?
Yes, I was wondering the same thing :-) I tried with the option --exclusive:
sbatch --exclusive --partition=small-g --time=48:00:00 --mem=120G --cpus-per-task=1 --ntasks=4 --nodes=1 --gpus=4 ...
I still get the same error. But I am not sure how to be certain that the 4 GCDs are "spread out" and not crammed into 2 GPU modules.
Maybe I need to allocate 8 GPUs and set ROCR_VISIBLE_DEVICES=0,2,4,6?
> Maybe I need to allocate 8 GPUs and set ROCR_VISIBLE_DEVICES=0,2,4,6?
I guess this way should work, yes
Sadly, the error persists with ROCR_VISIBLE_DEVICES=0,2,4,6 (after ~250 epochs of training this time).