Alexander Barth
I tested several versions of AMDGPU.jl with Julia 1.12.0 (with N = 40_000) and it seems that this issue occurs only in version 2.x of AMDGPU.jl. | AMDGPU version |...
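For context, a minimal sketch of the kind of allocation stress test I mean (the array sizes and loop body here are placeholders, not the actual reproducer):

```
using AMDGPU

# Sketch of an allocation stress test: repeatedly allocate ROCArrays,
# either with a fixed size or a random size per iteration.
# The sizes below are placeholders, not the ones from the actual reproducer.
function stress_test(N; fixed = true)
    for _ in 1:N
        n = fixed ? 1_000_000 : rand(1:1_000_000)
        a = AMDGPU.ones(Float32, n)
        sum(a)   # touch the array so the allocation is actually used
    end
end

stress_test(40_000)   # N = 40_000 as in the tests above
```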
@luraess yes, I use the same version of ROCm (6.2.2); I added the info now. @simeonschaub for the arrays with random sizes I also tested with:
```
[AMDGPU]
hard_memory_limit =...
```
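As a side note, such a preference can also be set from Julia with Preferences.jl instead of editing LocalPreferences.toml by hand; the value and its format below are only placeholders (the limit I actually used is cut off above):

```
using Preferences, AMDGPU

# Placeholder value and format; the actual limit used above is not shown.
set_preferences!(AMDGPU, "hard_memory_limit" => "8 GiB"; force = true)

# A Julia restart is typically needed before the new preference takes effect.
```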
@simeonschaub Indeed, just using `AMDGPU.EAGER_GC[] = true` and AMDGPU.jl 2.1.2 does not trigger the error with fixed-size arrays in 40_000 iterations. But I have noticed that my long-running...
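For completeness, the workaround only needs the flag to be flipped once, before the allocations start; a minimal sketch:

```
using AMDGPU

# Workaround discussed above: enable eager GC of GPU allocations.
# Set this before the long-running allocation loop / training starts.
AMDGPU.EAGER_GC[] = true
```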
@luraess, indeed my training run succeeds with `AMDGPU.EAGER_GC[] = true` and AMDGPU.jl 2.1.2 (all 15 out of 15)! Also the stress test with randomly sized or fixed-size arrays...
Actually, even with `AMDGPU.EAGER_GC[] = true` I get this error when using 8 GPUs in parallel after 200 epochs in my main use case (training a neural network):
```
:0:rocdevice.cpp :2982:...
```
I just clarified that this error was triggered in my main use case (training a neural network), not with the reproducer. The difference with the parallel case is that the...
Yes, in my tests I used 8 GCDs of a single LUMI-G node, so indeed I am using both GCDs of each of the 4 GPUs. For what it is worth...
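For reference, this is roughly how I check which GCDs a Julia process sees and bind it to one of them; the use of `SLURM_LOCALID` here is an assumption about how the tasks are mapped to devices:

```
using AMDGPU

# Each MI250X module exposes two GCDs, so a full LUMI-G node shows 8 devices.
for (i, dev) in enumerate(AMDGPU.devices())
    println(i, " => ", dev)
end

# Bind this task to one GCD; using SLURM_LOCALID is only an assumption
# about how the tasks are mapped to devices in this job.
local_id = parse(Int, get(ENV, "SLURM_LOCALID", "0"))
AMDGPU.device!(AMDGPU.devices()[local_id + 1])
```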
yes, I was wondering the same thing :-) I tried with the option `--exclusive`:
```
sbatch --exclusive --partition=small-g --time=48:00:00 --mem=120G --cpus-per-task=1 --ntasks=4 --nodes=1 --gpus=4 ...
```
I still have the...
Sadly, the error persists with `ROCR_VISIBLE_DEVICES=0,2,4,6` (after ~250 epochs of training this time).
The largest array that I can allocate with `ones` seems to be `4*1024^3-1024`:
```
julia> using AMDGPU; a = AMDGPU.ones(UInt8,4*1024^3-1024);

julia> using AMDGPU; a = AMDGPU.ones(UInt8,4*1024^3-1023);
ERROR: HIPError(code hipErrorInvalidConfiguration, invalid...
```
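A quick sanity check on where that limit sits: it is just below 4 GiB (2^32 bytes), which makes me suspect a 32-bit size is involved somewhere (only a guess on my part):

```
julia> (4*1024^3 - 1024) / 2^30   # largest size that still works, in GiB
3.9999990463256836

julia> 4*1024^3 == 2^32           # i.e. the limit sits 1024 bytes below 4 GiB
true
```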