Anton Smirnov
You can also access the size of the array within the kernel and compute `i, j` indices from there. And if you do only element-wise operations, you can just index with...
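A minimal CPU-runnable sketch of the idea (the kernel name `scale!` and the scaling operation are just illustrative): the kernel takes a single linear global index, recovers `i, j` from `size(x)`, and for purely element-wise operations the linear index alone would suffice.

```julia
using KernelAbstractions

@kernel function scale!(y, x, α)
    idx = @index(Global, Linear)
    # Recover Cartesian indices from the linear index via the array size.
    n_rows = size(x, 1)
    i = (idx - 1) % n_rows + 1
    j = (idx - 1) ÷ n_rows + 1
    @inbounds y[i, j] = α * x[i, j]
    # For element-wise ops, `y[idx] = α * x[idx]` would work directly.
end

x = rand(Float32, 4, 8)
y = similar(x)
backend = get_backend(x)  # CPU() for plain Arrays
scale!(backend, 64)(y, x, 2f0; ndrange=length(x))
KernelAbstractions.synchronize(backend)
```

The same kernel runs unmodified on GPU backends (e.g. `ROCArray` inputs) since the backend is derived from the arrays.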
The original error looks like: https://github.com/ROCm/clr/issues/36 I've seen this with a debug ROCm build.
~~@vchuravy not sure about the CPU errors (regarding `@index(Local)`). Any idea?~~ Update: https://github.com/JuliaGPU/KernelAbstractions.jl/issues/218#issuecomment-783486593
https://github.com/JuliaGPU/AMDGPU.jl/pull/729 was closed because, in my testing, I didn't see a major performance improvement from the warp reduction, and in some cases (like [fused softmax](https://github.com/pxl-th/NNop.jl/blob/master/src/softmax.jl)) it was actually slower. So...
> Apologies if my assessment is not entirely accurate as I am not intimately familiar with all the internal intricacies of `KernelAbstractions.jl`. I am implementing reductions for a different project...
There are some issues with multi-GPU setups, not sure if this is one of them: https://github.com/JuliaGPU/AMDGPU.jl/issues/648 You can disable the multi-GPU tests with `HIP_VISIBLE_DEVICES=0` to see if the hangs disappear....
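If you want to do the same from within Julia rather than on the shell, the environment variable has to be set before AMDGPU initializes HIP — a sketch, assuming a ROCm system:

```julia
# Restrict ROCm to the first device; equivalent to launching Julia
# with `HIP_VISIBLE_DEVICES=0` on the shell. Must run before `using AMDGPU`.
ENV["HIP_VISIBLE_DEVICES"] = "0"

using AMDGPU
@show AMDGPU.devices()  # should now list a single device
```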
Without profiling, I suspect this is because the VRAM is not freed in time (the GC does not know about the GPU memory space). This creates memory pressure, and when it...
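One way to side-step that pressure is to return VRAM to the pool eagerly instead of waiting for the GC to collect the host-side wrapper — a sketch using `AMDGPU.unsafe_free!` (the workload here is just illustrative):

```julia
using AMDGPU

x = AMDGPU.rand(Float32, 1024, 1024)
y = x .+ 1f0

# Free `x`'s VRAM immediately; the GC would otherwise reclaim it
# only whenever it happens to run, since it doesn't track GPU memory.
AMDGPU.unsafe_free!(x)
```

After `unsafe_free!` the array must not be used again; it only makes sense once you know the buffer is no longer needed.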
These memory limit parameters only control how soon the GC is triggered **manually** under the hood, so they won't help you avoid GC calls. And [here](https://github.com/JuliaGPU/GPUArrays.jl/pull/550#issuecomment-2225915738) you can see that those GC...
Closing this, as we now have a caching allocator that does not rely on the GC, so allocations/deallocations are very fast: https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator
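Usage looks roughly like the following (a sketch from memory of the linked docs — the `AllocCache`/`@cached` names are my recollection of the GPUArrays.jl API and may differ in detail; the loop body is illustrative):

```julia
using AMDGPU, GPUArrays

# Allocations inside `@cached` are served from the cache on repeat
# iterations instead of going through the GC-driven pool each time.
cache = GPUArrays.AllocCache()
for step in 1:100
    GPUArrays.@cached cache begin
        x = AMDGPU.rand(Float32, 1024, 1024)
        y = x .* 2f0
    end
end

# Release the cached buffers once the hot loop is done.
GPUArrays.unsafe_free!(cache)
```

This fits iterative workloads (e.g. training loops) where each iteration allocates the same shapes over and over.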
@maleadt how does CUDA make an array (`cuda_pointer_ret = CuArray(pointer_ret)`) accessible on multiple GPUs in this case?