Anton Smirnov
You can also access the size of the array within the kernel and compute `i, j` indices from there. And if you do only element-wise operations, you can just index with...
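A minimal CPU-runnable sketch of the idea (the kernel name `scale!` and the scaling operation are just illustrative): the kernel takes a single linear global index, recovers `i, j` from `size(x)`, and for purely element-wise operations the linear index alone would suffice.

```julia
using KernelAbstractions

@kernel function scale!(y, x, α)
    idx = @index(Global, Linear)
    # Recover Cartesian indices from the linear index via the array size.
    n_rows = size(x, 1)
    i = (idx - 1) % n_rows + 1
    j = (idx - 1) ÷ n_rows + 1
    @inbounds y[i, j] = α * x[i, j]
    # For element-wise ops, `y[idx] = α * x[idx]` would work directly.
end

x = rand(Float32, 4, 8)
y = similar(x)
backend = get_backend(x)  # CPU() for plain Arrays
scale!(backend, 64)(y, x, 2f0; ndrange=length(x))
KernelAbstractions.synchronize(backend)
```

The same kernel runs unmodified on GPU backends (e.g. `ROCArray` inputs) since the backend is derived from the arrays.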
The original error looks like: https://github.com/ROCm/clr/issues/36 I've seen this with a debug ROCm build.
~~@vchuravy not sure about the CPU errors (regarding `@index(Local)`). Any idea?~~ Update: https://github.com/JuliaGPU/KernelAbstractions.jl/issues/218#issuecomment-783486593
https://github.com/JuliaGPU/AMDGPU.jl/pull/729 was closed because, in my testing, I didn't see a major performance improvement from the warp reduction, and in some cases (like [fused softmax](https://github.com/pxl-th/NNop.jl/blob/master/src/softmax.jl)) it was actually slower. So...
> Apologies if my assessment is not entirely accurate as I am not intimately familiar with all the internal intricacies of `KernelAbstractions.jl`. I am implementing reductions for a different project...
There are some issues with multi-GPU setups, not sure if this is one of them: https://github.com/JuliaGPU/AMDGPU.jl/issues/648 You can disable the multi-GPU tests with `HIP_VISIBLE_DEVICES=0` to see if the hangs disappear....
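If you want to do the same from within Julia rather than on the shell, the environment variable has to be set before AMDGPU initializes HIP — a sketch, assuming a ROCm system:

```julia
# Restrict ROCm to the first device; equivalent to launching Julia
# with `HIP_VISIBLE_DEVICES=0` on the shell. Must run before `using AMDGPU`.
ENV["HIP_VISIBLE_DEVICES"] = "0"

using AMDGPU
@show AMDGPU.devices()  # should now list a single device
```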
Without profiling, I suspect this is because the VRAM is not freed in time (the GC does not know about the GPU memory space). This creates memory pressure, and when it...
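One way to side-step that pressure is to return VRAM to the pool eagerly instead of waiting for the GC to collect the host-side wrapper — a sketch using `AMDGPU.unsafe_free!` (the workload here is just illustrative):

```julia
using AMDGPU

x = AMDGPU.rand(Float32, 1024, 1024)
y = x .+ 1f0

# Free `x`'s VRAM immediately; the GC would otherwise reclaim it
# only whenever it happens to run, since it doesn't track GPU memory.
AMDGPU.unsafe_free!(x)
```

After `unsafe_free!` the array must not be used again; it only makes sense once you know the buffer is no longer needed.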
These memory limit parameters only control how soon the GC is triggered **manually** under the hood, so they won't help you avoid GC calls. And [here](https://github.com/JuliaGPU/GPUArrays.jl/pull/550#issuecomment-2225915738) you can see that those GC...
Closing this, as we now have a caching allocator that does not rely on the GC, so allocations/deallocations are very fast: https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator
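Usage looks roughly like the following (a sketch from memory of the linked docs — the `AllocCache`/`@cached` names are my recollection of the GPUArrays.jl API and may differ in detail; the loop body is illustrative):

```julia
using AMDGPU, GPUArrays

# Allocations inside `@cached` are served from the cache on repeat
# iterations instead of going through the GC-driven pool each time.
cache = GPUArrays.AllocCache()
for step in 1:100
    GPUArrays.@cached cache begin
        x = AMDGPU.rand(Float32, 1024, 1024)
        y = x .* 2f0
    end
end

# Release the cached buffers once the hot loop is done.
GPUArrays.unsafe_free!(cache)
```

This fits iterative workloads (e.g. training loops) where each iteration allocates the same shapes over and over.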
@maleadt how does CUDA make an array (`cuda_pointer_ret = CuArray(pointer_ret)`) accessible on multiple GPUs in this case?