Valentin Churavy
In contrast, with a constant ndrange: ``` │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand` ; ││┌ @ abstractarray.jl:1291 within `getindex` ; │││┌ @ abstractarray.jl:1336 within `_getindex` ; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`...
I am not sure right now.
1. We could special-case 1D/2D/3D NDRanges.
2. Maybe https://github.com/maleadt/StaticCartesian.jl would help, but in this case we don't have a static set of cartesian...
Sadly, `@synchronize` doesn't work on the CPU within arbitrary control flow; see #330.
``` for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] if li + i
> (call to ijl_alloc_array_1d)

```
for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```
Use a tuple `()` instead.
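To illustrate the suggestion, here is a minimal plain-Julia sketch (no GPU packages involved). The function names and the loop body are hypothetical and only stand in for the original kernel code. The array literal normally heap-allocates a `Vector`, which is consistent with the `ijl_alloc_array_1d` call quoted above, whereas a tuple literal is an immutable value that needs no such allocation.

```julia
# Hypothetical helpers for illustration; not taken from the original kernel.

# Iterating over an array literal: the `Vector` is normally heap-allocated
# each time the function runs, which GPU compilation cannot support.
function sum_offsets_array(li)
    acc = 0
    for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
        acc += li + i
    end
    return acc
end

# Same loop over a tuple literal: same iteration order and result,
# without constructing an array.
function sum_offsets_tuple(li)
    acc = 0
    for i in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512)
        acc += li + i
    end
    return acc
end
```

Comparing `@allocated sum_offsets_array(1)` with `@allocated sum_offsets_tuple(1)` after a warm-up call is one way to check the difference on a given Julia version.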
Hm, on the GPU you shouldn't get inconsistent results; on the CPU, yes.
So the only thing that seems fishy to me is: ``` while (li + s)
You are trying to do something similar to https://github.com/JuliaGPU/CUDA.jl/blob/cb14a637e0b7b7be9ae01005ea9bdcf79b320189/src/mapreduce.jl#L62 IIUC? My long-term goal is to provide primitives for these kinds of block reductions (#234). So the difference I see...
Are you back to testing on the CPU? I should add a pass that errors when `@synchronize` is used inside control flow; KA's current design simply can't support that.
Aren't you missing a `+1` in `lhsIndex = 2 * d * (li - 1)`? For `li = 1`, `lhsIndex` will be 0, which is OOB.
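The arithmetic can be checked with a small plain-Julia sketch. The helper names are hypothetical, and the "shifted" variant is only one possible placement of the `+1`; where it actually belongs depends on the rest of the kernel, which is not shown here.

```julia
# Hypothetical helpers; `d` is the stride and `li` the 1-based local index.
lhs_index_original(d, li) = 2 * d * (li - 1)      # formula as quoted above
lhs_index_shifted(d, li)  = 2 * d * (li - 1) + 1  # assumed fix: add 1

# Julia arrays are 1-based, so index 0 fails the bounds check.
buf = zeros(8)
original_ok = checkbounds(Bool, buf, lhs_index_original(1, 1))
shifted_ok  = checkbounds(Bool, buf, lhs_index_shifted(1, 1))
```

With `d = 1` and `li = 1`, the original formula yields index 0 and the shifted one yields index 1, so only the latter is a valid index into `buf`.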