Writing tuple to array doesn't work with bounds check
I don't have a better title for this. Original example by @LasNikas.
using KernelAbstractions
using Adapt
using Metal
@kernel function mykernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end
backend = MetalBackend()
n = 10
a = Adapt.adapt(backend, zeros(Float32, 2, n))
mykernel!(backend)(a, ndrange=n)
The kernel doesn't seem to do anything:
julia> a
2×10 MtlMatrix{Float32, Metal.PrivateStorage}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
This happens both on v1.5.1 and on main:
julia> Metal.versioninfo()
macOS 15.3.1, Darwin 24.3.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Pro (384.000 KiB allocated)
On CUDA, this works as expected:
julia> a
2×10 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
Same with backend = CPU().
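For completeness, a minimal sketch of that CPU check (assuming the mykernel! definition from above; CPU() is exported by KernelAbstractions):

using KernelAbstractions
backend = CPU()
a_cpu = zeros(Float32, 2, n)
mykernel!(backend)(a_cpu, ndrange=n)
KernelAbstractions.synchronize(backend)
a_cpu  # first row all 1.0, second row all 2.0, as expected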
MWE without KA:
using Metal
n = 10
a = Metal.zeros(Float32, 2, n)
function mykernel!(a)
    i = thread_position_in_grid_1d()
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end
@metal threads=10 mykernel!(a)
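Reading the buffer back on the host (Array copies the data and synchronizes first) shows the same problem:

Array(a)  # expected rows of 1.0 and 2.0, but it stays all zeros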
@efaulhaber I don't have the technical knowledge to immediately understand what's wrong, but I seem to be able to get it working by adding @inbounds before the for loop in both the KA and Metal MWEs.
Does it work for you?
Do you mean this?
@inbounds for k in 1:2
    a[k, i] = t[k]
end
That doesn't work for me, neither with KA nor with your pure Metal example. @LasNikas reported the same for his original example, which I reduced to this MWE; he is on an M4 (Pro?), I think. Which CPU do you use?
I have an M2 Max, but I'd be surprised if that were the issue.
Can you share the output of the following code? After adding the required packages to your environment, you should be able to copy-paste it all in one go.
using KernelAbstractions, Adapt, Metal; @kernel function brokenkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end; @kernel function workingkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    @inbounds for k in 1:2
        a[k, i] = t[k]
    end
end; begin
    Metal.versioninfo()
    println()
    backend = MetalBackend()
    n = 10
    broken = Adapt.adapt(backend, zeros(Float32, 2, n))
    @show broken
    brokenkernel!(backend)(broken, ndrange=n)
    @show broken
    working = Adapt.adapt(backend, zeros(Float32, 2, n))
    @show working
    workingkernel!(backend)(working, ndrange=n)
    @show working;
end;
My output:
macOS 14.7.4, Darwin 23.6.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Max (384.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0]
@inbounds works for me.
And I get the following output for @christiangnrd's code:
macOS 15.4.1, Darwin 24.4.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.4.1
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.3.1
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M4 Pro (464.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0]
For me, even the @inbounds version stays all zeros. My output for the same script:
macOS 15.3.1, Darwin 24.3.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Pro (64.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
@efaulhaber We now have two separate issues: with bounds checking, the code doesn't work, and without bounds checking (@inbounds), it seems to be broken just for you.
The first issue can be worked around with @inbounds until a proper fix (if one exists) lands.
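Another possible workaround (an untested sketch; unrolledkernel! is a hypothetical name): avoid the dynamic tuple index t[k] entirely, since the compiler can typically constant-fold accesses with literal indices, so no bounds-check branch should survive.

@kernel function unrolledkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    # Literal tuple indices are usually constant-folded, so no runtime
    # bounds check on t should be emitted:
    a[1, i] = t[1]
    a[2, i] = t[2]
end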
For the second issue, can you provide the output of
@device_code_llvm mykernel!(backend)(a, ndrange=n)
for both with and without the @inbounds?
Edit: Are you also launching Julia with --check-bounds=yes? When I do that, it doesn't work even with @inbounds, since that flag forces bounds checks on and @inbounds is ignored.
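A quick way to check how the current session was launched (Base.JLOptions is internal, but stable in practice):

Base.JLOptions().check_bounds  # 0 = default (respect @inbounds), 1 = always on, 2 = always off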
Argh, I was using the VS Code REPL, which I configured to always launch with --check-bounds=yes for development. Without that flag, it works with @inbounds.
Still a super nasty bug if things stop working under --check-bounds=yes.