Writing tuple to array doesn't work with bounds check
I don't have a better title for this. Original example by @LasNikas.
using KernelAbstractions
using Adapt
using Metal
@kernel function mykernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end
backend = MetalBackend()
n = 10
a = Adapt.adapt(backend, zeros(Float32, 2, n))
mykernel!(backend)(a, ndrange=n)
The kernel doesn't seem to do anything:
julia> a
2×10 MtlMatrix{Float32, Metal.PrivateStorage}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
This happens both on v1.5.1 and on main:
julia> Metal.versioninfo()
macOS 15.3.1, Darwin 24.3.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Pro (384.000 KiB allocated)
On CUDA, this works as expected:
julia> a
2×10 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
Same with backend = CPU().
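For completeness, a minimal sketch of that CPU check (assuming the mykernel! definition from above; CPU() is exported by KernelAbstractions):

using KernelAbstractions
backend = CPU()
a_cpu = zeros(Float32, 2, n)
mykernel!(backend)(a_cpu, ndrange=n)
KernelAbstractions.synchronize(backend)
a_cpu  # first row all 1.0, second row all 2.0, as expected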
MWE without KA:
using Metal
n = 10
a = Metal.zeros(Float32, 2, n)
function mykernel!(a)
    i = thread_position_in_grid_1d()
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end
@metal threads=10 mykernel!(a)
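Reading the buffer back on the host (Array copies the data and synchronizes first) shows the same problem:

Array(a)  # expected rows of 1.0 and 2.0, but it stays all zeros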
@efaulhaber I don't have the technical knowledge to immediately understand what's wrong, but I seem to be able to get it working by adding @inbounds before the for loop in both the KA and Metal MWEs.
Does it work for you?
Do you mean this?
@inbounds for k in 1:2
    a[k, i] = t[k]
end
That doesn't work for me, neither with KA nor with your pure Metal example. @LasNikas reported the same for his original example, which I reduced to this MWE; he is on an M4 (Pro?), I think. Which CPU do you use?
I have an M2 Max, but I'd be surprised if that were the issue.
Can you share the output of the following code? After adding the required packages to your environment, you should be able to copy-paste it all in one go.
using KernelAbstractions, Adapt, Metal; @kernel function brokenkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    for k in 1:2
        a[k, i] = t[k]
    end
end; @kernel function workingkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    @inbounds for k in 1:2
        a[k, i] = t[k]
    end
end; begin
    Metal.versioninfo()
    println()
    backend = MetalBackend()
    n = 10
    broken = Adapt.adapt(backend, zeros(Float32, 2, n))
    @show broken
    brokenkernel!(backend)(broken, ndrange=n)
    @show broken
    working = Adapt.adapt(backend, zeros(Float32, 2, n))
    @show working
    workingkernel!(backend)(working, ndrange=n)
    @show working;
end;
My output:
macOS 14.7.4, Darwin 23.6.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Max (384.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0]
@inbounds works for me.
And I get the following output for @christiangnrd's code:
macOS 15.4.1, Darwin 24.4.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.4.1
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.3.1
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M4 Pro (464.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0]
For me, even the @inbounds version stays all zeros. My output for the same script:
macOS 15.3.1, Darwin 24.3.0
Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6
Julia packages:
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.5.0
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.4.0
- LLVMDowngrader_jll: 0.6.0+0
1 device:
- Apple M2 Pro (64.000 KiB allocated)
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
broken = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
working = Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
@efaulhaber We now have two separate issues: with bounds checking, the code doesn't work, and without bounds checking (@inbounds), it seems to be broken just for you.
The first issue can be worked around with @inbounds until a proper fix (if one exists) lands.
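Another possible workaround (an untested sketch; unrolledkernel! is a hypothetical name): avoid the dynamic tuple index t[k] entirely, since the compiler can typically constant-fold accesses with literal indices, so no bounds-check branch should survive.

@kernel function unrolledkernel!(a)
    i = @index(Global)
    t = (1, 2, 3)
    # Literal tuple indices are usually constant-folded, so no runtime
    # bounds check on t should be emitted:
    a[1, i] = t[1]
    a[2, i] = t[2]
end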
For the second issue, can you provide the output of
@device_code_llvm mykernel!(backend)(a, ndrange=n)
for both with and without the @inbounds?
Edit: Are you also launching Julia with --check-bounds=yes? When I do that, it doesn't work even with @inbounds, since that flag forces bounds checks on and @inbounds is ignored.
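A quick way to check how the current session was launched (Base.JLOptions is internal, but stable in practice):

Base.JLOptions().check_bounds  # 0 = default (respect @inbounds), 1 = always on, 2 = always off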
Argh, I was using the VS Code REPL, which I configured to always launch with --check-bounds=yes for development. Without that flag, it works with @inbounds.
Still a super nasty bug if things stop working under --check-bounds=yes.