Metal.jl @mtlprintf

PR status as of 2025-06-19:

Blocked by #433, which may be resolved by porting non-blocking synchronization from CUDA.jl as mentioned in #443/#532.
Figure out how to handle macOS versions without printing support (macOS 13-14)

Implement @mtlprintf and friends using os_log

TODO:

[x] Printing of floats does not work since they will be converted to doubles due to the vararg calling convention which will be caught by our IR checker
[x] Version check the @mtlprintf macro
[x] Add tests
[x] Capturing and logging are mutually exclusive

depends on: https://github.com/JuliaGPU/GPUCompiler.jl/pull/630 notify: https://github.com/JuliaGPU/Metal.jl/discussions/226

Sep 16 '24 11:09 tgymnich

@maleadt Any idea how we can implement the version check for the @mtlprintf macro? I know we could check the air version inside the kernel but I'd like to avoid that.

Also can we get rid of the double check in check_ir?

Sep 16 '24 14:09 tgymnich

Would it be worth benchmarking the performance difference between having logging active vs not?

Sep 16 '24 18:09 christiangnrd

Would it be worth benchmarking the performance difference between having logging active vs not?

@christiangnrd Sure. I don't expect there to be much overhead besides allocation of the log buffer and checking it for logs after running a kernel. But we might want to look into only conditionally adding MTLLogState since logging also prevents GPU frame capture.

Sep 17 '24 10:09 tgymnich

@maleadt Any idea how we can implement the version check for the @mtlprintf macro? I know we could check the air version inside the kernel but I'd like to avoid that.

Given that the macro expands way to early, I don't think there's anything we can do but checking in the kernel. Why are you opposed to that? GPUCompiler.jl has infrastructure to optimize those checks away, see e.g. how CUDA.jl exposes the device capability and PTX ISA version to the kernel.

Sep 17 '24 14:09 maleadt

We could also wrap the macro and accompanying functions in if Metal.macos_version() >= v"15".

Sep 17 '24 23:09 tgymnich

If we do that we should have definitions in both cases and give an informative error if Metal.macos_version() < v"15".

Sep 18 '24 00:09 christiangnrd

Actually, looks like I provided the run-time queries already: https://github.com/JuliaGPU/Metal.jl/blob/6c8291640bb43d2f19a4f5d59ac1ed4b4316580b/src/device/intrinsics/version.jl#L64-L65

So we can just use that in the generated code, generating an error when emitting code for an older platform. That of course depends on https://github.com/JuliaGPU/Metal.jl/pull/416 for meaningful reporting, but we'll get there.

I'd rather not simply check based on the macOS version during macro expansion, since we might want to target older Metal versions than the system supports.

Sep 18 '24 10:09 maleadt

Looks good! However do you know what's causing the tests to hang?

@christiangnrd The hangs are caused by this one line:

@print_and_throw "@mtlprintf requires Metal 3.2 (macOS 15) or higher"

Sep 19 '24 09:09 tgymnich

@maleadt Could we have one of the Apple Silicon runners upgraded to Sequoia so the output tests don't get ignored? Edit: All the runners are running 13.3.1. Should we also have one on macOS 14?

I would also like to see #420 merged first (with benchmarks run on macOS 15) to see how big the impact of enabling logging is.

Sep 21 '24 17:09 christiangnrd

@christiangnrd I recently made changes so that logging (e.g. MTLLogState and friends) is only enabled whenever we actually use the feature.

Sep 21 '24 19:09 tgymnich

Just pushed a whitespace-only formatting commit

Sep 21 '24 21:09 christiangnrd

@christiangnrd I recently made changes so that logging (e.g. MTLLogState and friends) is only enabled whenever we actually use the feature.

In that case I still think we should be able to test on macOS 15, but I think we should merge this as soon as it's ready.

Sep 21 '24 21:09 christiangnrd

I've upgraded one of the workers to macOS 15:

See the macos_version tag which can be used to select on this.

Sep 23 '24 14:09 maleadt

@christiangnrd The hangs are caused by this one line:
@print_and_throw "@mtlprintf requires Metal 3.2 (macOS 15) or higher"

How did this get fixed?

Sep 24 '24 05:09 maleadt

I've also updated one of the juliaecosystem workers to 15.0, so we can revert to that queue.

Sep 24 '24 09:09 maleadt

How did this get fixed?

I assume by no longer running when macos_version() < 15. I think I got ahead of myself with the review.

Sep 24 '24 11:09 christiangnrd

I assume by no longer running when macos_version() < 15.

Right; but that's not great. It means that any kernel using logging output will first generate a non-fatal error message on the host, and then hang in the kernel? Or, when we on macOS 15 use (the hypothetical, but useful) @metal metal=v"3.1" it would hang too?

EDIT: suggested capability implemented here: https://github.com/JuliaGPU/Metal.jl/pull/430

Sep 24 '24 11:09 maleadt

The following code hangs in the REPL, but not when run using include or when called from the terminal:

using Metal
function f()
    @mtlprintln("Testing...")
    return
end
@metal f()

Sep 24 '24 14:09 christiangnrd

The following code hangs in the REPL, but not when run using include or when called from the terminal

Isn't that because in the REPL we force synchronization via an AST transform hook? What happens if you synchronize manually?

Sep 24 '24 14:09 maleadt

@maleadt When I add Metal.synchronize() to the end of the file it hangs in all situations.

Sep 24 '24 14:09 christiangnrd

I opened an issue for the hang: https://github.com/JuliaGPU/Metal.jl/issues/433

Sep 27 '24 13:09 tgymnich

In the assumption that the conditional @print_and_throw generating just a trap on macOS 14 is what caused the hangs here, I simplified the logic to make the kernel launch code simply error when using unsupported logging. However, that does not fix the issue. In fact, even on my now upgraded macOS 15 installation a simple kernel doing I/O hangs...

julia> using Metal

julia> function kernel()
           @mtlprint("Hello, World\n")
           return
       end
kernel (generic function with 1 method)

julia> Metal.@sync @metal kernel()
Hello, World

# hang

The Metal.@sync is there just to illustrate what the AST transform hook is doing behind the scenes. So I guess we'll have to fix https://github.com/JuliaGPU/Metal.jl/issues/433 first, with the above being another datapoint that the generated IR is not necessarily what's the issue (which https://github.com/JuliaGPU/Metal.jl/issues/433#issuecomment-2380653135 already hinted towards).

Oct 01 '24 20:10 maleadt

Rebased, however, this has regressed and fails to compile now:

julia> function kernel()
           @mtlprint("Hello, World\n")
           return
       end
kernel (generic function with 1 method)

julia> Metal.@sync @metal kernel()
ERROR: Compilation to native code failed; see below for details.

Something with the IR downgrader not correctly handling the void ({}*)* @llvm.va_start. I'll take a look.

Dec 20 '24 11:12 maleadt

Alright, the downgrader fix was twofold, and will be bumped in https://github.com/JuliaPackaging/Yggdrasil/pull/10053:

only complain about complex constants with pointers, not simple null pointers: https://github.com/JuliaGPU/llvm-downgrade/commit/82cf06d8db804a123b49843f466f28ef756f4fa6
use correct type information for known intrinsics like void llvm.va_start(i8*) (instead of the opaque {}* we use by default to represent ptr): https://github.com/JuliaGPU/llvm-downgrade/commit/a5acfa5983ef6ca771549fea93950dda1a87c2c7

With that, we're back at the hang from https://github.com/JuliaGPU/Metal.jl/issues/433

Dec 20 '24 12:12 maleadt

Your PR requires formatting changes to meet the project's style guidelines. Please consider running Runic (git runic main) to apply these changes.

Click here to view the suggested changes.

diff --git a/lib/mtl/log_state.jl b/lib/mtl/log_state.jl
index 2e7459f3..c7c8a60b 100644
--- a/lib/mtl/log_state.jl
+++ b/lib/mtl/log_state.jl
@@ -19,8 +19,10 @@ export MTLLogState
 
 function MTLLogState(dev::MTLDevice, descriptor::MTLLogStateDescriptor)
     err = Ref{id{NSError}}(nil)
-    handle = @objc [dev::id{MTLDevice} newLogStateWithDescriptor:descriptor::id{MTLLogStateDescriptor}
-        error:err::Ptr{id{NSError}}]::id{MTLLogState}
+    handle = @objc [
+        dev::id{MTLDevice} newLogStateWithDescriptor:descriptor::id{MTLLogStateDescriptor}
+        error:err::Ptr{id{NSError}}
+    ]::id{MTLLogState}
     err[] == nil || throw(NSError(err[]))
-    MTLLogState(handle)
+    return MTLLogState(handle)
 end
diff --git a/test/output.jl b/test/output.jl
index 17047527..f792841a 100644
--- a/test/output.jl
+++ b/test/output.jl
@@ -1,150 +1,150 @@
 @testset "output" begin
 
-@static if Metal.macos_version() < v"15"
+    @static if Metal.macos_version() < v"15"
 
-@warn "Skipping output tests in macOS 14 and below"
+        @warn "Skipping output tests in macOS 14 and below"
 
-function kernel()
-    @mtlprint("Hello, World\n")
-    return
-end
-@test_throws "Logging is only supported on macOS 15 or higher" @metal kernel()
+        function kernel()
+            @mtlprint("Hello, World\n")
+            return
+        end
+        @test_throws "Logging is only supported on macOS 15 or higher" @metal kernel()
 
-else
+    else
 
-@testset "formatted output" begin
-    _, out = @grab_output @on_device @mtlprintf("")
-    @test out == ""
+        @testset "formatted output" begin
+            _, out = @grab_output @on_device @mtlprintf("")
+            @test out == ""
 
-    _, out = @grab_output @on_device @mtlprintf("Testing...\n")
-    @test out == "Testing...\n"
+            _, out = @grab_output @on_device @mtlprintf("Testing...\n")
+            @test out == "Testing...\n"
 
-    # narrow integer
-    _, out = @grab_output @on_device @mtlprintf("Testing %d %d...\n", Int32(1), Int32(2))
-    @test out == "Testing 1 2...\n"
+            # narrow integer
+            _, out = @grab_output @on_device @mtlprintf("Testing %d %d...\n", Int32(1), Int32(2))
+            @test out == "Testing 1 2...\n"
 
-    # wide integer
-    _, out = @grab_output @on_device @mtlprintf("Testing %ld %ld...\n", Int64(1), Int64(2))
-    @test out == "Testing 1 2...\n"
+            # wide integer
+            _, out = @grab_output @on_device @mtlprintf("Testing %ld %ld...\n", Int64(1), Int64(2))
+            @test out == "Testing 1 2...\n"
 
-    _, out = @grab_output @on_device begin
-        @mtlprintf("foo")
-        @mtlprintf("bar\n")
-    end
-    @test out == "foobar\n"
+            _, out = @grab_output @on_device begin
+                @mtlprintf("foo")
+                @mtlprintf("bar\n")
+            end
+            @test out == "foobar\n"
 
-    # c argument promotions
-    function kernel(A)
-        @mtlprintf("%f %f\n", A[1], A[1])
-        return
-    end
-    x = mtl(ones(2, 2))
-    _, out = @grab_output begin
-        Metal.@sync @metal kernel(x)
-    end
-    @test out == "1.000000 1.000000\n"
-end
+            # c argument promotions
+            function kernel(A)
+                @mtlprintf("%f %f\n", A[1], A[1])
+                return
+            end
+            x = mtl(ones(2, 2))
+            _, out = @grab_output begin
+                Metal.@sync @metal kernel(x)
+            end
+            @test out == "1.000000 1.000000\n"
+        end
 
-@testset "@mtlprint" begin
-    # basic @mtlprint/@mtlprintln
+        @testset "@mtlprint" begin
+            # basic @mtlprint/@mtlprintln
 
-    _, out = @grab_output @on_device @mtlprint("Hello, World\n")
-    @test out == "Hello, World\n"
+            _, out = @grab_output @on_device @mtlprint("Hello, World\n")
+            @test out == "Hello, World\n"
 
-    _, out = @grab_output @on_device @mtlprintln("Hello, World")
-    @test out == "Hello, World\n"
+            _, out = @grab_output @on_device @mtlprintln("Hello, World")
+            @test out == "Hello, World\n"
 
 
-    # argument interpolation (by the macro, so can use literals)
+            # argument interpolation (by the macro, so can use literals)
 
-    _, out = @grab_output @on_device @mtlprint("foobar")
-    @test out == "foobar"
+            _, out = @grab_output @on_device @mtlprint("foobar")
+            @test out == "foobar"
 
-    _, out = @grab_output @on_device @mtlprint(:foobar)
-    @test out == "foobar"
+            _, out = @grab_output @on_device @mtlprint(:foobar)
+            @test out == "foobar"
 
-    _, out = @grab_output @on_device @mtlprint("foo", "bar")
-    @test out == "foobar"
+            _, out = @grab_output @on_device @mtlprint("foo", "bar")
+            @test out == "foobar"
 
-    _, out = @grab_output @on_device @mtlprint("foobar ", 42)
-    @test out == "foobar 42"
+            _, out = @grab_output @on_device @mtlprint("foobar ", 42)
+            @test out == "foobar 42"
 
-    _, out = @grab_output @on_device @mtlprint("foobar $(42)")
-    @test out == "foobar 42"
+            _, out = @grab_output @on_device @mtlprint("foobar $(42)")
+            @test out == "foobar 42"
 
-    _, out = @grab_output @on_device @mtlprint("foobar $(4)", 2)
-    @test out == "foobar 42"
+            _, out = @grab_output @on_device @mtlprint("foobar $(4)", 2)
+            @test out == "foobar 42"
 
-    _, out = @grab_output @on_device @mtlprint("foobar ", 4, "$(2)")
-    @test out == "foobar 42"
+            _, out = @grab_output @on_device @mtlprint("foobar ", 4, "$(2)")
+            @test out == "foobar 42"
 
-    _, out = @grab_output @on_device @mtlprint(42)
-    @test out == "42"
+            _, out = @grab_output @on_device @mtlprint(42)
+            @test out == "42"
 
-    _, out = @grab_output @on_device @mtlprint(4, 2)
-    @test out == "42"
+            _, out = @grab_output @on_device @mtlprint(4, 2)
+            @test out == "42"
 
-    _, out = @grab_output @on_device @mtlprint(Any)
-    @test out == "Any"
+            _, out = @grab_output @on_device @mtlprint(Any)
+            @test out == "Any"
 
-    _, out = @grab_output @on_device @mtlprintln("foobar $(42)")
-    @test out == "foobar 42\n"
+            _, out = @grab_output @on_device @mtlprintln("foobar $(42)")
+            @test out == "foobar 42\n"
 
 
-    # argument types
+            # argument types
 
-    # we're testing the generated functions now, so can't use literals
-    function test_output(val, str)
-        canary = rand(Int32) # if we mess up the main arg, this one will print wrong
-        _, out = @grab_output @on_device @mtlprint(val, " (", canary, ")")
-        @test out == "$(str) ($(Int(canary)))"
-    end
+            # we're testing the generated functions now, so can't use literals
+            function test_output(val, str)
+                canary = rand(Int32) # if we mess up the main arg, this one will print wrong
+                _, out = @grab_output @on_device @mtlprint(val, " (", canary, ")")
+                @test out == "$(str) ($(Int(canary)))"
+            end
 
-    for typ in (Int16, Int32, Int64, UInt16, UInt32, UInt64)
-        test_output(typ(42), "42")
-    end
+            for typ in (Int16, Int32, Int64, UInt16, UInt32, UInt64)
+                test_output(typ(42), "42")
+            end
 
-    for typ in (Float32,)
-        test_output(typ(42), "42.000000")
-    end
+            for typ in (Float32,)
+                test_output(typ(42), "42.000000")
+            end
 
-    test_output(Cchar('c'), "c")
+            test_output(Cchar('c'), "c")
 
-    for typ in (Ptr{Cvoid}, Ptr{Int})
-        ptr = convert(typ, Int(0x12345))
-        test_output(ptr, "0x12345")
-    end
+            for typ in (Ptr{Cvoid}, Ptr{Int})
+                ptr = convert(typ, Int(0x00012345))
+                test_output(ptr, "0x12345")
+            end
 
-    test_output(true, "1")
-    test_output(false, "0")
+            test_output(true, "1")
+            test_output(false, "0")
 
-    test_output((1,), "(1,)")
-    test_output((1,2), "(1, 2)")
-    test_output((1,2,3.0f0), "(1, 2, 3.000000)")
+            test_output((1,), "(1,)")
+            test_output((1, 2), "(1, 2)")
+            test_output((1, 2, 3.0f0), "(1, 2, 3.000000)")
 
-    # escaping
+            # escaping
 
-    kernel1(val) = (@mtlprint(val); nothing)
-    _, out = @grab_output @on_device kernel1(42)
-    @test out == "42"
+            kernel1(val) = (@mtlprint(val); nothing)
+            _, out = @grab_output @on_device kernel1(42)
+            @test out == "42"
 
-    kernel2(val) = (@mtlprintln(val); nothing)
-    _, out = @grab_output @on_device kernel2(42)
-    @test out == "42\n"
-end
+            kernel2(val) = (@mtlprintln(val); nothing)
+            _, out = @grab_output @on_device kernel2(42)
+            @test out == "42\n"
+        end
 
-@testset "@mtlshow" begin
-    function kernel()
-        seven_i32 = Int32(7)
-        three_f32 = Float32(3)
-        @mtlshow seven_i32
-        @mtlshow three_f32
-        @mtlshow 1f0 + 4f0
-        return
-    end
+        @testset "@mtlshow" begin
+            function kernel()
+                seven_i32 = Int32(7)
+                three_f32 = Float32(3)
+                @mtlshow seven_i32
+                @mtlshow three_f32
+                @mtlshow 1.0f0 + 4.0f0
+                return
+            end
 
-    _, out = @grab_output @on_device kernel()
-    @test out == "seven_i32 = 7\nthree_f32 = 3.000000\n1.0f0 + 4.0f0 = 5.000000\n"
-end
-end
+            _, out = @grab_output @on_device kernel()
+            @test out == "seven_i32 = 7\nthree_f32 = 3.000000\n1.0f0 + 4.0f0 = 5.000000\n"
+        end
+    end
 end

Jan 27 '25 17:01 github-actions[bot]

@tgymnich Tests now seem to be passing locally with the 2 suggestions I just made. Is this still blocked?

I'll take care of the rebase since the conflicts are my fault.

Feb 04 '25 18:02 christiangnrd

This is awesome! Thank you so much @christiangnrd.

Feb 05 '25 00:02 tgymnich

How is this not blocked by https://github.com/JuliaGPU/Metal.jl/issues/433 anymore?

Feb 06 '25 10:02 maleadt

@maleadt https://github.com/JuliaGPU/Metal.jl/issues/433 is still an issue on some older devices. However the main issue with this PR (mentioned here: https://github.com/JuliaGPU/Metal.jl/pull/418#issuecomment-2386993263) can be resolved by using Core.print instead of Base.print in the command buffer callback.

Feb 07 '25 17:02 tgymnich

However the main issue with this PR (mentioned here: #418 (comment)) can be resolved by using Core.print instead of Base.print in the command buffer callback.

I'm having a hard time understanding that. If switching to a non-yielding print fixes the issue, it implies we were running into https://github.com/JuliaGPU/Metal.jl/issues/532. But that doesn't match you mentioning 100% GPU usage in https://github.com/JuliaGPU/Metal.jl/issues/433?

Feb 10 '25 09:02 maleadt