Add async background warmup to reduce first-kernel latency
Summary
The first GPU kernel in a Metal.jl session takes ~1.75 seconds due to one-time JIT initialization of the GPU compilation pipeline (GPUCompiler.jl, LLVM passes, etc.). This PR introduces async background warmup during package initialization, reducing this to 0.035-0.20 seconds: a 9-50x improvement in perceived first-kernel latency.
Problem
Users experience a jarring ~2 second delay on their first GPU operation:
```julia
using Metal
a = MtlArray(rand(Float32, 1024, 1024))
@time fill!(a, 1.0f0)  # 1.75s - unexpected!
@time fill!(a, 2.0f0)  # 0.001s - fast as expected
```
This causes:
- Misleading benchmark results (first iteration 50x slower)
- Poor first impressions for new users evaluating Metal.jl
- Confusion ("is this a memory issue? a bug?")
Root Cause Analysis
The delay was previously attributed to memory page faults on large arrays. Investigation revealed this is incorrect—the actual cause is JIT compilation:
| Evidence | Finding |
|---|---|
| 1KB array | Same 1.75s delay as 512MB |
| Storage mode | No difference (Private vs Shared) |
| Compilation stages | check_method (0.2s) + LLVM IR gen (1.1s) + AIR (0.1s) |
Solution
Start a minimal kernel compilation in the background during `__init__()` when multiple threads are available. By the time users run their first kernel, most or all initialization is complete.
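The mechanism might be sketched as follows. This is a minimal, hypothetical sketch: the names `_noop_kernel` and `_do_warmup` are illustrative, not the PR's actual identifiers.

```julia
using Metal

# A trivial kernel whose only purpose is to drive the full
# GPUCompiler/LLVM/AIR pipeline once, so later user kernels hit warm caches.
function _noop_kernel(a)
    a[1] = 1f0
    return
end

function _do_warmup()
    a = MtlArray{Float32}(undef, 1)   # 4-byte temporary allocation
    @metal threads=1 _noop_kernel(a)  # triggers the one-time JIT initialization
    Metal.unsafe_free!(a)             # freed immediately afterwards
    return nothing
end
```

Spawning `_do_warmup()` on a worker thread from `__init__()` is then enough for the one-time cost to be paid while the user's own code is still setting up.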
Key Discovery
Concurrent compilations share the one-time initialization overhead:
```
Warmup kernel: 1.620s
User kernel:   0.196s (concurrent, not 1.7s!)
Total wall:    1.808s
```
The user kernel benefits from shared initialization even when warmup hasn't completed, due to lock serialization in `mtlfunction`.
Changes
New Files
- `src/warmup.jl` - Warmup kernel and public `Metal.warmup()` API
- `test/warmup.jl` - Unit tests for warmup functionality
Modified Files
- `src/initialization.jl` - Add warmup task startup in `__init__()`
- `src/Metal.jl` - Include warmup module
API Additions
```julia
Metal.warmup(; blocking=true)  # Wait for warmup to complete
Metal.warmup(blocking=false)   # Return immediately
```
Note: `warmup` is not exported to avoid namespace pollution. Call it via `Metal.warmup()`.
Preferences
Users can disable warmup via `LocalPreferences.toml`:
```toml
[Metal]
warmup = false
```
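A hedged sketch of how such a preference could be consulted at load time using Preferences.jl; the `__init__` wiring and the `_do_warmup` entry point are illustrative, not the PR's exact code.

```julia
using Preferences

# Read the [Metal] warmup preference, defaulting to enabled.
const WARMUP_ENABLED = @load_preference("warmup", true)

function __init__()
    # ... existing initialization ...
    if WARMUP_ENABLED && Threads.nthreads() > 1
        Threads.@spawn _do_warmup()  # hypothetical internal warmup entry point
    end
end
```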
Performance
| Scenario | Before | After | Improvement |
|---|---|---|---|
| Explicit wait | 1.75s | 0.035s | 50x |
| Immediate (concurrent) | 1.75s | 0.20s | 9x |
| Typical workflow | 1.75s | 0.04-0.15s | 12-44x |
Trade-offs
What does the user lose? Nothing meaningful:
| Concern | Impact |
|---|---|
| Import time | Unchanged (~1.1s) - warmup runs in background, doesn't block |
| Memory | 4 bytes temporarily allocated, freed immediately |
| CPU | ~1.7s of single-threaded background work |
| Correctness | Unaffected |
| API | No breaking changes |
The background CPU usage is practically unnoticeable on modern Apple Silicon Macs (8+ cores). Benchmarks show <2% overhead on concurrent CPU workloads—well within measurement noise. The compilation work would happen anyway on the user's first kernel; we're simply shifting it to run earlier in the background while the user's code is still setting up.
Users who need to measure cold-start compilation (debugging/profiling) can disable via preference.
Why This Matters
Misleading Benchmarks Lead to Wasted Debugging Time
Without warmup, users comparing CPU vs GPU performance get dramatically wrong conclusions:
```
Matrix multiply (4096×4096 Float32):
CPU:               0.306s
GPU (first call):  1.012s  ← User thinks GPU is 3x SLOWER
GPU (second call): 0.019s  ← Actual: GPU is 16x FASTER
```
A user unaware of this one-time JIT cost might:
- Conclude Metal.jl is slower than CPU and abandon it
- Spend hours debugging a non-existent "performance bug"
- File issues about inconsistent profiling results
- Distrust their own benchmarks
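The standard workaround is to warm the kernel manually before timing. A sketch, assuming `Metal.@sync` to make sure GPU execution is included in the measurement:

```julia
using Metal

a = MtlArray(rand(Float32, 4096, 4096))
b = MtlArray(rand(Float32, 4096, 4096))

a * b                     # first call pays the one-time JIT cost
@time Metal.@sync a * b   # subsequent calls reflect steady-state performance
```

Background warmup makes this hygiene unnecessary for casual evaluation, though careful benchmarking should still discard the first iteration.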
First Impressions for New Users
(Highly relevant for computational scientists specializing in biology, neuroscience, chemistry, etc., who may not know or care about compilation mechanics despite being Julia's target audience.)
When someone evaluates Metal.jl for the first time:
```julia
julia> using Metal

julia> a = MtlArray([1, 2, 3])

julia> @time a .+ 1  # 1.7s delay - "is this broken?"
```
This 2-second hang on a trivial operation creates a poor first impression, especially compared to frameworks like PyTorch or CUDA.jl where GPU operations feel instant. With async warmup, the experience becomes what users expect—responsive from the first interaction.
Testing
All existing tests pass. New tests added:
- Warmup task starts and completes without error
- `Metal.warmup()` API works correctly
- Kernel compilation is fast after warmup
- Concurrent compilations don't deadlock
Community Concerns
Single-threaded REPL blocking
Concern: In single-threaded mode, Julia's cooperative scheduling means JIT compilation doesn't yield, potentially blocking the REPL during warmup.
Response: Metal.jl users are already pursuing GPU computing on Apple Silicon, so it's reasonable to expect they have explored CPU parallelism first (setting `-t auto` or `JULIA_NUM_THREADS`), which is typically a prerequisite step before GPU work for real end users in scientific computing.
Default to old behaviour: Warmup only runs when `Threads.nthreads() > 1` (i.e., when Julia is started with `-t auto` or `JULIA_NUM_THREADS > 1`).
With a single thread, Julia's cooperative task runtime means an async task would block the main thread during JIT compilation, potentially hurting perceived REPL latency. To avoid this, warmup is skipped entirely in single-threaded mode, so those users get exactly the same behaviour as before this PR.
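The gate described above depends only on the session's thread count, which users can check directly; a small sketch:

```julia
if Threads.nthreads() > 1
    println("multi-threaded session: background warmup will run")
else
    println("single-threaded session: warmup skipped; start Julia with ",
            "`julia -t auto` to enable it")
end
```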
This addresses @vchuravy's concern about REPL blocking while still providing the optimization for the common case (multi-threaded Julia for Metal.jl users on Apple Silicon).
I think this is the wrong approach. A task started in the background can negatively impact perceived latency, by blocking the REPL as an example.
There is https://github.com/JuliaGPU/GPUCompiler.jl/blob/e4a697f3b77f5c4ccb3a63354731c022648026d7/src/jlgen.jl#L681 to allow for precompilation of compiler jobs which would warm up the infrastructure and allow you to move this work to precompilation time.
It's async, and measurements are provided in the PR description.
Julia uses a cooperative task runtime, so saying that something is async doesn't mean that much. If you launch single threaded, the thread will be blocked.
Updated PR description to address these concerns.
Codecov Report
:x: Patch coverage is 21.73913% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 80.54%. Comparing base (239fa4d) to head (d4db4a1).
:warning: Report is 5 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/warmup.jl | 21.05% | 15 Missing :warning: |
| src/initialization.jl | 25.00% | 3 Missing :warning: |
Additional details and impacted files
```
@@ Coverage Diff @@
## main     #721    +/-  ##
==========================================
- Coverage  80.96%  80.54%  -0.42%
==========================================
  Files     62      63      +1
  Lines     2837    2858    +21
==========================================
+ Hits      2297    2302    +5
- Misses    540     556     +16
```
I overlooked that `@async` pins tasks to the scheduling thread. Applied `Threads.@spawn` in d4db4a1f.
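The distinction can be illustrated with a small sketch; the observed thread ids depend on how many threads the session was started with.

```julia
# A task created with `@async` is sticky to the thread that scheduled it
# (thread 1 in the REPL), so a long JIT compilation inside it would block
# that thread. `Threads.@spawn` may place the task on any worker thread.
sticky  = @async Threads.threadid()
spawned = Threads.@spawn Threads.threadid()

@show fetch(sticky)   # always the scheduling thread's id
@show fetch(spawned)  # may differ when Julia has multiple threads
```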