Add async background warmup to reduce first-kernel latency
Summary
The first GPU kernel in a Metal.jl session takes ~1.75 seconds due to one-time JIT initialization of the GPU compilation pipeline (GPUCompiler.jl, LLVM passes, etc.). This PR introduces async background warmup during package initialization, reducing this to 0.035-0.20 seconds: a 9-50x improvement in perceived first-kernel latency.
Problem
Users experience a jarring ~2 second delay on their first GPU operation:
```julia
using Metal
a = MtlArray(rand(Float32, 1024, 1024))
@time fill!(a, 1.0f0)  # 1.75s - unexpected!
@time fill!(a, 2.0f0)  # 0.001s - fast as expected
```
This causes:
- Misleading benchmark results (first iteration 50x slower)
- Poor first impressions for new users evaluating Metal.jl
- Confusion ("is this a memory issue? a bug?")
Root Cause Analysis
The delay was previously attributed to memory page faults on large arrays. Investigation revealed this is incorrect—the actual cause is JIT compilation:
| Evidence | Finding |
|---|---|
| 1KB array | Same 1.75s delay as 512MB |
| Storage mode | No difference (Private vs Shared) |
| Compilation stages | check_method (0.2s) + LLVM IR gen (1.1s) + AIR (0.1s) |
Solution
Start a minimal kernel compilation in the background during `__init__()` when multiple threads are available. By the time users run their first kernel, most or all initialization is complete.
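The mechanism might be sketched as follows. This is a minimal, hypothetical sketch: the names `_noop_kernel` and `_do_warmup` are illustrative, not the PR's actual identifiers.

```julia
using Metal

# A trivial kernel whose only purpose is to drive the full
# GPUCompiler/LLVM/AIR pipeline once, so later user kernels hit warm caches.
function _noop_kernel(a)
    a[1] = 1f0
    return
end

function _do_warmup()
    a = MtlArray{Float32}(undef, 1)   # 4-byte temporary allocation
    @metal threads=1 _noop_kernel(a)  # triggers the one-time JIT initialization
    Metal.unsafe_free!(a)             # freed immediately afterwards
    return nothing
end
```

Spawning `_do_warmup()` on a worker thread from `__init__()` is then enough for the one-time cost to be paid while the user's own code is still setting up.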
Key Discovery
Concurrent compilations share the one-time initialization overhead:
```
Warmup kernel: 1.620s
User kernel:   0.196s (concurrent, not 1.7s!)
Total wall:    1.808s
```
The user kernel benefits from shared initialization even when warmup hasn't completed, due to lock serialization in `mtlfunction`.
Changes
New Files
- `src/warmup.jl` - Warmup kernel and public `Metal.warmup()` API
- `test/warmup.jl` - Unit tests for warmup functionality
Modified Files
- `src/initialization.jl` - Add warmup task startup in `__init__()`
- `src/Metal.jl` - Include warmup module
API Additions
```julia
Metal.warmup(; blocking=true)  # Wait for warmup to complete
Metal.warmup(blocking=false)   # Return immediately
```
Note: `warmup` is not exported to avoid namespace pollution. Call it via `Metal.warmup()`.
Preferences
Users can disable warmup via `LocalPreferences.toml`:
```toml
[Metal]
warmup = false
```
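A hedged sketch of how such a preference could be consulted at load time using Preferences.jl; the `__init__` wiring and the `_do_warmup` entry point are illustrative, not the PR's exact code.

```julia
using Preferences

# Read the [Metal] warmup preference, defaulting to enabled.
const WARMUP_ENABLED = @load_preference("warmup", true)

function __init__()
    # ... existing initialization ...
    if WARMUP_ENABLED && Threads.nthreads() > 1
        Threads.@spawn _do_warmup()  # hypothetical internal warmup entry point
    end
end
```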
Performance
| Scenario | Before | After | Improvement |
|---|---|---|---|
| Explicit wait | 1.75s | 0.035s | 50x |
| Immediate (concurrent) | 1.75s | 0.20s | 9x |
| Typical workflow | 1.75s | 0.04-0.15s | 12-44x |
Trade-offs
What does the user lose? Nothing meaningful:
| Concern | Impact |
|---|---|
| Import time | Unchanged (~1.1s) - warmup runs in background, doesn't block |
| Memory | 4 bytes temporarily allocated, freed immediately |
| CPU | ~1.7s of single-threaded background work |
| Correctness | Unaffected |
| API | No breaking changes |
The background CPU usage is practically unnoticeable on modern Apple Silicon Macs (8+ cores). Benchmarks show <2% overhead on concurrent CPU workloads—well within measurement noise. The compilation work would happen anyway on the user's first kernel; we're simply shifting it to run earlier in the background while the user's code is still setting up.
Users who need to measure cold-start compilation (debugging/profiling) can disable via preference.
Why This Matters
Misleading Benchmarks Lead to Wasted Debugging Time
Without warmup, users comparing CPU vs GPU performance get dramatically wrong conclusions:
```
Matrix multiply (4096×4096 Float32):
CPU:               0.306s
GPU (first call):  1.012s  ← User thinks GPU is 3x SLOWER
GPU (second call): 0.019s  ← Actual: GPU is 16x FASTER
```
A user unaware of this one-time JIT cost might:
- Conclude Metal.jl is slower than CPU and abandon it
- Spend hours debugging a non-existent "performance bug"
- File issues about inconsistent profiling results
- Distrust their own benchmarks
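The standard workaround is to warm the kernel manually before timing. A sketch, assuming `Metal.@sync` to make sure GPU execution is included in the measurement:

```julia
using Metal

a = MtlArray(rand(Float32, 4096, 4096))
b = MtlArray(rand(Float32, 4096, 4096))

a * b                     # first call pays the one-time JIT cost
@time Metal.@sync a * b   # subsequent calls reflect steady-state performance
```

Background warmup makes this hygiene unnecessary for casual evaluation, though careful benchmarking should still discard the first iteration.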
First Impressions for New Users
(Highly relevant for computational scientists specializing in biology, neuroscience, chemistry, etc., who may not know or care about compilation mechanics despite being Julia's target audience.)
When someone evaluates Metal.jl for the first time:
```julia
julia> using Metal

julia> a = MtlArray([1, 2, 3])

julia> @time a .+ 1  # 1.7s delay - "is this broken?"
```
This 2-second hang on a trivial operation creates a poor first impression, especially compared to frameworks like PyTorch or CUDA.jl where GPU operations feel instant. With async warmup, the experience becomes what users expect—responsive from the first interaction.
Testing
All existing tests pass. New tests added:
- Warmup task starts and completes without error
- `Metal.warmup()` API works correctly
- Kernel compilation is fast after warmup
- Concurrent compilations don't deadlock
Community Concerns
Single-threaded REPL blocking
Concern: In single-threaded mode, Julia's cooperative scheduling means JIT compilation doesn't yield, potentially blocking the REPL during warmup.
Response: Metal.jl users are already pursuing GPU computing on Apple Silicon, so it's reasonable to expect they have explored CPU parallelism first (setting `-t auto` or `JULIA_NUM_THREADS`), which is typically a prerequisite step before GPU work for real end users in scientific computing.
Default to old behaviour: Warmup only runs when `Threads.nthreads() > 1` (i.e., when Julia is started with `-t auto` or `JULIA_NUM_THREADS > 1`).
With a single thread, Julia's cooperative task runtime means an async task would block the main thread during JIT compilation, potentially hurting perceived REPL latency. To avoid this, warmup is skipped entirely in single-threaded mode, so those users get exactly the same behaviour as before this PR.
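The gate described above depends only on the session's thread count, which users can check directly; a small sketch:

```julia
if Threads.nthreads() > 1
    println("multi-threaded session: background warmup will run")
else
    println("single-threaded session: warmup skipped; start Julia with ",
            "`julia -t auto` to enable it")
end
```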
This addresses @vchuravy's concern about REPL blocking while still providing the optimization for the common case (multi-threaded Julia for Metal.jl users on Apple Silicon).
I think this is the wrong approach. A task started in the background can negatively impact perceived latency, by blocking the REPL as an example.
There is https://github.com/JuliaGPU/GPUCompiler.jl/blob/e4a697f3b77f5c4ccb3a63354731c022648026d7/src/jlgen.jl#L681 to allow for precompilation of compiler jobs which would warm up the infrastructure and allow you to move this work to precompilation time.
It's async, and measurements are provided in the PR description.
Julia uses a cooperative task runtime, so saying that something is async doesn't mean that much. If you launch single threaded, the thread will be blocked.
Updated PR description to address these concerns.
Codecov Report
:x: Patch coverage is 21.73913% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 80.54%. Comparing base (239fa4d) to head (d4db4a1).
:warning: Report is 5 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/warmup.jl | 21.05% | 15 Missing :warning: |
| src/initialization.jl | 25.00% | 3 Missing :warning: |
Additional details and impacted files
```
@@ Coverage Diff @@
## main     #721    +/-  ##
==========================================
- Coverage  80.96%  80.54%  -0.42%
==========================================
  Files     62      63      +1
  Lines     2837    2858    +21
==========================================
+ Hits      2297    2302    +5
- Misses    540     556     +16
```
I overlooked that `@async` pins tasks to the scheduling thread. Applied `Threads.@spawn` in d4db4a1f.
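The distinction can be illustrated with a small sketch; the observed thread ids depend on how many threads the session was started with.

```julia
# A task created with `@async` is sticky to the thread that scheduled it
# (thread 1 in the REPL), so a long JIT compilation inside it would block
# that thread. `Threads.@spawn` may place the task on any worker thread.
sticky  = @async Threads.threadid()
spawned = Threads.@spawn Threads.threadid()

@show fetch(sticky)   # always the scheduling thread's id
@show fetch(spawned)  # may differ when Julia has multiple threads
```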