add GPUCompiler precompilation caching
Adds the ability to precompile code into GPUCompiler.GLOBAL_CI_CACHES. It taps into the regular (non-GPU) caching of global constants to write out the current instance of the global cache and reload it at initialization. The user is required to declare, initialize, and snapshot a local cache, and then call GPUCompiler.precompile_gpucompiler. Mainly this adds an API for downstream packages such as Enzyme and CUDA to cache instances of their functions. A sample SimpleGPU package and Example.jl illustrate usage.
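A minimal sketch of the intended usage, based on the description above; `MyGPUPackage` and `mykernel` are placeholders, and the exact signature of `precompile_gpucompiler` is assumed rather than verified:

```julia
module MyGPUPackage

import GPUCompiler

GPUCompiler.@declare_cache()   # declare a package-local cache

mykernel(x) = x + 1            # stand-in for the package's real kernels
# populate the local cache during precompilation (signature assumed)
GPUCompiler.precompile_gpucompiler(mykernel, (Int,))

function __init__()
    GPUCompiler.@reinit_cache()  # reload the snapshot when the package loads
end

GPUCompiler.@snapshot_cache()    # snapshot the cache into this package's .ji

end # module
```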
You forgot to commit precompile_native.jl.
Codecov Report
Patch coverage has no change and project coverage change: -10.21% :warning:
Comparison is base (d5086fb) 87.08% compared to head (cc34d21) 76.87%.
:exclamation: Current head cc34d21 differs from pull request most recent head 1951087. Consider uploading reports for the commit 1951087 to get more accurate results.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #425       +/-   ##
===========================================
- Coverage   87.08%   76.87%   -10.21%
===========================================
  Files          24       25        +1
  Lines        2943     2993       +50
===========================================
- Hits         2563     2301      -262
- Misses        380      692      +312
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GPUCompiler.jl | 100.00% <ø> (ø) | |
| src/jlgen.jl | 66.86% <0.00%> (-16.57%) | :arrow_down: |
| src/precompilation_cache.jl | 0.00% <0.00%> (ø) | |
Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream, lots of functionality is entirely unused (e.g. reinit_cache).
I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., how JET.jl does it https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396
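For context, the JET.jl pattern linked above amounts to running the expensive entry points in top-level code during precompilation; a generic sketch (the function is a stand-in, not JET's actual API):

```julia
module SomePackage

# stand-in for an expensive, inference-heavy entry point
analyze(xs) = sum(abs2, xs)

# top-level code executes while the package is precompiled, so the
# CodeInstances produced by this call are serialized into SomePackage's .ji
analyze([1.0, 2.0, 3.0])

end # module
```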
Updated initial comment and added some example code. Hope this clears some things up!
Not really, sorry. Could you describe what's the problem you want to solve, why it doesn't work with current precompilation tools, and why you opted for the design you did? Those global undocumented macros (doing questionable things) are a very non-Julian API.
The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns. This commit fixes that, at the cost of some user input, and would improve time-to-first-x for anything that relies on GPUCompiler. I have a pull request for Enzyme in the works as one downstream use case; another is in CUDA.
I have been working with @vchuravy on this, with an eventual extension being to cache binary code between runs, not just inferred type information.
The reason for so much user involvement and the use of macros is that this was the simplest way forward. We use macros to create a local cache, outside of user control, with a unique id that does not conflict with user code; we want a unique cache to eliminate duplicate entries. We also tried making all of this run at init time, but that was too late: the caches had already been serialized by that point, so user involvement was needed.
We definitely want to reduce this, but it is a first polished attempt at the matter.
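To make the "local cache with a unique id" idea concrete, a rough sketch of what such a macro could expand to; this is hypothetical and not necessarily the PR's implementation:

```julia
# hypothetical expansion: give each calling module a fixed, mangled-name
# global that user code will not collide with, so the companion macros
# (@snapshot_cache / @reinit_cache) can locate it again later
macro declare_cache()
    esc(:(const var"##GPUCompiler_local_cache" = Dict{Any,Any}()))
end
```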
> The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns.

Why not? It's just a global dict, why doesn't it get serialized in the .ji file?
It is serialized, it just happens too early in the process. By the time the dependent packages have inserted into the cache, it is too late for the global to be serialized again. Additionally, multiple downstream packages can now mutate the cache and still each see cache improvements.
> It is serialized, it just happens too early in the process.
Repeating my comment from Slack: Is this because the global is serialized as part of the GPUCompiler.ji, and isn't part of, e.g., CUDA.jl's precompilation image? In that case, you could override ci_cache and use a Dict that's serialized as part of the downstream package, in order to avoid this complexity.
If that turns out to be the way to do it, we could even remove the global CI cache here to force users to bring their own (and thus get proper precompilation of GPUCompiler-inferred CIs).
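A sketch of that suggestion, assuming GPUCompiler's `ci_cache(::CompilerJob)` interface method and its `CodeCache` type (the PTX target is just an example backend):

```julia
module MyBackend

using GPUCompiler

# this cache lives in MyBackend's module, so it is serialized into
# MyBackend.ji rather than into GPUCompiler's precompilation image
const LOCAL_CI_CACHE = GPUCompiler.CodeCache()

# route inference for this backend's compiler jobs to the local cache
GPUCompiler.ci_cache(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget}) =
    LOCAL_CI_CACHE

end # module
```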
So the overarching design consideration is:
Users of Enzyme.jl/CUDA.jl/AMDGPU.jl should be able to "precompile" their code. Where can we store these precompilation results, while also ensuring that they get invalidated properly?
Each user package will need to declare an "anchor"/cache that will be serialized into that package's .ji.
So the workflow is something like:
```julia
module ParallelStencil

using CUDA
using Enzyme
import GPUCompiler

GPUCompiler.@declare_cache() # anchor

f(x) = 1
CUDA.precompile_cuda(f, (Int,))
Enzyme.precompile_fwd(f, (Int,))

function __init__()
    GPUCompiler.@reinit_cache()
end

GPUCompiler.@snapshot_cache()

end # module
```
So it is not the downstream packages of GPUCompiler that need to bring their own cache, but the users of those packages. We use the cache file of ParallelStencil to save the cache entries that were discovered during precompilation of PS, and we then need to re-insert those cache entries into the cache.
That's at least the high-level design Collin and I came up with.
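Conceptually, the snapshot/delta/re-insert steps reduce to set operations over cache entries; a toy model (plain `Set`s stand in for the real CodeCache structures):

```julia
# toy model of the workflow
cache = Set(["entry A"])          # global cache before the workload

snap = copy(cache)                # snapshot at the start of precompilation
push!(cache, "entry B")           # precompile workload adds entries
delta = setdiff(cache, snap)      # delta: only the newly added entries

# in a fresh session, __init__ merges the serialized delta back in
union!(cache, delta)
```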
Why does the API consist of macros? Why doesn't something like this work:
```julia
module DownstreamPackage

using GPUCompiler, CUDA

const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)

__init__() = GPUCompiler.ci_cache_insert(cache)

end
```
That would seem to work. Updating now.
Downstream packages probably should not serialize the entire cache snapshot, but rather do something like:
```julia
module DownstreamPackage

using GPUCompiler, CUDA

const cache = let
    cache_snapshot = GPUCompiler.ci_cache_snapshot()
    include("precompile.jl")
    GPUCompiler.ci_cache_delta(cache_snapshot)
end

__init__() = GPUCompiler.ci_cache_insert(cache)

end
```
But that doesn't change the actual API.
Changed the API to follow @maleadt's advice, which leads to a cleaner interface. Added an example kernel with caching at test/ExamplePersistentCache/GPUKernel.jl. Using this, you get a persistent cache, which reduces recompilation time on consecutive `using Package` calls when restarting Julia.
Remaining work is to test integration with downstream packages such as Enzyme, Oceananigans, CUDA, AMDGPU, etc. Additionally, there are potentially some algorithmic improvements to the merge algorithm to bring precompile times with and without this feature more in line.
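One hypothetical way to observe the effect, using the example package mentioned above (the entry-point name is a guess):

```julia
# first session: kernels compile and the cache delta is serialized
julia> using GPUKernel
julia> @time GPUKernel.main()   # hypothetical entry point; slow first call

# after restarting Julia, the reloaded cache entries should make the
# same first call noticeably faster
julia> using GPUKernel
julia> @time GPUKernel.main()
```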
cc @aviatesk, this may be relevant to DAECompiler (as a workaround, until we have the ability to update another module's globals, i.e., a ci cache).
We see a greater performance improvement when this is used during Enzyme.jl's precompilation phase: https://github.com/EnzymeAD/Enzyme.jl/pull/760
It also improves downstream CUDA code, e.g. the creation of two CuArrays and a vectorized add.
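For reference, that workload is roughly the following sketch, using standard CUDA.jl operations (array sizes are illustrative):

```julia
using CUDA

a = CuArray(rand(Float32, 1024))   # first CuArray
b = CuArray(rand(Float32, 1024))   # second CuArray
# vectorized add: broadcasting compiles a GPU kernel on first use,
# which is exactly where a persistent cache saves time
c = a .+ b
```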
