add GPUCompiler precompilation caching
Adds the ability to precompile code into GPUCompiler.GLOBAL_CI_CACHES. It taps into the regular (non-GPU) caching of global constants to write out the current instance of the global cache and reload it at initialization. The user is required to declare, initialize, and snapshot a local cache, and then call GPUCompiler.precompile_gpucompiler. Mainly this adds an API for downstream packages such as Enzyme and CUDA to cache instances of their functions. A sample SimpleGPU package and Example.jl illustrate usage.
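A minimal sketch of the intended usage, based on the description above; `MyGPUPackage` and `mykernel` are placeholders, and the exact signature of `precompile_gpucompiler` is assumed rather than verified:

```julia
module MyGPUPackage

import GPUCompiler

GPUCompiler.@declare_cache()   # declare a package-local cache

mykernel(x) = x + 1            # stand-in for the package's real kernels
# populate the local cache during precompilation (signature assumed)
GPUCompiler.precompile_gpucompiler(mykernel, (Int,))

function __init__()
    GPUCompiler.@reinit_cache()  # reload the snapshot when the package loads
end

GPUCompiler.@snapshot_cache()    # snapshot the cache into this package's .ji

end # module
```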
You forgot to commit precompile_native.jl.
Codecov Report
Patch coverage has no change and project coverage change: -10.21% :warning:
Comparison is base (d5086fb) 87.08% compared to head (cc34d21) 76.87%.
:exclamation: Current head cc34d21 differs from pull request most recent head 1951087. Consider uploading reports for the commit 1951087 to get more accurate results.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #425       +/-   ##
===========================================
- Coverage   87.08%   76.87%   -10.21%
===========================================
  Files          24       25        +1
  Lines        2943     2993       +50
===========================================
- Hits         2563     2301      -262
- Misses        380      692      +312
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GPUCompiler.jl | 100.00% <ø> (ø) | |
| src/jlgen.jl | 66.86% <0.00%> (-16.57%) | :arrow_down: |
| src/precompilation_cache.jl | 0.00% <0.00%> (ø) | |
Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream, lots of functionality is entirely unused (e.g. reinit_cache).
I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., how JET.jl does it https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396
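For context, the JET.jl pattern linked above amounts to running the expensive entry points in top-level code during precompilation; a generic sketch (the function is a stand-in, not JET's actual API):

```julia
module SomePackage

# stand-in for an expensive, inference-heavy entry point
analyze(xs) = sum(abs2, xs)

# top-level code executes while the package is precompiled, so the
# CodeInstances produced by this call are serialized into SomePackage's .ji
analyze([1.0, 2.0, 3.0])

end # module
```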
Updated initial comment and added some example code. Hope this clears some things up!
Not really, sorry. Could you describe what's the problem you want to solve, why it doesn't work with current precompilation tools, and why you opted for the design you did? Those global undocumented macros (doing questionable things) are a very non-Julian API.
The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns. This commit fixes that, at the cost of some user input, and would improve time-to-first-x for anything that relies on GPUCompiler. I have a pull request for Enzyme in the works as one downstream use case; another is in CUDA.
I have been working with @vchuravy on this, with an eventual extension being to cache binary code between runs, not just inferred type information.
The reason for so much user involvement and the use of macros is that this was the simplest way forward. We use macros to create a local cache, outside of user control, with a unique id that does not conflict with user code; we want a unique cache to eliminate duplicate entries. We also tried making all of this run at init time, but that was too late: the caches had already been serialized by that point, so user involvement was needed.
We definitely want to reduce this, but it is a first polished attempt at the matter.
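To make the "local cache with a unique id" idea concrete, a rough sketch of what such a macro could expand to; this is hypothetical and not necessarily the PR's implementation:

```julia
# hypothetical expansion: give each calling module a fixed, mangled-name
# global that user code will not collide with, so the companion macros
# (@snapshot_cache / @reinit_cache) can locate it again later
macro declare_cache()
    esc(:(const var"##GPUCompiler_local_cache" = Dict{Any,Any}()))
end
```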
> The main issue is that GPUCompiler's GLOBAL_CI_CACHES are not persistent across reruns.

Why not? It's just a global dict, why doesn't it get serialized in the .ji file?
It is serialized, it just happens too early in the process. By the time the dependent packages have inserted into the cache, it is too late for the global to be serialized again. Additionally, multiple downstream packages can now mutate the cache and still each see cache improvements.
> It is serialized, it just happens too early in the process.
Repeating my comment from Slack: Is this because the global is serialized as part of the GPUCompiler.ji, and isn't part of, e.g., CUDA.jl's precompilation image? In that case, you could override ci_cache and use a Dict that's serialized as part of the downstream package, in order to avoid this complexity.
If that turns out to be the way to do it, we could even remove the global CI cache here to force users to bring their own (and thus get proper precompilation of GPUCompiler-inferred CIs).
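A sketch of that suggestion, assuming GPUCompiler's `ci_cache(::CompilerJob)` interface method and its `CodeCache` type (the PTX target is just an example backend):

```julia
module MyBackend

using GPUCompiler

# this cache lives in MyBackend's module, so it is serialized into
# MyBackend.ji rather than into GPUCompiler's precompilation image
const LOCAL_CI_CACHE = GPUCompiler.CodeCache()

# route inference for this backend's compiler jobs to the local cache
GPUCompiler.ci_cache(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget}) =
    LOCAL_CI_CACHE

end # module
```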
So the overarching design consideration is:
Users of Enzyme.jl/CUDA.jl/AMDGPU.jl should be able to "precompile" their code. Where can we store these precompilation results, while also ensuring that they get invalidated properly?
Each user package will need to declare an "anchor"/cache that will be serialized into that package's .ji.
So the workflow is something like:
```julia
module ParallelStencil

using CUDA
using Enzyme
import GPUCompiler

GPUCompiler.@declare_cache() # anchor

f(x) = 1
CUDA.precompile_cuda(f, (Int,))
Enzyme.precompile_fwd(f, (Int,))

function __init__()
    GPUCompiler.@reinit_cache()
end

GPUCompiler.@snapshot_cache()

end # module
```
So it is not the downstream packages of GPUCompiler that need to bring their own cache, but the users of those packages. We use the cache file of ParallelStencil to save the cache entries that were discovered during precompilation of PS, and we then need to re-insert those cache entries into the cache.
That's at least the high-level design Collin and I came up with.
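Conceptually, the snapshot/delta/re-insert steps reduce to set operations over cache entries; a toy model (plain `Set`s stand in for the real CodeCache structures):

```julia
# toy model of the workflow
cache = Set(["entry A"])          # global cache before the workload

snap = copy(cache)                # snapshot at the start of precompilation
push!(cache, "entry B")           # precompile workload adds entries
delta = setdiff(cache, snap)      # delta: only the newly added entries

# in a fresh session, __init__ merges the serialized delta back in
union!(cache, delta)
```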
Why does the API consist of macros? Why doesn't something like this work:
```julia
module DownstreamPackage

using GPUCompiler, CUDA

const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)

__init__() = GPUCompiler.ci_cache_insert(cache)

end
```
That would seem to work. Updating now.
Downstream packages probably should not serialize the entire cache snapshot, but rather do something like:
```julia
module DownstreamPackage

using GPUCompiler, CUDA

const cache = let
    cache_snapshot = GPUCompiler.ci_cache_snapshot()
    include("precompile.jl")
    GPUCompiler.ci_cache_delta(cache_snapshot)
end

__init__() = GPUCompiler.ci_cache_insert(cache)

end
```
But that doesn't change the actual API.
Changed the API to follow @maleadt's advice, which leads to a cleaner interface. Added an example kernel with caching at test/ExamplePersistentCache/GPUKernel.jl. Using this, you get a persistent cache, which reduces recompilation time on consecutive `using Package` calls when restarting Julia.
Remaining work is to test integration with downstream packages such as Enzyme, Oceananigans, CUDA, AMDGPU, etc. Additionally, there are potentially some algorithmic improvements to the merge algorithm to bring precompile times with and without this feature more in line.
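One hypothetical way to observe the effect, using the example package mentioned above (the entry-point name is a guess):

```julia
# first session: kernels compile and the cache delta is serialized
julia> using GPUKernel
julia> @time GPUKernel.main()   # hypothetical entry point; slow first call

# after restarting Julia, the reloaded cache entries should make the
# same first call noticeably faster
julia> using GPUKernel
julia> @time GPUKernel.main()
```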
cc @aviatesk, this may be relevant to DAECompiler (as a workaround, until we have the ability to update another module's globals, i.e., a ci cache).
We see a greater performance improvement when this is used during Enzyme.jl's precompilation phase: https://github.com/EnzymeAD/Enzyme.jl/pull/760
It also improves downstream CUDA code, e.g. the creation of two CuArrays and a vectorized add.
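For reference, that workload is roughly the following sketch, using standard CUDA.jl operations (array sizes are illustrative):

```julia
using CUDA

a = CuArray(rand(Float32, 1024))   # first CuArray
b = CuArray(rand(Float32, 1024))   # second CuArray
# vectorized add: broadcasting compiles a GPU kernel on first use,
# which is exactly where a persistent cache saves time
c = a .+ b
```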
