Add disk cache infrastructure back with tests
Uses Preferences.jl instead of environment variables, and splits the cache on a user-defined key, the GPUCompiler version, and the Julia version.
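For illustration, a minimal sketch of how a downstream project could opt in via Preferences.jl; the `"disk_cache"` preference name here is an assumption and may not match what this PR actually reads:

```julia
# Hypothetical sketch: enable GPUCompiler's on-disk kernel cache from a downstream
# project via Preferences.jl. The "disk_cache" preference name is an assumption
# for illustration only.
using Preferences, GPUCompiler

set_preferences!(GPUCompiler, "disk_cache" => "true"; force = true)
# Preferences are read when the package (pre)compiles, so restart Julia afterwards.
```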
Codecov Report

Patch coverage has no change and project coverage change: -85.86% :warning:

Comparison is base (bec672c) 85.85% compared to head (051e795) 0.00%.

Additional details and impacted files

| Coverage Diff | master | #351 | +/- |
|---|---|---|---|
| Coverage | 85.85% | 0.00% | -85.86% |
| Files | 24 | 24 | |
| Lines | 2871 | 2680 | -191 |
| Hits | 2465 | 0 | -2465 |
| Misses | 406 | 2680 | +2274 |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GPUCompiler.jl | 0.00% <ø> (-100.00%) | :arrow_down: |
| src/cache.jl | 0.00% <0.00%> (-95.32%) | :arrow_down: |

... and 22 files with indirect coverage changes
Without caching:
vchuravy@odin ~/s/s/j/GemmDenseCUDA (main)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.670872 seconds (328.92 k allocations: 17.177 MiB, 80.62% compilation time)
Time to allocate B 0.001136 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003191 seconds (638 allocations: 37.242 KiB, 66.78% compilation time)
Time to fill A 0.114808 seconds (4.73 k allocations: 260.202 KiB, 20.44% gc time, 62.84% compilation time)
Time to fill B 0.000006 seconds
Time to simple gemm 14.005771 seconds (14.90 M allocations: 784.978 MiB, 2.13% gc time, 21.18% compilation time)
First run (with disk caching, cold cache):
vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.706839 seconds (328.92 k allocations: 17.177 MiB, 80.50% compilation time)
Time to allocate B 0.001365 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003525 seconds (638 allocations: 37.242 KiB, 67.51% compilation time)
Time to fill A 0.130957 seconds (4.73 k allocations: 260.202 KiB, 22.79% gc time, 59.73% compilation time)
Time to fill B 0.000006 seconds
Time to simple gemm 18.979182 seconds (19.35 M allocations: 1008.772 MiB, 2.35% gc time, 17.06% compilation time)
Second run (hitting the cache):
vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim) [SIGINT]> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.654325 seconds (328.92 k allocations: 17.177 MiB, 80.73% compilation time)
Time to allocate B 0.001132 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003681 seconds (638 allocations: 37.242 KiB, 65.31% compilation time)
Time to fill A 0.108716 seconds (4.73 k allocations: 260.202 KiB, 27.39% gc time, 56.61% compilation time)
Time to fill B 0.000004 seconds
Time to simple gemm 3.616108 seconds (722.24 k allocations: 45.187 MiB, 0.60% gc time, 24.34% compilation time)
In discussion with @williamfgc: maybe we shouldn't make the cache_key static, so that an application can set it at startup? I would most likely put in the git hash of the application.
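A rough sketch of that idea, assuming the key ends up exposed as a Preferences.jl entry (the `"cache_key"` name is an assumption for illustration):

```julia
# Hypothetical: use the application's git revision as GPUCompiler's user-defined
# cache key, so every application revision gets its own slice of the disk cache.
using Preferences, GPUCompiler

app_rev = readchomp(`git rev-parse --short HEAD`)   # run from the app's checkout
set_preferences!(GPUCompiler, "cache_key" => app_rev; force = true)
# Depending on how the preference is read, a fresh Julia session may be needed.
```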
What causes the 5s regression going from 'without cache' to 'first run'?
We were discussing with @jpsamaroo... I'm not sure if this is already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so they always override the cache, e.g. through a Preferences.jl `always_overwrite_kernels` list or an optional argument to `@kernel`, etc.
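Purely to illustrate the shape of that suggestion (none of these names exist in GPUCompiler or KernelAbstractions, and the reply below considers this hard to do), such an override could conceptually look like:

```julia
# Hypothetical sketch only: a developer-maintained list of kernel names whose
# disk-cache entries are always ignored, read from a Preferences.jl entry.
# `always_overwrite_kernels` and the package argument are made up for illustration.
using Preferences

overwrite_list(pkg::Module) = load_preference(pkg, "always_overwrite_kernels", String[])

# Inside a disk-cache lookup, one could then skip the stored entry:
skip_disk_cache(pkg::Module, kernel_name::AbstractString) =
    kernel_name in overwrite_list(pkg)
```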
I think that would be rather hard to do. This is still a stop-gap towards proper precompilation caching support.
On Julia 1.9 and current CUDA#master, first compilation with no disk cache already got a lot faster.
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.080495 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001020 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001061 seconds (7 allocations: 256 bytes)
Time to fill A 0.079274 seconds (3.64 k allocations: 192.344 KiB, 16.84% gc time)
Time to fill B 0.000005 seconds
Time to simple gemm 7.802547 seconds (8.92 M allocations: 546.678 MiB, 1.71% gc time, 0.39% compilation time)
Time to simple gemm 2.620980927
Time to simple gemm 2.634474094
Time to simple gemm 2.648787405
Time to simple gemm 2.669124524
GFLOPS: 756.618023173782 steps: 5 average_time: 2.6433417375
Time to total time 18.620834 seconds (8.97 M allocations: 549.802 MiB, 0.79% gc time, 0.16% compilation time)
Now first run with caching:
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.083496 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001083 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001120 seconds (7 allocations: 256 bytes)
Time to fill A 0.084755 seconds (3.64 k allocations: 192.344 KiB, 20.16% gc time)
Time to fill B 0.000006 seconds
Time to simple gemm 8.316279 seconds (9.18 M allocations: 564.005 MiB, 1.53% gc time, 0.36% compilation time)
Time to simple gemm 2.621605666
Time to simple gemm 2.644468266
Time to simple gemm 2.656315144
Time to simple gemm 2.670673464
GFLOPS: 755.2112497959444 steps: 5 average_time: 2.648265635
Time to total time 19.164910 seconds (9.22 M allocations: 567.129 MiB, 0.75% gc time, 0.16% compilation time)
Second run hitting the cache:
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.083945 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001022 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001109 seconds (7 allocations: 256 bytes)
Time to fill A 0.081859 seconds (3.64 k allocations: 192.344 KiB, 20.37% gc time)
Time to fill B 0.000006 seconds
Time to simple gemm 3.225041 seconds (176.45 k allocations: 12.828 MiB, 0.90% compilation time)
Time to simple gemm 2.683764144
Time to simple gemm 2.694396815
Time to simple gemm 2.714404264
Time to simple gemm 2.725327305
GFLOPS: 739.5155737860738 steps: 5 average_time: 2.7044731320000004
Time to total time 14.291853 seconds (221.96 k allocations: 15.949 MiB, 0.12% gc time, 0.20% compilation time)
So: 7.802547 seconds to 8.316279 seconds to 3.225041 seconds. Subtracting out the baseline gemm cost of ~2.6 s, that is 5.2 s without caching, 5.7 s with a cold cache, and 0.6 s with a hot cache.
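Spelled out, using the first-gemm timings above and the ~2.6 s steady-state gemm time as the baseline:

```julia
baseline = 2.6                            # steady-state time per gemm step (s)
round(7.802547 - baseline, digits = 1)    # 5.2  -> no disk cache
round(8.316279 - baseline, digits = 1)    # 5.7  -> cold disk cache
round(3.225041 - baseline, digits = 1)    # 0.6  -> hot disk cache
```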
On an Oceananigans test case, the time spent in cufunction during setup went from 150 s to 1.5 s.
ERROR: LoadError: CUDA error: named symbol not found (code 500, ERROR_NOT_FOUND)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:27
[2] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:35 [inlined]
[3] cuModuleGetFunction(hfunc::Base.RefValue{Ptr{CUDA.CUfunc_st}}, hmod::CUDA.CuModule, name::String)
@ CUDA ~/.julia/packages/CUDA/N71Iw/lib/utils/call.jl:26
[4] CuFunction
@ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/module/function.jl:19 [inlined]
[5] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
@ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/compilation.jl:235
[6] (::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})()
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:250
[7] lock(f::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, l::ReentrantLock)
@ Base ./lock.jl:229
[8] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, world::UInt64, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:247
[9] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:200
[10] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:310 [inlined]
[11] macro expansion
@ ./lock.jl:267 [inlined]
[12] cufunction(f::typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.StaticSize{(136, 57, 102)}, KernelAbstractions.NDIteration.StaticSize{(16, 16, 1)}, Nothing, Nothing}}, NamedTuple{(:κ, :ν, :Ri), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Tuple{Int64, Int64, Int64}, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Nothing}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Nothing}, RiBasedVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Float64, Oceananigans.TurbulenceClosures.HyperbolicTangentRiDependentTapering}, NamedTuple{(:u, :v, :w), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, NamedTuple{(:T, :S), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Buoyancy{SeawaterBuoyancy{Float64, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float64}, Float64}, Nothing, Nothing}, Oceananigans.Grids.ZDirection}, NamedTuple{(:T, :S), Tuple{BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{Float64, typeof(OceanScalingTests.T_relaxation)}}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{NTuple{4, NTuple{4, Float64}}, typeof(OceanScalingTests.surface_salinity_flux)}}}}, NamedTuple{(:time, :iteration, :stage), Tuple{Float64, Int64, Int64}}}}; kwargs::Base.Pairs{Symbol, Integer, Tuple{Symbol, Symbol}, NamedTuple{(:always_inline, :maxthreads), Tuple{Bool, Int64}}})
@ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:306
[13] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:104 [inlined]
[14] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!)})(::NamedTuple{(:κ, :ν, :Ri), Tuple{Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, 
CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}}}, Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}
Found by @simone-silvestri when running with a large number of nodes and a shared filesystem.
That CuFunction look-up constructor should probably do its own error handling (i.e., call unsafe_cuModuleGetFunction and print the requested function; sadly I don't think we can list the available ones).
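A rough sketch of that suggestion; `unsafe_cuModuleGetFunction` is the unchecked wrapper mentioned above, and the other names mirror the stack trace, so the exact signatures may differ from CUDA.jl's actual internals:

```julia
# Hypothetical sketch: name-aware error handling for the kernel lookup, so a
# stale disk-cache image produces an actionable message instead of a bare
# ERROR_NOT_FOUND. Names follow the stack trace above and may not match exactly.
using CUDA

function lookup_kernel(mod::CUDA.CuModule, name::String)
    handle = Ref{Ptr{CUDA.CUfunc_st}}()
    res = CUDA.unsafe_cuModuleGetFunction(handle, mod, name)
    if res == CUDA.CUDA_ERROR_NOT_FOUND
        error("kernel symbol `$name` not found in the loaded module; " *
              "the cached image may be stale or from another build")
    elseif res != CUDA.CUDA_SUCCESS
        CUDA.throw_api_error(res)
    end
    return handle[]   # raw CUfunction handle
end
```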
Replaced by #557