Add disk cache infrastructure back with tests
Uses Preferences.jl instead of environment variables, and splits the cache on a user-defined key, the GPUCompiler version, and the Julia version.
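For illustration, a minimal sketch of how a downstream project could opt in via Preferences.jl; the `"disk_cache"` preference name here is an assumption and may not match what this PR actually reads:

```julia
# Hypothetical sketch: enable GPUCompiler's on-disk kernel cache from a downstream
# project via Preferences.jl. The "disk_cache" preference name is an assumption
# for illustration only.
using Preferences, GPUCompiler

set_preferences!(GPUCompiler, "disk_cache" => "true"; force = true)
# Preferences are read when the package (pre)compiles, so restart Julia afterwards.
```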
Codecov Report

Patch coverage has no change and project coverage change: -85.86% :warning:

Comparison is base (bec672c) 85.85% compared to head (051e795) 0.00%.

Additional details and impacted files

| Coverage Diff | master | #351 | +/- |
|---|---|---|---|
| Coverage | 85.85% | 0.00% | -85.86% |
| Files | 24 | 24 | |
| Lines | 2871 | 2680 | -191 |
| Hits | 2465 | 0 | -2465 |
| Misses | 406 | 2680 | +2274 |

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/GPUCompiler.jl | 0.00% <ø> (-100.00%) | :arrow_down: |
| src/cache.jl | 0.00% <0.00%> (-95.32%) | :arrow_down: |

... and 22 files with indirect coverage changes
Without caching:
vchuravy@odin ~/s/s/j/GemmDenseCUDA (main)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.670872 seconds (328.92 k allocations: 17.177 MiB, 80.62% compilation time)
Time to allocate B 0.001136 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003191 seconds (638 allocations: 37.242 KiB, 66.78% compilation time)
Time to fill A 0.114808 seconds (4.73 k allocations: 260.202 KiB, 20.44% gc time, 62.84% compilation time)
Time to fill B 0.000006 seconds
Time to simple gemm 14.005771 seconds (14.90 M allocations: 784.978 MiB, 2.13% gc time, 21.18% compilation time)
First run (with disk caching, cold cache):
vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.706839 seconds (328.92 k allocations: 17.177 MiB, 80.50% compilation time)
Time to allocate B 0.001365 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003525 seconds (638 allocations: 37.242 KiB, 67.51% compilation time)
Time to fill A 0.130957 seconds (4.73 k allocations: 260.202 KiB, 22.79% gc time, 59.73% compilation time)
Time to fill B 0.000006 seconds
Time to simple gemm 18.979182 seconds (19.35 M allocations: 1008.772 MiB, 2.35% gc time, 17.06% compilation time)
Second run (hitting the cache):
vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim) [SIGINT]> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.654325 seconds (328.92 k allocations: 17.177 MiB, 80.73% compilation time)
Time to allocate B 0.001132 seconds (5 allocations: 176 bytes)
Time to initialize C 0.003681 seconds (638 allocations: 37.242 KiB, 65.31% compilation time)
Time to fill A 0.108716 seconds (4.73 k allocations: 260.202 KiB, 27.39% gc time, 56.61% compilation time)
Time to fill B 0.000004 seconds
Time to simple gemm 3.616108 seconds (722.24 k allocations: 45.187 MiB, 0.60% gc time, 24.34% compilation time)
In discussion with @williamfgc: maybe we shouldn't make the cache_key static, so that an application can set it at startup? I would most likely put in the git hash of the application.
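A rough sketch of that idea, assuming the key ends up exposed as a Preferences.jl entry (the `"cache_key"` name is an assumption for illustration):

```julia
# Hypothetical: use the application's git revision as GPUCompiler's user-defined
# cache key, so every application revision gets its own slice of the disk cache.
using Preferences, GPUCompiler

app_rev = readchomp(`git rev-parse --short HEAD`)   # run from the app's checkout
set_preferences!(GPUCompiler, "cache_key" => app_rev; force = true)
# Depending on how the preference is read, a fresh Julia session may be needed.
```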
What causes the 5s regression going from 'without cache' to 'first run'?
We were discussing with @jpsamaroo... I'm not sure if this is already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so they always override the cache, e.g. through a Preferences.jl `always_overwrite_kernels` list or an optional argument to `@kernel`, etc.
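Purely to illustrate the shape of that suggestion (none of these names exist in GPUCompiler or KernelAbstractions, and the reply below considers this hard to do), such an override could conceptually look like:

```julia
# Hypothetical sketch only: a developer-maintained list of kernel names whose
# disk-cache entries are always ignored, read from a Preferences.jl entry.
# `always_overwrite_kernels` and the package argument are made up for illustration.
using Preferences

overwrite_list(pkg::Module) = load_preference(pkg, "always_overwrite_kernels", String[])

# Inside a disk-cache lookup, one could then skip the stored entry:
skip_disk_cache(pkg::Module, kernel_name::AbstractString) =
    kernel_name in overwrite_list(pkg)
```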
I think that would be rather hard to do. This is still a stop-gap towards proper precompilation caching support.
On Julia 1.9 and current CUDA#master, first compilation with no disk cache already got a lot faster.
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.080495 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001020 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001061 seconds (7 allocations: 256 bytes)
Time to fill A 0.079274 seconds (3.64 k allocations: 192.344 KiB, 16.84% gc time)
Time to fill B 0.000005 seconds
Time to simple gemm 7.802547 seconds (8.92 M allocations: 546.678 MiB, 1.71% gc time, 0.39% compilation time)
Time to simple gemm 2.620980927
Time to simple gemm 2.634474094
Time to simple gemm 2.648787405
Time to simple gemm 2.669124524
GFLOPS: 756.618023173782 steps: 5 average_time: 2.6433417375
Time to total time 18.620834 seconds (8.97 M allocations: 549.802 MiB, 0.79% gc time, 0.16% compilation time)
Now first run with caching:
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.083496 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001083 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001120 seconds (7 allocations: 256 bytes)
Time to fill A 0.084755 seconds (3.64 k allocations: 192.344 KiB, 20.16% gc time)
Time to fill B 0.000006 seconds
Time to simple gemm 8.316279 seconds (9.18 M allocations: 564.005 MiB, 1.53% gc time, 0.36% compilation time)
Time to simple gemm 2.621605666
Time to simple gemm 2.644468266
Time to simple gemm 2.656315144
Time to simple gemm 2.670673464
GFLOPS: 755.2112497959444 steps: 5 average_time: 2.648265635
Time to total time 19.164910 seconds (9.22 M allocations: 567.129 MiB, 0.75% gc time, 0.16% compilation time)
Second run hitting the cache:
args = ["10000", "10000", "10000", "5"]
Time to allocate A 0.083945 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B 0.001022 seconds (7 allocations: 256 bytes)
Time to initialize C 0.001109 seconds (7 allocations: 256 bytes)
Time to fill A 0.081859 seconds (3.64 k allocations: 192.344 KiB, 20.37% gc time)
Time to fill B 0.000006 seconds
Time to simple gemm 3.225041 seconds (176.45 k allocations: 12.828 MiB, 0.90% compilation time)
Time to simple gemm 2.683764144
Time to simple gemm 2.694396815
Time to simple gemm 2.714404264
Time to simple gemm 2.725327305
GFLOPS: 739.5155737860738 steps: 5 average_time: 2.7044731320000004
Time to total time 14.291853 seconds (221.96 k allocations: 15.949 MiB, 0.12% gc time, 0.20% compilation time)
So: 7.802547 seconds to 8.316279 seconds to 3.225041 seconds. Subtracting out the baseline gemm cost of ~2.6 s, that is 5.2 s without caching, 5.7 s with a cold cache, and 0.6 s with a hot cache.
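Spelled out, using the first-gemm timings above and the ~2.6 s steady-state gemm time as the baseline:

```julia
baseline = 2.6                            # steady-state time per gemm step (s)
round(7.802547 - baseline, digits = 1)    # 5.2  -> no disk cache
round(8.316279 - baseline, digits = 1)    # 5.7  -> cold disk cache
round(3.225041 - baseline, digits = 1)    # 0.6  -> hot disk cache
```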
On an Oceananigans test case, the time spent in cufunction during setup went from 150 s to 1.5 s.
ERROR: LoadError: CUDA error: named symbol not found (code 500, ERROR_NOT_FOUND)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:27
[2] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:35 [inlined]
[3] cuModuleGetFunction(hfunc::Base.RefValue{Ptr{CUDA.CUfunc_st}}, hmod::CUDA.CuModule, name::String)
@ CUDA ~/.julia/packages/CUDA/N71Iw/lib/utils/call.jl:26
[4] CuFunction
@ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/module/function.jl:19 [inlined]
[5] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
@ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/compilation.jl:235
[6] (::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})()
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:250
[7] lock(f::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, l::ReentrantLock)
@ Base ./lock.jl:229
[8] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, world::UInt64, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:247
[9] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:200
[10] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:310 [inlined]
[11] macro expansion
@ ./lock.jl:267 [inlined]
[12] cufunction(f::typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.StaticSize{(136, 57, 102)}, KernelAbstractions.NDIteration.StaticSize{(16, 16, 1)}, Nothing, Nothing}}, NamedTuple{(:κ, :ν, :Ri), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Tuple{Int64, Int64, Int64}, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Nothing}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Nothing}, RiBasedVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Float64, Oceananigans.TurbulenceClosures.HyperbolicTangentRiDependentTapering}, NamedTuple{(:u, :v, :w), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, NamedTuple{(:T, :S), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Buoyancy{SeawaterBuoyancy{Float64, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float64}, Float64}, Nothing, Nothing}, Oceananigans.Grids.ZDirection}, NamedTuple{(:T, :S), Tuple{BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{Float64, typeof(OceanScalingTests.T_relaxation)}}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{NTuple{4, NTuple{4, Float64}}, typeof(OceanScalingTests.surface_salinity_flux)}}}}, NamedTuple{(:time, :iteration, :stage), Tuple{Float64, Int64, Int64}}}}; kwargs::Base.Pairs{Symbol, Integer, Tuple{Symbol, Symbol}, NamedTuple{(:always_inline, :maxthreads), Tuple{Bool, Int64}}})
@ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:306
[13] macro expansion
@ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:104 [inlined]
[14] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!)})(::NamedTuple{(:κ, :ν, :Ri), Tuple{Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, 
CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}}}, Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}
Found by @simone-silvestri when running with a large number of nodes and a shared filesystem.
That CuFunction look-up constructor should probably do its own error handling (i.e., call unsafe_cuModuleGetFunction and print the requested function; sadly I don't think we can list the available ones).
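A rough sketch of that suggestion; `unsafe_cuModuleGetFunction` is the unchecked wrapper mentioned above, and the other names mirror the stack trace, so the exact signatures may differ from CUDA.jl's actual internals:

```julia
# Hypothetical sketch: name-aware error handling for the kernel lookup, so a
# stale disk-cache image produces an actionable message instead of a bare
# ERROR_NOT_FOUND. Names follow the stack trace above and may not match exactly.
using CUDA

function lookup_kernel(mod::CUDA.CuModule, name::String)
    handle = Ref{Ptr{CUDA.CUfunc_st}}()
    res = CUDA.unsafe_cuModuleGetFunction(handle, mod, name)
    if res == CUDA.CUDA_ERROR_NOT_FOUND
        error("kernel symbol `$name` not found in the loaded module; " *
              "the cached image may be stale or from another build")
    elseif res != CUDA.CUDA_SUCCESS
        CUDA.throw_api_error(res)
    end
    return handle[]   # raw CUfunction handle
end
```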
Replaced by #557