
After conversion to LLVM we should be able to delete the inferred source of the kernel.

Open · vchuravy opened this issue on Sep 20 '23 · 10 comments

@simonbyrne has shown me a heap snapshot where the inferred source took up >>1 GB of RAM.
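
For anyone reproducing this kind of measurement: on Julia 1.9+ such a snapshot can be captured with the stock profiler and loaded in Chrome's DevTools Memory tab.

using Profile

# writes a .heapsnapshot file; open it via
# Chrome DevTools -> Memory -> Load profile
Profile.take_heap_snapshot("gpucompiler.heapsnapshot")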

vchuravy avatar Sep 20 '23 21:09 vchuravy

Codecov Report

Patch coverage: 88.88% and project coverage change: -7.74%.

Comparison is base (edfdc1a) 83.18% compared to head (919242d) 75.44%. Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
- Coverage   83.18%   75.44%   -7.74%     
==========================================
  Files          24       24              
  Lines        3300     3270      -30     
==========================================
- Hits         2745     2467     -278     
- Misses        555      803     +248     
Files Changed      Coverage            Δ
src/jlgen.jl       77.85% <85.71%>     -2.07%
src/execution.jl   67.79% <100.00%>    -32.21%

... and 13 files with indirect coverage changes

View full report in Codecov by Sentry.

codecov[bot] avatar Sep 20 '23 22:09 codecov[bot]

This doesn't seem to fix my issue. I'm not sure exactly where the problem is, but I did notice:

julia> GPUCompiler.GLOBAL_CI_CACHES
Dict{CompilerConfig, GPUCompiler.CodeCache} with 2 entries:
  CompilerConfig for PTXCompilerTarget => CodeCache(IdDict{MethodInstance, Vector{CodeInstance}}(MethodInstance for >>(…
  CompilerConfig for PTXCompilerTarget => CodeCache(IdDict{MethodInstance, Vector{CodeInstance}}(MethodInstance for >>(…

julia> Base.summarysize(GPUCompiler.GLOBAL_CI_CACHES) / 10^6
1396.946174

julia> Base.summarysize(collect(values(GPUCompiler.GLOBAL_CI_CACHES))[1]) / 10^6
1393.855007

julia> Base.summarysize(collect(values(GPUCompiler.GLOBAL_CI_CACHES))[2]) / 10^6
3.090233

I tried manually calling empty! on this dict: it didn't seem to make any difference, so I suspect the data is being retained somewhere else as well.
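
Roughly, a sketch of that attempt (note that Sys.maxrss() reports the high-water mark, so top's RES is the number to watch afterwards):

# empty GPUCompiler's global cache dict and force a full collection
empty!(GPUCompiler.GLOBAL_CI_CACHES)
GC.gc(true)
# then compare RES in top; Sys.maxrss() will not go back down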

simonbyrne avatar Sep 21 '23 05:09 simonbyrne

Also, what's odd is that the RES reported by top is 6.3g, but

julia> Sys.maxrss() / 10^9
17.232601088

simonbyrne avatar Sep 21 '23 05:09 simonbyrne

Removed a call to jl_uncompress_ir; IIRC it was only needed for the 1.6 overlay hack (https://github.com/JuliaGPU/GPUCompiler.jl/pull/151#issuecomment-779687366). Maybe that also helps?

maleadt avatar Sep 21 '23 08:09 maleadt

Unfortunately still no.

simonbyrne avatar Sep 21 '23 17:09 simonbyrne

You could try taking a heap snapshot.

maleadt avatar Sep 21 '23 17:09 maleadt

I did that: it looks like most of it is still the inferred objects:

[screenshot of the heap snapshot (Sep 21 '23), dominated by inferred objects]

I tried clearing them out manually:

# walk every compiler cache and drop the inferred source;
# CodeInstance.inferred is an atomic field, so write it with @atomic
for cache in values(GPUCompiler.GLOBAL_CI_CACHES)
    for insts in values(cache.dict)  # MethodInstance => Vector{CodeInstance}
        for inst in insts
            @atomic :release inst.inferred = nothing
        end
    end
end

that seemed to work:

[screenshot of a second heap snapshot after clearing, with the inferred objects gone]

top is still reporting 4 GB of memory usage though, so I'm not sure what is going on.
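
One guess (an assumption, not verified here): the GC may have released the objects while the allocator is still holding on to the freed pages. On Linux with glibc, those can be handed back explicitly:

GC.gc(true)  # full collection first
# glibc-specific: return free heap pages to the OS
Sys.islinux() && ccall(:malloc_trim, Cint, (Csize_t,), 0)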

simonbyrne avatar Sep 21 '23 18:09 simonbyrne

So I am only deleting top-level kernel calls, since everything else is re-usable.
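
A sketch of what that could look like, given the CodeCache layout shown above (the helper name and the entry-point argument are hypothetical):

# clear the inferred source only for a kernel's entry point,
# keeping callee CodeInstances around for re-use
function drop_entry_inferred!(cache, entry_mi::Core.MethodInstance)
    for inst in get(cache.dict, entry_mi, Core.CodeInstance[])
        @atomic :release inst.inferred = nothing
    end
end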

vchuravy avatar Sep 21 '23 18:09 vchuravy

@maleadt are we tracking anywhere how big the modules we load onto the GPU are?

vchuravy avatar Sep 21 '23 18:09 vchuravy

> @maleadt are we tracking anywhere how big the modules we load onto the GPU are?

No, and I don't know of a way to query the size of a CuModule or CuContext.
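
A possible workaround (a sketch, not an existing CUDA.jl API) would be to record the image size at load time; that only measures the host-side image, so it is at best a lower bound on the device-side footprint:

using CUDA

# hypothetical bookkeeping; assumes CuModule accepts the compiled
# image bytes, as CUDA.jl's module loading does
const MODULE_SIZES = IdDict{CuModule,Int}()

function load_tracked(image::Vector{UInt8})
    mod = CuModule(image)
    MODULE_SIZES[mod] = sizeof(image)
    return mod
end

Note that keeping modules in a global dict pins them, so this is purely diagnostic.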

maleadt avatar Sep 21 '23 19:09 maleadt