
After conversion to LLVM we should be able to delete the inferred source of the kernel.

Open · vchuravy opened this issue on Sep 20 '23 · 10 comments

@simonbyrne has shown me a heap snapshot where the inferred source took up >>1 GB of RAM.
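
For anyone reproducing this kind of measurement: on Julia 1.9+ such a snapshot can be captured with the stock profiler and loaded in Chrome's DevTools Memory tab.

using Profile

# writes a .heapsnapshot file; open it via
# Chrome DevTools -> Memory -> Load profile
Profile.take_heap_snapshot("gpucompiler.heapsnapshot")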

vchuravy avatar Sep 20 '23 21:09 vchuravy

Codecov Report

Patch coverage: 88.88% and project coverage change: -7.74%.

Comparison is base (edfdc1a) 83.18% compared to head (919242d) 75.44%. Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
- Coverage   83.18%   75.44%   -7.74%     
==========================================
  Files          24       24              
  Lines        3300     3270      -30     
==========================================
- Hits         2745     2467     -278     
- Misses        555      803     +248     
Files Changed      Coverage            Δ
src/jlgen.jl       77.85% <85.71%>     -2.07%
src/execution.jl   67.79% <100.00%>    -32.21%

... and 13 files with indirect coverage changes

View full report in Codecov by Sentry.

codecov[bot] avatar Sep 20 '23 22:09 codecov[bot]

This doesn't seem to fix my issue. I'm not sure exactly where the problem is, but I did notice:

julia> GPUCompiler.GLOBAL_CI_CACHES
Dict{CompilerConfig, GPUCompiler.CodeCache} with 2 entries:
  CompilerConfig for PTXCompilerTarget => CodeCache(IdDict{MethodInstance, Vector{CodeInstance}}(MethodInstance for >>(…
  CompilerConfig for PTXCompilerTarget => CodeCache(IdDict{MethodInstance, Vector{CodeInstance}}(MethodInstance for >>(…

julia> Base.summarysize(GPUCompiler.GLOBAL_CI_CACHES) / 10^6
1396.946174

julia> Base.summarysize(collect(values(GPUCompiler.GLOBAL_CI_CACHES))[1]) / 10^6
1393.855007

julia> Base.summarysize(collect(values(GPUCompiler.GLOBAL_CI_CACHES))[2]) / 10^6
3.090233

I tried manually calling empty! on this dict: it didn't seem to make any difference, so I suspect the data is being retained somewhere else as well.
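
Roughly, a sketch of that attempt (note that Sys.maxrss() reports the high-water mark, so top's RES is the number to watch afterwards):

# empty GPUCompiler's global cache dict and force a full collection
empty!(GPUCompiler.GLOBAL_CI_CACHES)
GC.gc(true)
# then compare RES in top; Sys.maxrss() will not go back down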

simonbyrne avatar Sep 21 '23 05:09 simonbyrne

Also, what's odd is that the RES reported by top is 6.3g, but

julia> Sys.maxrss() / 10^9
17.232601088

simonbyrne avatar Sep 21 '23 05:09 simonbyrne

Removed a call to jl_uncompress_ir; IIRC it was only needed for the 1.6 overlay hack (https://github.com/JuliaGPU/GPUCompiler.jl/pull/151#issuecomment-779687366). Maybe that also helps?

maleadt avatar Sep 21 '23 08:09 maleadt

Unfortunately still no.

simonbyrne avatar Sep 21 '23 17:09 simonbyrne

You could try taking a heap snapshot.

maleadt avatar Sep 21 '23 17:09 maleadt

I did that: it looks like most of it is still the inferred objects:

[screenshot of the heap snapshot (Sep 21 '23), dominated by inferred objects]

I tried clearing them out manually:

# walk every compiler cache and drop the inferred source;
# CodeInstance.inferred is an atomic field, so write it with @atomic
for cache in values(GPUCompiler.GLOBAL_CI_CACHES)
    for insts in values(cache.dict)  # MethodInstance => Vector{CodeInstance}
        for inst in insts
            @atomic :release inst.inferred = nothing
        end
    end
end

that seemed to work:

[screenshot of a second heap snapshot after clearing, with the inferred objects gone]

top is still reporting 4 GB of memory usage though, so I'm not sure what is going on.
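
One guess (an assumption, not verified here): the GC may have released the objects while the allocator is still holding on to the freed pages. On Linux with glibc, those can be handed back explicitly:

GC.gc(true)  # full collection first
# glibc-specific: return free heap pages to the OS
Sys.islinux() && ccall(:malloc_trim, Cint, (Csize_t,), 0)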

simonbyrne avatar Sep 21 '23 18:09 simonbyrne

So I am only deleting top-level kernel calls, since everything else is re-usable.
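
A sketch of what that could look like, given the CodeCache layout shown above (the helper name and the entry-point argument are hypothetical):

# clear the inferred source only for a kernel's entry point,
# keeping callee CodeInstances around for re-use
function drop_entry_inferred!(cache, entry_mi::Core.MethodInstance)
    for inst in get(cache.dict, entry_mi, Core.CodeInstance[])
        @atomic :release inst.inferred = nothing
    end
end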

vchuravy avatar Sep 21 '23 18:09 vchuravy

@maleadt are we tracking anywhere how big the modules we load onto the GPU are?

vchuravy avatar Sep 21 '23 18:09 vchuravy

> @maleadt are we tracking anywhere how big the modules we load onto the GPU are?

No, and I don't know of a way to query the size of a CuModule or CuContext.
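
A possible workaround (a sketch, not an existing CUDA.jl API) would be to record the image size at load time; that only measures the host-side image, so it is at best a lower bound on the device-side footprint:

using CUDA

# hypothetical bookkeeping; assumes CuModule accepts the compiled
# image bytes, as CUDA.jl's module loading does
const MODULE_SIZES = IdDict{CuModule,Int}()

function load_tracked(image::Vector{UInt8})
    mod = CuModule(image)
    MODULE_SIZES[mod] = sizeof(image)
    return mod
end

Note that keeping modules in a global dict pins them, so this is purely diagnostic.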

maleadt avatar Sep 21 '23 19:09 maleadt