
Cannot reclaim GPU Memory; CUDA.reclaim()

jackn11 opened this issue on Jul 12 '22

When I set all GPU variables to nothing and call CUDA.reclaim(), my GPU memory remains full (does not go back to initial usage).

Currently the model being loaded onto the GPU is a BERT model from Transformers.jl; it is moved to the GPU only when training or testing and offloaded back to the CPU when not in use.

All the code to create the BERT models lives in a module called BERTModule, which has no global variables. I create a few BERT models by calling functions from BERTModule and store them in global variables at the main module's global scope. Training and predicting with each of the models quickly increases my GPU memory usage. When I then set all of those global variables to nothing and call CUDA.reclaim(), GPU memory usage either drops by a few tens or hundreds of MBs or not at all, nowhere close to its initial value.
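Roughly, the release pattern I am attempting looks like the sketch below (the variable names are placeholders for my actual globals, and I've added an explicit GC.gc(true), since as I understand it reclaim() can only return memory whose CuArrays the GC has already finalized):

```julia
using CUDA

# Placeholders for the main-module globals that held GPU-resident models.
bert_model_1 = nothing
bert_model_2 = nothing

GC.gc(true)      # full collection, so dead CuArrays get finalized
CUDA.reclaim()   # ask the pool to return cached memory to the driver

CUDA.memory_status()  # compare driver-level usage against pool usage
```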

Furthermore, when training the many BERT models sequentially, I ran out of memory while training one of them. I then called CUDA.reclaim(), which reclaimed only a small amount of GPU memory, yet when I tried training the model again it worked. The REPL output for this case is below: I train the model (the training data and its size remain constant), the GPU runs out of memory, but after I call CUDA.reclaim() the same training call succeeds.

These appear to be bugs: CUDA.reclaim() should not need to be called explicitly, and after setting all variables to nothing and calling reclaim(), the expected behaviour is for GPU memory usage to drop back to its resting level.

If it is relevant, I am currently using an NVIDIA GTX 1050 Ti.


julia> train_func(training_dict, bert_model)

[ Info: start training
[ Info: epoch: 1
┌ Info: training
│   loss = 0.79379857f0
└   accuracy = 0.3870967741935484
[ Info: epoch: 2
[ Info: epoch: 3
ERROR: Out of GPU memory trying to allocate 89.420 MiB
Effective GPU memory usage: 100.00% (4.000 GiB/4.000 GiB)
Memory pool usage: 1.631 GiB (3.312 GiB reserved)
Stacktrace:
  [1] macro expansion
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:320 [inlined]
  [2] macro expansion
    @ .\timing.jl:299 [inlined]
  [3] #_alloc#170
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:313 [inlined]
  [4] #alloc#169
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:299 [inlined]
  [5] alloc
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\pool.jl:295 [inlined]
  [6] CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64})
    @ CUDA C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\array.jl:42
  [7] similar
    @ C:\Users\jackn\.julia\packages\CUDA\tTK8Y\src\array.jl:166 [inlined]
  [8] similar
    @ .\abstractarray.jl:782 [inlined]
  [9] restructure(x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ ArrayInterfaceCore C:\Users\jackn\.julia\packages\ArrayInterfaceCore\nBDUl\src\ArrayInterfaceCore.jl:446
 [10] update!(opt::Flux.Optimise.Adam, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x̄::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux.Optimise C:\Users\jackn\.julia\packages\Flux\KkC79\src\optimise\train.jl:16
 [11] update!(opt::Flux.Optimise.Adam, xs::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, gs::Zygote.Grads)
    @ Flux.Optimise C:\Users\jackn\.julia\packages\Flux\KkC79\src\optimise\train.jl:24
 [12] train(bert_model::Transformers.Basic.TransformerModel{Transformers.Basic.CompositeEmbedding{Float32, NamedTuple{(:tok, :segment, :pe), Tuple{Transformers.Basic.Embed{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Transformers.Basic.Embed{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, Transformers.Basic.PositionEmbedding{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}}, NamedTuple{(:tok, :segment, :pe), Tuple{typeof(+), typeof(+), typeof(+)}}, Transformers.Basic.Positionwise{Tuple{Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Flux.Dropout{Float64, Colon, CUDA.RNG}}}}, Transformers.BidirectionalEncoder.Bert{Transformers.Stacks.Stack{Symbol("((x, m) => x':(x, m)) => 12"), NTuple{12, Transformers.Basic.Transformer{Transformers.Basic.MultiheadAttention{Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dropout{Float64, Colon, CUDA.RNG}}, Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Transformers.Basic.PwFFN{Flux.Dense{typeof(NNlib.gelu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Flux.LayerNorm{typeof(identity), Flux.Scale{typeof(identity), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Float32, 1}, Flux.Dropout{Float64, Colon, CUDA.RNG}}}}, 
Flux.Dropout{Float64, Colon, CUDA.RNG}}, NamedTuple{(:pooler, :clf), Tuple{Flux.Dense{typeof(tanh), 
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Chain{Tuple{Flux.Dropout{Float64, Colon, CUDA.RNG}, Flux.Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(NNlib.logsoftmax)}}}}}, bertenc::Transformers.BidirectionalEncoder.BertTextEncoder{Transformers.Basic.TextTokenizer{Transformers.BidirectionalEncoder.WordPieceTokenization{Transformers.BidirectionalEncoder.BertUnCasedPreTokenization}}, TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{30522, String, Vector{String}}}, FuncPipelines.Pipelines{Tuple{FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{1, Base.Fix1{typeof(TextEncodeBase.nestedcall), typeof(Transformers.Basic.string_getvalue)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(Transformers.BidirectionalEncoder.with_firsthead_tail), Tuple{String, String}}}}}, FuncPipelines.Pipeline{(:tok, :segment), FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, typeof(Transformers.BidirectionalEncoder.segment_and_concat)}}}, FuncPipelines.Pipeline{:trunc_tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, String}}}}}, FuncPipelines.Pipeline{:trunc_len, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, 
typeof(TextEncodeBase.nestedmaxlength)}}}, FuncPipelines.Pipeline{:mask, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :trunc_len), typeof(Transformers.Basic.getmask)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nested2batch)}}}, FuncPipelines.Pipeline{:segment, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:segment, 
ComposedFunction{typeof(TextEncodeBase.nested2batch), FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, Int64}}}}}}, FuncPipelines.Pipeline{:input, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :segment), ComposedFunction{Type{NamedTuple{(:tok, :segment)}}, typeof(tuple)}}}}, FuncPipelines.PipeGet{(:input, :mask)}}}}, labels::TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{2, String, Vector{String}}}, training_dict::Dict{Int64, Vector{String}})        
    @ Main.Berts c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:100
 [13] (::Main.Berts.var"#generate_trainer#8"{TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{2, String, Vector{String}}}, Transformers.BidirectionalEncoder.BertTextEncoder{Transformers.Basic.TextTokenizer{Transformers.BidirectionalEncoder.WordPieceTokenization{Transformers.BidirectionalEncoder.BertUnCasedPreTokenization}}, TextEncodeBase.Vocab{String, StaticArraysCore.SizedVector{30522, 
String, Vector{String}}}, FuncPipelines.Pipelines{Tuple{FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{1, Base.Fix1{typeof(TextEncodeBase.nestedcall), typeof(Transformers.Basic.string_getvalue)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(Transformers.BidirectionalEncoder.with_firsthead_tail), Tuple{String, String}}}}}, FuncPipelines.Pipeline{(:tok, :segment), FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, typeof(Transformers.BidirectionalEncoder.segment_and_concat)}}}, FuncPipelines.Pipeline{:trunc_tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:tok, FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, String}}}}}, FuncPipelines.Pipeline{:trunc_len, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nestedmaxlength)}}}, FuncPipelines.Pipeline{:mask, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :trunc_len), typeof(Transformers.Basic.getmask)}}}, FuncPipelines.Pipeline{:tok, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:trunc_tok, typeof(TextEncodeBase.nested2batch)}}}, FuncPipelines.Pipeline{:segment, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{:segment, ComposedFunction{typeof(TextEncodeBase.nested2batch), FuncPipelines.FixRest{typeof(TextEncodeBase.trunc_and_pad), Tuple{Nothing, Int64}}}}}}, FuncPipelines.Pipeline{:input, FuncPipelines.ApplyN{2, FuncPipelines.ApplySyms{(:tok, :segment), ComposedFunction{Type{NamedTuple{(:tok, :segment)}}, typeof(tuple)}}}}, FuncPipelines.PipeGet{(:input, :mask)}}}}})(train_dict::Dict{Int64, Vector{String}}, bert_model_container::Main.Berts.BertModelContainer)
    @ Main.Berts c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:291
 [14] top-level scope
    @ c:\Users\jackn\Documents\GitHub\GitHub2\Chat\NewBerts6.jl:327

julia> CUDA.reclaim()

julia> train_func(training_dict, bert_model)

[ Info: start training
[ Info: epoch: 1
┌ Info: training
│   loss = 0.78859687f0
└   accuracy = 0.41935483870967744
[ Info: epoch: 2
[ Info: epoch: 3
[ Info: epoch: 4
[ Info: epoch: 5
[ Info: epoch: 6
[ Info: epoch: 7
[ Info: epoch: 8
[ Info: epoch: 9
[ Info: epoch: 10
[ Info: epoch: 11
[ Info: epoch: 12
[ Info: epoch: 13
[ Info: epoch: 14
[ Info: epoch: 15
[ Info: epoch: 16
[ Info: epoch: 17
┌ Info: training
│   loss = 0.6262033f0
└   accuracy = 0.6451612903225806
[ Info: epoch: 18
[ Info: epoch: 19
[ Info: epoch: 20
[ Info: epoch: 21
[ Info: epoch: 22
[ Info: epoch: 23
[ Info: epoch: 24
[ Info: epoch: 25
[ Info: epoch: 26
[ Info: epoch: 27
[ Info: epoch: 28
[ Info: epoch: 29
[ Info: epoch: 30
[ Info: epoch: 31
[ Info: epoch: 32
[ Info: epoch: 33
┌ Info: training
│   loss = 0.63867134f0
└   accuracy = 0.6129032258064516
[ Info: epoch: 34
[ Info: epoch: 35
[ Info: epoch: 36
[ Info: epoch: 37
[ Info: epoch: 38
[ Info: epoch: 39
[ Info: epoch: 40
[ Info: epoch: 41
[ Info: epoch: 42
[ Info: epoch: 43
[ Info: epoch: 44
[ Info: epoch: 45
[ Info: epoch: 46
[ Info: epoch: 47
[ Info: epoch: 48
[ Info: epoch: 49
┌ Info: training
│   loss = 0.47331774f0
└   accuracy = 0.9354838709677419
[ Info: epoch: 50
[ Info: epoch: 51
[ Info: epoch: 52
[ Info: epoch: 53
[ Info: epoch: 54
[ Info: epoch: 55
[ Info: epoch: 56
[ Info: epoch: 57
[ Info: epoch: 58
[ Info: epoch: 59
[ Info: epoch: 60
[ Info: epoch: 61
[ Info: epoch: 62
[ Info: epoch: 63
[ Info: epoch: 64
[ Info: epoch: 65
┌ Info: training
│   loss = 0.5309651f0
└   accuracy = 0.6774193548387096
[ Info: epoch: 66
[ Info: epoch: 67
[ Info: epoch: 68
[ Info: epoch: 69
[ Info: epoch: 70
[ Info: epoch: 71
[ Info: epoch: 72
[ Info: epoch: 73
[ Info: epoch: 74
[ Info: epoch: 75
[ Info: epoch: 76
[ Info: epoch: 77
[ Info: epoch: 78
[ Info: epoch: 79
[ Info: epoch: 80
[ Info: epoch: 81
┌ Info: training
│   loss = 0.4070098f0
└   accuracy = 0.8387096774193549
[ Info: epoch: 82
[ Info: epoch: 83
[ Info: epoch: 84
[ Info: epoch: 85
[ Info: epoch: 86
[ Info: epoch: 87
[ Info: epoch: 88
[ Info: epoch: 89
[ Info: epoch: 90
[ Info: epoch: 91
[ Info: epoch: 92
[ Info: epoch: 93
[ Info: epoch: 94
[ Info: epoch: 95
[ Info: epoch: 96
[ Info: epoch: 97
┌ Info: training
│   loss = 0.37719026f0
└   accuracy = 0.9032258064516129
[ Info: epoch: 98
[ Info: epoch: 99
[ Info: epoch: 100
[ Info: testing
┌ Info: testing
└   accuracy = 0.967741935483871
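For reference, the manual reclaim-then-retry shown above could be scripted roughly as below. This is only a sketch of the workaround (assuming the error thrown is CUDA.OutOfGPUMemoryError, as the trace suggests); it should not be necessary in the first place:

```julia
using CUDA

# Mirror the manual REPL steps: on OOM, reclaim and retry once.
try
    train_func(training_dict, bert_model)
catch err
    err isa CUDA.OutOfGPUMemoryError || rethrow()
    GC.gc(true)      # finalize dead CuArrays
    CUDA.reclaim()   # return cached pool memory to the driver
    train_func(training_dict, bert_model)
end
```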

jackn11 · Jul 12 '22 17:07

Is it possible that the compiled functions from BERTModule are taking up the storage on the GPU? If so, is there a way to clear some of those functions from memory?
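For what it's worth, the pool statistics can be inspected directly; a quick check using only CUDA.jl's built-in reporting functions, in case it helps diagnose this:

```julia
using CUDA

# Driver-level view: free vs. total device memory.
free_mem  = CUDA.available_memory()
total_mem = CUDA.total_memory()
println("free: ", Base.format_bytes(free_mem), " of ", Base.format_bytes(total_mem))

# CUDA.jl's view: live pool allocations vs. memory the pool has reserved
# from the driver (the "reserved" figure in the OOM message above).
CUDA.memory_status()
```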

jackn11 · Jul 12 '22 17:07

Memory handling and GC integration have changed significantly since then, so I don't think this issue as reported here is still relevant. If the problem persists on CUDA.jl#master, feel free to open a new issue!

maleadt · Apr 27 '24 17:04