
Julia garbage collector does not reclaim GPU memory

bonsairobo opened this issue 9 years ago • 9 comments

I'm trying to write Neural Style in MXNet.jl, and I keep running out of memory when I try to make new executors (and delete the old ones). My basic strategy is to store the executor in an exec variable and do

exec = 0
gc()

when I want to reclaim GPU memory for that executor. This does not work as expected: I am tracking CUDA memory usage with nvidia-smi, and there is never a drop in usage after calling gc().

Does anyone know of a way to reclaim GPU memory? Here is my code for reference: https://github.com/bonsairobo/mxnet-neural-style/blob/master/stylenet.jl

bonsairobo avatar Apr 24 '16 20:04 bonsairobo

GC is really unpredictable; I guess the generational GC is retaining some of the objects because they are still young? Maybe you can try to explicitly call the destructor, e.g. mx.delete!(exec.handle).
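
To illustrate the timing issue (a minimal sketch, not MXNet.jl's actual internals; Handle here is a stand-in for an MXNet handle wrapper):

type Handle            # Julia 0.4-era syntax, as in this thread
    ptr::Int
end

h = Handle(1)
finalizer(h, x -> println("finalizer ran; GPU memory would be freed here"))

h = 0                  # drop the last reference, as in the snippet above
gc()                   # gc() is a request, not a guarantee: the generational GC
                       # may keep young objects alive, delaying the actual free
# (on Julia >= 1.0: mutable struct, finalizer(f, h), and GC.gc())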

pluskid avatar Apr 24 '16 22:04 pluskid

How do I import mx.delete!? It seems like a private API.

bonsairobo avatar Apr 24 '16 23:04 bonsairobo

Probably cannot call it directly. How about calling finalize(exec.handle)?

pluskid avatar Apr 24 '16 23:04 pluskid

See #84

vchuravy avatar Apr 24 '16 23:04 vchuravy

I tried

mx.finalize(x.handle)
x = 0
gc()

and the GPU memory is still allocated.

bonsairobo avatar Apr 24 '16 23:04 bonsairobo

MXNet has its own internal memory pool that retains memory for future arrays, because CUDA allocation is slow. So the memory goes back to the pool, but it is not freed back to NVIDIA's runtime.
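
A toy sketch of that caching behavior (illustrative only; the real pool is the C++ code in src/storage/pooled_storage_manager.h, and cuda_malloc below is a hypothetical wrapper around cudaMalloc):

const FREE_POOL = Dict{Int, Vector{UInt}}()      # nbytes => cached device pointers

function pool_alloc(nbytes)
    list = get!(FREE_POOL, nbytes, UInt[])
    isempty(list) && return cuda_malloc(nbytes)  # pool miss: ask the CUDA runtime
    pop!(list)                                   # pool hit: reuse a cached block
end

function pool_free(ptr, nbytes)
    # The block is cached for reuse, NOT cudaFree'd, so nvidia-smi still
    # counts it as allocated to the process.
    push!(get!(FREE_POOL, nbytes, UInt[]), ptr)
end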


tqchen avatar Apr 25 '16 00:04 tqchen

Oh, that helps my understanding! What is the policy for re-using memory in the pool? E.g., what if I finalize a chunk of memory and then ask for a larger chunk? Would the older chunk be reused? Would the pool ever return memory to CUDA in order to allocate a larger contiguous chunk?

The reason I ask is that I am trying to create two different executors of the same network corresponding to different input sizes. I know I have enough memory to support either input size separately, but I cannot figure out how to allocate them both at mutually exclusive times in my code.

Sorry if this is a lot of questions. I can also take a look at the mxnet engine code if it is easily comprehensible to a non-DMLC member.

bonsairobo avatar Apr 25 '16 00:04 bonsairobo

There are two factors in executor memory consumption.

  • The executor itself tries to retain and share memory between nodes without runtime re-allocation.
    • Memory sharing within an executor can re-use blocks of different sizes, since the allocation plan is computed statically and does not need to be fast at runtime.
  • The imperative API uses exact size matching in its memory pool for speed reasons, so blocks of a different size won't be re-used under the current strategy, unless the memory requirement hits a wall and a free-and-reallocate pass happens. See https://github.com/dmlc/mxnet/blob/master/src/storage/pooled_storage_manager.h

If you are using two executors exclusively, there is support for memory sharing between executors, e.g. the bucketing API, which is currently supported in Python. In that setting you can bind the executor with the larger input size and share its memory with the smaller executor.
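
Roughly like this (a hypothetical Julia transliteration; the shared_exec keyword mirrors Python's Symbol.bind argument and may not exist in MXNet.jl, so treat the names as assumptions):

big_exec   = mx.bind(net, mx.gpu(), big_args)    # bind the larger input size first
small_exec = mx.bind(net, mx.gpu(), small_args,
                     shared_exec = big_exec)     # hypothetical: reuse big_exec's pool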

tqchen avatar Apr 25 '16 00:04 tqchen

That's good to know about the Python memory sharing. I'm going to stick with the Julia API for now.

I cannot seem to reuse an old (no longer needed) executor's GPU memory for a new executor, even after finalizing the handles. I think a simple API to explicitly free GPU memory would be very helpful (even if less performant) in some scenarios.

For now, I am going to make all input data the same size through resizing. This may have adverse effects on the results, but they will likely be negligible.
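
Something along these lines (a sketch assuming Images.jl; TARGET and imgs are illustrative names, not from stylenet.jl):

using Images                    # assumption: Images.jl provides imresize

const TARGET = (256, 256)       # illustrative fixed input size
resized = [imresize(img, TARGET) for img in imgs]   # one size, so one executor suffices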

bonsairobo avatar Apr 25 '16 01:04 bonsairobo