
DArray: Memory cannot be GC'd when variables are reassigned, causing an `Out of Memory` error or a system hang

[Open] islent opened this issue 4 months ago • 3 comments

MWE:

using DistributedNext
addprocs(8)
@everywhere using Dagger

N = 1024
a = zeros(N,N,N);   # 8GB

b = DArray(a);  #! Repeat this, and the CPU memory keeps growing until the Julia session is killed by the system
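
For example, re-binding b in a loop reproduces the same growth (the global is only needed because the loop reassigns a top-level variable):

for i in 1:8
  global b = DArray(a)  # re-bind b; the old DArray should now be garbage...
end                     # ...but resident memory keeps growing each iteration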

I have tried the following, but the memory usage does not decrease:

b = nothing;

GC.gc()

@everywhere GC.gc()

Is there a way to reuse or manually release the memory of a DArray? I could not find a solution in the documentation or in the GitHub issues.

islent · Aug 10 '25 15:08

I have compared the source code of Dagger.DArray and DistributedArrays.DArray: DistributedArrays.jl implements a finalizer together with Base.close(d::DArray), so the memory can be GC'd once no variable references the DistributedArrays.DArray.
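
Roughly, the pattern looks like this (a minimal sketch with a hypothetical type and a hypothetical chunk registry, not the actual DistributedArrays.jl source; in real use CHUNKS and release_chunk would be defined @everywhere):

using Distributed

# Hypothetical per-worker registry of chunk data, keyed by an array id
const CHUNKS = Dict{Int,Any}()
release_chunk(id) = (delete!(CHUNKS, id); nothing)

mutable struct MyDArray
  id::Int
  pids::Vector{Int}
  function MyDArray(id, pids)
    d = new(id, pids)
    finalizer(close, d)  # GC calls close(d) once nothing references d
    return d
  end
end

function Base.close(d::MyDArray)
  # A finalizer must not block, so defer the remote release to a task
  id, pids = d.id, d.pids
  @async for p in pids
    remotecall_wait(release_chunk, p, id)
  end
  return nothing
end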

islent · Aug 10 '25 15:08

This might be happening because of our distributed refcounting logic in MemPool.jl. Dagger's DArray is implemented to work like a normal Array, such that it cleans itself up when no longer in use, but it also keeps track of references on remote workers. This can be really convenient when it works, but really annoying when it doesn't.

We recently changed the default to MemPool.MEM_RESERVED[] = 0 (which disables the aggressive auto-GC behavior that MemPool employs), because that behavior was causing huge performance issues in constrained environments like laptops. You might try setting it to something like 10^9 to reserve about 1GB of space, which will force MemPool to invoke the GC more aggressively when less than 1GB of memory remains.
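
For example (assuming the setting needs to be applied on every worker, since each process tracks its own memory):

@everywhere Dagger.MemPool.MEM_RESERVED[] = 10^9  # reserve ~1GB per worker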

Regarding the DistributedArrays comparison - in theory we work in a similar way (such that once all references are inaccessible, the DArray can be deleted), but this doesn't always work cleanly, as we've seen. I'd be open to a PR to add close support to both DArrays and DTasks, as a promise from the user that the values contained by these objects will no longer be used. We'd first need to wait for their computations to finish (because Dagger will be tracking them internally), but then we can forcibly clean up their values, and even inject a small "poison" value that ensures that accidental use-after-close attempts result in an appropriate error. Basically it would look like this for DTask:

function Base.close(task::DTask)
  # Wait for the task to finish
  wait(task)
  
  # Replace the current future with a new one
  task.future = ThunkFuture()

  # Set the new future to a poison value to prevent misuse
  put!(task.future, ConcurrencyViolationError("Cannot fetch the result of a closed DTask"); error=true)

  remotecall_wait(1, task) do task
    # Get the Thunk associated with this DTask
    thunk = Dagger.Sch._find_thunk(task)

    # Tell the scheduler that this task is not user-accessible anymore
    thunk.eager_accessible = false

    # Clear out the cached value to let it be GC'd
    # TODO: Set the Thunk into an error state, in case another Thunk has it as a dependency?
    thunk.cache_ref = nothing
  end
end

Doing the same for a DArray is basically just calling close on every element of A.chunks that is a DTask. You could then fill!(A.chunks, nothing) to ensure anything that's not a DTask can be GC'd.
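
As a rough sketch (assuming A.chunks is an array whose eltype admits nothing once the tasks are closed):

function Base.close(A::DArray)
  # Close every chunk that is still a DTask; close waits on each internally
  for c in A.chunks
    c isa DTask && close(c)
  end

  # Drop the remaining references so non-DTask chunks can be GC'd
  fill!(A.chunks, nothing)
  return nothing
end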

jpsamaroo · Aug 11 '25 22:08

Thanks for helping!

However, I tried @everywhere Dagger.MemPool.MEM_RESERVED[] = 1e9, as well as 1e10 and even 1e11; the problem is not solved.

islent · Aug 12 '25 13:08