
out of memory when using Distributed

Open · lazarusA opened this issue on Jul 19 '22 · 6 comments

It's unclear to me why the following runs out of memory. The YAXArray that I'm using is not that big. Is it being copied/transferred to each worker? If so, that seems inefficient, and an approach similar to SharedArrays [isn't it like that already? 😕] would probably be better, if that is remotely an option. Example adapted from here.

using Distributed
addprocs(7)

@everywhere using Pkg
@everywhere Pkg.activate(".")
@everywhere using YAXArrays
@everywhere using Statistics
@everywhere using Zarr
@everywhere function mymean(output, pixel)
    output = mean(pixel)
end

axlist = [
    RangeAxis("time", range(1, 20, length=2000)),
    RangeAxis("x", range(1, 10, length=200)),
    RangeAxis("y", range(1, 5, length=200)),
    CategoricalAxis("Variable", ["var1", "var2"])]
data = rand(2000, 200, 200, 2);
ds = YAXArray(axlist, data)

indims = InDims("Time")
outdims = OutDims()

resultcube = mapCube(mymean, ds, indims=indims, outdims=outdims)

lazarusA avatar Jul 19 '22 13:07 lazarusA

As already mentioned yesterday, since every worker is using some memory, this can happen, so you should try setting the kwarg max_cache to something smaller, e.g. max_cache=1e8.

Please note that in your test case the whole input array lives in memory and is not chunked, so it is always better to start with data living on disk (like Zarr or NetCDF) if you want to test the package in a real-world case. For parallel processing of in-memory data there are more efficient packages, like DistributedArrays, SharedArrays, or similar (as you mentioned yesterday). A sketch of the disk-backed workflow follows below.
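To illustrate both suggestions, here is a minimal sketch of the disk-backed workflow, assuming savecube and Cube behave as in the YAXArrays version used above (the Zarr path is illustrative):

using YAXArrays, Zarr

# Persist the in-memory cube to a chunked Zarr store on disk.
savecube(ds, "mydata.zarr")

# Reopen it lazily, so workers stream chunks instead of holding the full array.
dsdisk = Cube("mydata.zarr")

# Process with a bounded per-process cache (~100 MB).
resultcube = mapCube(mymean, dsdisk; indims=indims, outdims=outdims, max_cache=1e8)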

Also, your function does not do what you expect: it should be output[:] = mean(pixel) instead; otherwise you only rebind the local variable and never write anything into the output.
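A minimal sketch of that fix:

@everywhere function mymean(output, pixel)
    output[:] = mean(pixel)  # fill the preallocated buffer in place; a plain `output = ...` only rebinds the local name
end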

meggart avatar Jul 20 '22 12:07 meggart

I have similar problems. I tried a simple mapslices(sum, cube, dims="time", max_cache=1e8); this takes some time since the cube is not that small, but it doesn't free the used memory. So no error here, but no memory left after only 5% progress. I have a chunked cube and am using 10 processes on a machine with 100 GB RAM (the same happens with only 2 procs). Any ideas what I can try?

TabeaW avatar Jun 21 '23 07:06 TabeaW

My problem seems to be a garbage collection issue under julia 1.9. With julia 1.8.5 there is no such problem.

TabeaW avatar Jun 22 '23 13:06 TabeaW

@TabeaW yes, we are seeing GC-related issues for long-running IO-intensive jobs as well, where memory usage just keeps increasing. However, it does not seem like there is a memory leak, i.e. manually calling GC.gc() at some places in the code helps. On some machines we could improve the situation by starting julia with the --heap-size-hint argument, but that does not seem to be a fix in all cases. I will make a branch where we add some explicit GC.gc() calls in the main loop; it would be interesting to know if that helps your use case.
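For example, the heap-size hint can also be passed to the workers when they are started; a sketch, assuming Julia >= 1.9 (the 4G value is illustrative):

using Distributed

# Start workers with a heap-size hint so each process triggers GC earlier.
addprocs(7; exeflags="--heap-size-hint=4G")

# Alternatively, force a collection on all processes at convenient points:
@everywhere GC.gc()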

meggart avatar Jun 23 '23 06:06 meggart

Yes, I tried that already; it helped a lot, but on the other hand it increased the runtime.

TabeaW avatar Jun 23 '23 06:06 TabeaW

Ok, I guess you called GC.gc() manually inside the function you passed to mapslices? Maybe this branch https://github.com/JuliaDataCubes/YAXArrays.jl/pull/265 does not have such a big performance impact, because there GC is only triggered after processing a block of data, and not after every function call.
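For comparison, the per-slice variant guessed at above would look roughly like this sketch (GC runs once per slice, which would explain the slowdown):

mapslices(cube; dims="time", max_cache=1e8) do x
    res = sum(x)
    GC.gc()  # collect after every single slice -- correct, but costly
    res
end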

meggart avatar Jun 23 '23 06:06 meggart