YAXArrays.jl
out of memory when using Distributed
It is unclear to me why the following runs out of memory. The YAXArray that I'm using is not that big. Is it being copied/transferred to each worker? If so, that seems inefficient, and an approach similar to SharedArrays [or is it not like that already? 😕] would probably be better, if that is remotely an option.
Example adapted from here.
using Distributed
addprocs(7)
@everywhere using Pkg
@everywhere Pkg.activate(".")
@everywhere using YAXArrays
@everywhere using Statistics
@everywhere using Zarr
@everywhere function mymean(output, pixel)
    output = mean(pixel)
end
axlist = [
    RangeAxis("time", range(1, 20, length=2000)),
    RangeAxis("x", range(1, 10, length=200)),
    RangeAxis("y", range(1, 5, length=200)),
    CategoricalAxis("Variable", ["var1", "var2"])]
data = rand(2000, 200, 200, 2);
ds = YAXArray(axlist, data)
indims = InDims("Time")
outdims = OutDims()
resultcube = mapCube(mymean, ds, indims=indims, outdims=outdims)
As already mentioned yesterday, since every worker uses some memory, this can happen, so you should try setting the kwarg max_cache to something smaller, e.g. max_cache=1e8.
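In your example that would be, roughly (1e8 bytes, i.e. about 100 MB, is just an illustrative value):

resultcube = mapCube(mymean, ds, indims=indims, outdims=outdims, max_cache=1e8)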
Please note that in your test case the whole input array lives in memory and is not chunked, so it is always better to start with data living on disk (like Zarr or NetCDF) if you want to test the package in a real-world setting. For parallel processing of in-memory data there are more efficient packages, like DistributedArrays, SharedArrays, or similar (as you mentioned yesterday).
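As a rough sketch of what that could look like for the example above (the path is just an example, and savecube/Cube are assumed to be available in your YAXArrays version):

# write the in-memory cube to a chunked Zarr store and reopen it from disk,
# so workers read chunks lazily instead of receiving the whole array
savecube(ds, "example.zarr")
ds_disk = Cube("example.zarr")
resultcube = mapCube(mymean, ds_disk, indims=indims, outdims=outdims, max_cache=1e8)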
Also, your function does not do what you expect: it should be output[:] = mean(pixel) instead, otherwise you don't write anything into the output.
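That is, the inner function should write into the preallocated output buffer:

@everywhere function mymean(output, pixel)
    output[:] = mean(pixel)   # fill the output buffer in place; the return value is not used
end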
I have similar problems. I tried a simple mapslices(sum, cube, dims="time", max_cache=1e8). This takes some time since the cube is not that small, but it doesn't free the used memory. So no error here, but no memory left at 5% ETA. I have a chunked cube and am using 10 processes on a machine with 100 GB RAM (the same happens with only 2 procs). Any ideas what I can try?
My problem seems to be a garbage-collection issue under Julia 1.9. With Julia 1.8.5 there is no such problem.
@TabeaW yes, we are seeing GC-related problems for long-running, IO-intensive jobs as well, where memory usage just keeps increasing. However, it does not seem to be a memory leak, i.e. manually calling GC.gc() at some places in the code helps. On some machines we could improve the situation by starting julia with the heap-size-hint argument, but that does not seem to be a fix in all cases. I will make a branch where we add some explicit gc() calls in the main loop; it would be interesting to know if that helps your use case.
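For anyone trying the heap-size-hint route together with Distributed: the hint also has to reach the worker processes, which can be done via the exeflags keyword of addprocs (a sketch; the 8G value is just an example):

using Distributed
# start the main process with e.g.  julia --heap-size-hint=8G --project=.
# and forward the same hint to every worker:
addprocs(7; exeflags="--heap-size-hint=8G")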
Yes, I tried that already. It helped a lot, but on the other hand it increased the time needed.
Ok, I guess you called gc manually inside the function you passed to mapslices? Maybe the branch in https://github.com/JuliaDataCubes/YAXArrays.jl/pull/265 does not have such a big performance impact, because there GC is only triggered after processing a block of data, and not after every function call.
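For reference, the manual variant being compared against would look roughly like this (just a sketch; GC.gc() here runs a full collection for every single slice, which is where the slowdown comes from):

mapslices(cube, dims="time", max_cache=1e8) do slice
    res = sum(slice)
    GC.gc()   # full garbage collection after every slice -> high overhead
    res
end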