JuliaDB.jl
Memory out of control because MemPool is no longer maintained?
I tried several times to check whether the total memory usage stays below (number of workers * chunk size). Basically, there is NO limit.
I tried to use MemPool.max_memsize to set a limit, but it seems that MemPool doesn't pass its unit tests.
So I suppose the out-of-core functionality is broken now. Is that right?
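For reference, this is roughly the kind of out-of-core workflow I mean (a minimal sketch; the file and column names are placeholders, not my actual data):

```julia
using Distributed
addprocs(4)                         # one worker per chunk
@everywhere using JuliaDB

# Ingest the CSV parts into an on-disk binary table, one chunk per file,
# then reload it lazily for out-of-core processing.
loadtable(["part1.csv", "part2.csv", "part3.csv", "part4.csv"];
          output = "table_bin", chunks = 4)
tbl = load("table_bin")

# I expected this to touch roughly one chunk per worker at a time.
reduce(+, tbl; select = :somecolumn)
```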
MemPool is maintained and passes tests (https://travis-ci.org/JuliaComputing/MemPool.jl). I've also used JuliaDB out-of-core within the last week.
What did you try and what was broken?
In my test on Julia 1.0.3 and Windows 10: after a reduce, all the contents of the selected column remain in memory. The total memory usage is not (number of workers * chunk size) as written in the docs, but (total size of the selected column across all chunks).
I tried to use `@everywhere MemPool.max_memsize[] = 10^9 # 1 GB per worker` from the MemPool docs to limit the memory usage, but it had no effect.
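Concretely, this is what I ran before loading the data (a sketch of my attempt, assuming the workers are already added):

```julia
@everywhere using MemPool

# Try to cap the pool at 1 GB per worker, as suggested in the MemPool docs.
@everywhere MemPool.max_memsize[] = 10^9
```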
Do you have some code I can use to reproduce the issue? I don't completely follow what occurred. It sounds like while using reduce and setting MemPool.max_memsize, your memory use grew so that you think an entire column (from all workers) was moved to memory?
Off-topic: I think you accidentally tagged another user. Try to put code in backticks: `@everywhere`.
What I mean is this:
When we need to compute something, all the contents of the chunks get put into the MemPool.
For example, in https://github.com/JuliaParallel/Dagger.jl/blob/master/src/chunks.jl, line 74:

```julia
function collect(ctx::Context, ref::Union{DRef, FileRef})
    poolget(ref)
end
```
After collect, the contents of the chunks are put into the pool and stay in memory.
MemPool is designed to be able to control memory usage by setting MemPool.max_memsize.
But that code is commented out.
See https://github.com/JuliaComputing/MemPool.jl/blob/master/src/datastore.jl, line 276.
So, from my understanding, the out-of-core processing is currently broken.
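To illustrate the behaviour I mean with MemPool's public API (a small sketch; the array is just a stand-in for a chunk's contents):

```julia
using MemPool

# Store some data in the pool and get a reference (DRef) back.
ref = poolset(rand(10^6))

# poolget materializes the referenced data in this process's memory,
# and with the spill-to-disk code commented out, nothing ever evicts it.
data = poolget(ref)
```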
Good detective work! The thing that I commented out is spill-to-disk functionality which was more headache than worth it.
However, if you're just doing a reduce, then it should only read the memory-mapped data and be done.
Can you post the output of the following?

```julia
Dagger.collect(Dagger.delayed(typeof)(tbl.chunks[1]))
```
I am wondering how much of your data is memory-mapped.
One more thing: any intermediate data should get GC'd. The out-of-coreness only breaks if you're creating a new big table at top level (e.g. by doing a select).
If you want to do a select just for the sake of reducing immediately, try doing `reduce(..., select=(selector...))` instead!
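For example, something along these lines (a sketch; `:a` and `:b` are placeholder column names):

```julia
using JuliaDB

t = table((a = 1:5, b = 10:10:50))

# Instead of materializing the selected column at top level...
reduce(+, select(t, :b))

# ...pass the selector straight to reduce, so no intermediate table is kept around.
reduce(+, t; select = :b)
```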
Thanks for explaining.
I happen to have a table in which a JSON object is serialized into one of the columns, so that column is very large.
```julia
julia> Dagger.collect(Dagger.delayed(typeof)(dff.chunks[1]))
IndexedTable{StructArrays.StructArray{NamedTuple{(:channelGrouping, :customDimensions, :date, :device, :fullVisitorId, :geoNetwork, :hits, :socialEngagementType, :totals, :trafficSource, :visitId, :visitNumber, :visitStartTime),Tuple{String,String,Int64,String,Float64,String,String,String,String,String,Int64,Int64,Int64}},1,NamedTuple{(:channelGrouping, :customDimensions, :date, :device, :fullVisitorId, :geoNetwork, :hits, :socialEngagementType, :totals, :trafficSource, :visitId, :visitNumber, :visitStartTime),Tuple{WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},Array{Float64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},Array{Int64,1},Array{Int64,1},Array{Int64,1}}}}}
```
When doing a select: since JuliaDB is column-based, leaving the numerical columns in memory is usually OK. So could we add something like `disposable=true` somewhere to allow us not to cache a specific column of a chunk?
Do you know how I could "close" the `DIndexedTable` and release the memory? I think this is important.
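For context, this is the kind of workaround I have in mind (a rough sketch, not an official API; it assumes each entry of `tbl.chunks` is a `Dagger.Chunk` whose `handle` field is a `MemPool.DRef`, and `release!` is just a name I made up):

```julia
using Distributed, Dagger, MemPool

# Hypothetical helper: drop the pooled data behind each chunk of a
# distributed table, then ask every process to run the garbage collector.
function release!(tbl)
    for c in tbl.chunks
        if c isa Dagger.Chunk && c.handle isa MemPool.DRef
            MemPool.pooldelete(c.handle)   # free this chunk's entry in the pool
        end
    end
    @everywhere GC.gc()
    return nothing
end
```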
I just wrote a function that allows JuliaDB to import one or several giant CSV files. It seems that I can apply a table transform before it is written to disk. I will release it soon.
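Roughly, the idea is along these lines (a simplified sketch of the approach, not the code I will release; `split_csv` and the file names are placeholders):

```julia
using JuliaDB

# Split one giant CSV into part files of at most `maxlines` data rows each,
# repeating the header in every part, and return the part file names.
function split_csv(path; maxlines = 1_000_000)
    parts = String[]
    open(path) do io
        header = readline(io)
        part, count, out = 0, maxlines, nothing
        for line in eachline(io)
            if count >= maxlines
                out === nothing || close(out)
                part += 1
                name = string(path, ".part", part, ".csv")
                push!(parts, name)
                out = open(name, "w")
                println(out, header)
                count = 0
            end
            println(out, line)
            count += 1
        end
        out === nothing || close(out)
    end
    return parts
end

# Load the parts as one on-disk (out-of-core) table, one chunk per part.
parts = split_csv("giant.csv")
tbl = loadtable(parts; output = "giant_bin", chunks = length(parts))
```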
Thanks again.
@shashi can you email me Chuck? This MemPool issue has gone up to the CFO of my customer. Hoping you can help.
> I just wrote a function that allows JuliaDB to import one or several giant CSV files. It seems that I can apply a table transform before it is written to disk. I will release it soon.
This sounds great! We don't yet handle a single big CSV file well.