
Memory out of control because of MemPool no longer maintained?

Open · bearxu83 opened this issue on Feb 27, 2019 · 8 comments

I tried several times to check whether the total memory usage stays below (number of workers × chunk size). Basically, it has NO limit.

I tried to use `MemPool.max_memsize` to set a limit, but it seems that MemPool doesn't pass its unit tests.

So I suppose the out-of-core functionality is broken now. Is that right?

bearxu83 avatar Feb 27 '19 22:02 bearxu83

MemPool is maintained and passes tests (https://travis-ci.org/JuliaComputing/MemPool.jl). I've also used JuliaDB out-of-core within the last week.

What did you try and what was broken?

joshday avatar Feb 28 '19 00:02 joshday

In my test on Julia 1.0.3 and Windows 10, after a reduce, all the contents of the selected column remain in memory. The total memory usage is not (number of workers × chunk size) as written in the docs, but rather the total size of the selected column across all chunks.

I tried to use `@everywhere MemPool.max_memsize[] = 10^9 # 1 GB per worker` from the MemPool docs to limit the memory usage, but it failed.
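A minimal sketch of that setup (the worker count here is arbitrary):

```julia
using Distributed
addprocs(4)                # arbitrary number of workers
@everywhere using MemPool

# Per the MemPool docs quoted above: cap the pool at ~1 GB per worker.
# In my testing this setting had no visible effect.
@everywhere MemPool.max_memsize[] = 10^9
```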

bearxu83 avatar Feb 28 '19 02:02 bearxu83

Do you have some code I can use to reproduce the issue? I don't completely follow what occurred. It sounds like, while using `reduce` with `MemPool.max_memsize` set, your memory use grew enough that you think an entire column (from all workers) was moved into memory?

Off-topic: I think you accidentally tagged another user. Try to put code in backticks: `@everywhere`.

joshday avatar Feb 28 '19 12:02 joshday

What I mean is that:

When we need to compute something, all the contents of the chunks are written to the MemPool. For example, in https://github.com/JuliaParallel/Dagger.jl/blob/master/src/chunks.jl, line 74:

```julia
function collect(ctx::Context, ref::Union{DRef, FileRef})
    poolget(ref)
end
```

After collect, the contents of the chunks are put into the pool and stay in memory. MemPool is designed to be able to control memory usage via `MemPool.max_memsize`, but that code is commented out. See https://github.com/JuliaComputing/MemPool.jl/blob/master/src/datastore.jl, line 276.
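A minimal sketch of that behavior, assuming the `poolset`/`poolget` pair from MemPool's API:

```julia
using MemPool

# Store a value in the pool; poolset returns a DRef handle.
ref = poolset(rand(10^6))

# Dagger's collect (quoted above) reduces to exactly this call;
# after it, the chunk's contents are resident in memory.
x = poolget(ref)

# With the spill-to-disk code in datastore.jl commented out, nothing
# evicts these entries until the DRef itself is released.
```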

So now, from my understanding, out-of-core processing is broken.

bearxu83 avatar Feb 28 '19 17:02 bearxu83

Good detective work! The thing that I commented out is the spill-to-disk functionality, which was more headache than it was worth.

However, if you're just doing a reduce, then it should only read the memory-mapped data and be done.

Can you post the output of

```julia
Dagger.collect(Dagger.delayed(typeof)(tbl.chunks[1]))
```

I am wondering how much of your data is memory-mapped.

One more thing: any intermediate data should get GC'd. The out-of-coreness only breaks if you're creating a new big table at top level (e.g. doing a `select`).

If you want to do a `select` just for the sake of reducing immediately, try doing `reduce(..., select = (selector...))` instead!
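A minimal sketch of the difference, with a hypothetical table `tbl` and column `:value`:

```julia
using JuliaDB

# Materializing the selection at top level creates a new big object
# that stays in memory:
# col = select(tbl, :value)
# total = sum(col)

# Reducing over the selection directly processes the data chunk by chunk:
total = reduce(+, tbl; select = :value)
```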

shashi avatar Feb 28 '19 23:02 shashi

Thanks for explaining.

I happened to have a table in which a JSON object is serialized into one of the columns, so that column is very large.

```julia
julia> Dagger.collect(Dagger.delayed(typeof)(dff.chunks[1]))
IndexedTable{StructArrays.StructArray{NamedTuple{(:channelGrouping, :customDimensions, :date, :device, :fullVisitorId, :geoNetwork, :hits, :socialEngagementType, :totals, :trafficSource, :visitId, :visitNumber, :visitStartTime),Tuple{String,String,Int64,String,Float64,String,String,String,String,String,Int64,Int64,Int64}},1,NamedTuple{(:channelGrouping, :customDimensions, :date, :device, :fullVisitorId, :geoNetwork, :hits, :socialEngagementType, :totals, :trafficSource, :visitId, :visitNumber, :visitStartTime),Tuple{WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},Array{Float64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},Array{Int64,1},Array{Int64,1},Array{Int64,1}}}}}
```

When doing a `select`: JuliaDB is column-based, and most of the time leaving the numerical columns in memory is OK. So could we add something like a `disposable=true` option somewhere, so that a specific column of a chunk is not cached?

Do you know how I could "close" the DIndexedTable and release the memory? I think this is important.
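The only workaround I can think of is a sketch like the following, assuming the pooled chunks are released by GC finalizers once the last reference is dropped:

```julia
# Drop the last binding to the distributed table, then force a GC on
# every worker so any finalizers on the pooled chunks can run.
dff = nothing
@everywhere GC.gc()
```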

I just wrote a function that allows JuliaDB to import from one or several giant CSV files. It seems that I can do the table transform before it is written to disk. I will release it soon.
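For comparison, the multi-file out-of-core load that already works looks roughly like this (paths and chunk count are placeholders, and the giant CSV is assumed to be pre-split into part files, e.g. with the Unix `split` utility):

```julia
using JuliaDB

# Placeholder paths: part files produced by pre-splitting the giant CSV.
files = joinpath.("datadir",
                  filter(f -> endswith(f, ".csv"), readdir("datadir")))

# `output` writes the parsed chunks to disk rather than keeping them in
# RAM; `chunks` controls how many chunks the data is split into.
tbl = loadtable(files; output = "tbl_binary", chunks = 8)
```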

Thanks again.

bearxu83 avatar Mar 01 '19 01:03 bearxu83

@shashi can you email me? This MemPool issue has gone up to the CFO of my customer. Hoping you can help. Chuck

cwiese avatar Apr 02 '19 21:04 cwiese

> I just wrote a function that allows JuliaDB to import from one or several giant CSV files. It seems that I can do the table transform before it is written to disk. I will release it soon.

This sounds great! We don't yet handle a single big CSV file well.

shashi avatar Apr 05 '19 05:04 shashi