
Arrow.write uses a lot of memory when saving record batches

kobusherbst opened this issue on Sep 11, 2021 · 0 comments

I am using this function to combine a set of Arrow files into a single Arrow file containing multiple record batches:

using Arrow, Tables

function combinebatches(path::String, file::String, batches::Int)
    # Collect the paths of the per-batch files: "$(file)1.arrow" .. "$(file)$(batches).arrow"
    files = [joinpath(path, "$(file)$(i).arrow") for i in 1:batches]
    # Present each file as a separate partition; Arrow.write emits one record batch per partition
    arrow_parts = Tables.partitioner(Arrow.Table, files)
    open(joinpath(path, "$(file)_batched.arrow"), "w") do io
        Arrow.write(io, arrow_parts, compress=:zstd)
    end
    # Delete the per-batch source files now that they are combined
    for i = 1:batches
        rm(joinpath(path, "$(file)$(i).arrow"))
    end
    return nothing
end #combinebatches
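
For reference, a call matching the scenario below (hypothetical path and file prefix) would combine mydata1.arrow through mydata25.arrow into mydata_batched.arrow:

combinebatches("/data/output", "mydata", 25)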

Saving about 25 uncompressed Arrow files, averaging 1.8 GB each, required around 50 GB of RAM. Since I am partitioning my data precisely to save RAM, this high memory usage (> 50 GB in this case) defeats the purpose. I expected to need no more than the size of the largest single file (or perhaps number of threads × file size) to write the record batches.

JULIA_NUM_THREADS is set to 4.
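
A lower-memory variant I am considering (an untested sketch; it assumes Arrow.write supports the ntasks keyword, which limits how many record batches are buffered for writing at once):

using Arrow, Tables

function combinebatches_lowmem(path::String, file::String, batches::Int)
    files = [joinpath(path, "$(file)$(i).arrow") for i in 1:batches]
    arrow_parts = Tables.partitioner(Arrow.Table, files)
    open(joinpath(path, "$(file)_batched.arrow"), "w") do io
        # Assumption: ntasks=1 keeps at most one partition in flight,
        # so peak memory stays near the size of a single input file
        Arrow.write(io, arrow_parts; compress=:zstd, ntasks=1)
    end
    return nothing
end

If that does not help, another option might be Arrow.append: write the first part with file=false (stream format, which append requires) and then append the remaining parts one at a time, so only one table is live in memory.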
