arrow-julia
Arrow.write using a lot of memory to save record batch
I am using the following function to combine a batch of Arrow files into a single Arrow file of record batches:
function combinebatches(path::String, file::String, batches)
    files = Array{String,1}()
    for i = 1:batches
        push!(files, joinpath(path, "$(file)$(i).arrow"))
    end
    # each input file becomes one partition (record batch) of the output
    arrow_parts = Tables.partitioner(Arrow.Table, files)
    open(joinpath(path, "$(file)_batched.arrow"), "w") do io
        Arrow.write(io, arrow_parts, compress=:zstd)
    end
    # delete chunks
    for i = 1:batches
        rm(joinpath(path, "$(file)$(i).arrow"))
    end
    return nothing
end #combinebatches
Saving about 25 uncompressed Arrow files, averaging 1.8 GB each, required around 50 GB of RAM. Since I partition my data precisely to save RAM, this high memory usage (> 50 GB in this case) defeats the purpose. I expected peak memory to stay near the size of a single file (or perhaps number of threads × file size).
JULIA_NUM_THREADS is set to 4.
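For reference, the workaround I am considering is to process the inputs strictly one at a time with Arrow.append, so that only a single input table needs to be materialized at once. This is only a sketch under my assumptions: it writes the output in the Arrow stream format (Arrow.append does not support the file format, hence file=false), and combinebatches_lowmem is a hypothetical name, not part of the package.

```julia
using Arrow, Tables

# Sketch of a lower-memory combine: write the first part, then append the
# remaining parts one by one instead of handing all partitions to one
# Arrow.write call. Assumes stream-format output is acceptable.
function combinebatches_lowmem(path::String, file::String, batches)
    out = joinpath(path, "$(file)_batched.arrow")
    # First part establishes the schema; file=false selects the stream
    # format, which is required for later Arrow.append calls.
    open(out, "w") do io
        Arrow.write(io, Arrow.Table(joinpath(path, "$(file)1.arrow"));
                    compress=:zstd, file=false)
    end
    # Append each remaining part as its own record batch; only one input
    # table is loaded at a time, so peak memory stays near one file's size.
    for i = 2:batches
        Arrow.append(out, Arrow.Table(joinpath(path, "$(file)$(i).arrow")))
    end
    return nothing
end
```

Whether Arrow.append preserves the zstd compression of the existing stream, or needs it specified per call, is something I have not verified.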