
Julia slows down when reading thousands of JLD files

Open gasagna opened this issue 8 years ago • 19 comments

This issue was initially reported at JuliaLang/julia#17554. Discussion can continue here.

gasagna avatar Jul 23 '16 10:07 gasagna

(Friendly bump) Is there any chance this issue can be easily addressed or worked around? I have several tens of thousands of .jld files, and reading data from all of them takes ages, unnecessarily.

I might consider temporarily switching to a different storage format until this bug/feature is fixed. Any ideas on alternatives? Files are under 1 MB each and contain fewer than 5 arrays.

gasagna avatar Aug 01 '16 17:08 gasagna

I don't think we know yet what causes this. I hope there is a straightforward way to change the way we use the HDF5 library that removes this problem (explicitly calling the HDF5 garbage collector? using the "strong closing" flag when opening the file? updating to HDF5 1.10?), but these are just guesses.

What other format are you contemplating? CSV?

eschnett avatar Aug 01 '16 17:08 eschnett

Thanks for chiming in.

Unfortunately I am out of ideas for alternative formats... Maybe numpy .npy via PyCall, but I have not tested it yet. It has to be binary, as I have a ~100 GB dataset that would be too large in ASCII files.
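To make the .npy idea concrete, this is roughly what I have in mind (untested; file name made up):

```julia
using PyCall
@pyimport numpy as np

# One array per .npy file, written and read back through numpy.
np.save("00001.npy", randn(5, 5))
X = np.load("00001.npy")
```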

gasagna avatar Aug 01 '16 17:08 gasagna

The problem seems to be that there are so many files. Can you get by with fewer files? I would consider putting all 100 GByte into the same HDF5 file.

If your data have a simple format (e.g. matrix, array), then you can write them directly into an HDF5 file, circumventing JLD.
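A minimal sketch of what I mean, assuming HDF5.jl and made-up dataset names (one dataset per matrix, all in a single file):

```julia
using HDF5

# Store every small matrix as its own dataset inside one HDF5 file,
# skipping JLD's extra type metadata entirely.
h5open("all-data.h5", "w") do file
    for i = 1:10000
        write(file, @sprintf("X%05d", i), randn(5, 5))
    end
end

# Read a single matrix back by name.
X = h5read("all-data.h5", "X00042")
```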

eschnett avatar Aug 01 '16 18:08 eschnett

The problem seems to be a libhdf5 problem, however.

@gasagna, you might have to do a bit of legwork. What happens if you comment out the problematic block we discussed at https://github.com/JuliaLang/julia/issues/17554?

Also, upgrading to a new libhdf5 seems likely to help---they have apparently ironed out a bunch of problems.

timholy avatar Aug 01 '16 19:08 timholy

@eschnett I could store everything in one large file, but I do not want to do that for several reasons, e.g. avoiding massive data corruption, being able to delete/move/reorganise files easily, ... (Does HDF5.jl allow deleting a dataset in a file? Is the associated storage removed from the file, i.e. does the file size decrease?)

@timholy Will try your suggestions asap.

gasagna avatar Aug 02 '16 09:08 gasagna

@gasagna Yes to both, with HDF5 1.10.
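(Side note: with the 1.8 format the space of a deleted dataset is not reclaimed automatically, but repacking the file with the `h5repack` tool that ships with HDF5 recovers it. A rough sketch, with made-up file names:)

```julia
# Rewrite the file without the unreferenced space left behind by deletions.
run(`h5repack data.h5 data-compact.h5`)
mv("data-compact.h5", "data.h5", remove_destination=true)
```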

eschnett avatar Aug 02 '16 12:08 eschnett

@gasagna There are also efficient tools to convert from/to older HDF5 file formats (if you encounter an old tool that needs to use the data), as well as tools to reorganize the data, e.g. transparently adding compression, chunking, different internal representations, etc.

eschnett avatar Aug 02 '16 14:08 eschnett

Isn't 1.10 backward compatible?

gasagna avatar Aug 02 '16 15:08 gasagna

1.10 can use a new file format that is more efficient. 1.10 can of course read the 1.8 format, but if you are using the new features, then the 1.8 library can't read the respective parts of the file. There's a conversion tool that downgrades the file (in place, apparently just rewriting some metadata, very fast). When you create a file with the 1.10 library, you can decide (by setting a flag) whether to avoid the new 1.10 features.

In other words: the 1.8 format has certain limitations that are corrected by the 1.10 library. You have to choose between backward compatibility and the new features.
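(If I remember the name correctly, that conversion tool is `h5format_convert` from the 1.10 distribution; a sketch of calling it from Julia, with a made-up file name:)

```julia
# Rewrites only the metadata of the file, in place, so that 1.8 readers can open it.
run(`h5format_convert data.h5`)
```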

eschnett avatar Aug 02 '16 18:08 eschnett

Cool, Thanks!

gasagna avatar Aug 02 '16 18:08 gasagna

Ok, back from vacations...

Here is the script I used for timing and profiling.

using JLD

# create `Nfiles`  JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
end

# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch.
# Return a vector of times.
function foo(Nfiles, Nbatch, dir)
    M = div(Nfiles, Nbatch)
    s = 0.0
    times = Float64[]
    for j = 1:Nbatch
        I = (j-1)*M + 1
        t = @elapsed for i = I:(I+M-1)
            f = @sprintf "%s/%05d.jld" dir i
            X = JLD.load(f, "X")::Matrix{Float64}
            s += sum(X)
        end
        push!(times, t)
        print("\r $j $t")
    end
    println()
    times
end

# ~ Make files in `jld-tmp` once, then comment out
# bar(10000, "jld-tmp")

# ~ Now read files in batches
# precompile
foo(1, 1, "jld-tmp");

# ~ Time/Profile code, comment as necessary...
# @time foo(10000, 50, "jld-tmp");
@profile foo(10000, 50, "jld-tmp");
Profile.print()

Here are the results of timing the code five times, each time from scratch, where "patched" stands for JLD.jl with lines 146-148 commented out, as recommended by Tim.


# Times over five runs
original = [4.75, 4.61, 4.36, 4.45, 4.42]
patched  = [4.62, 4.37, 4.45, 4.42, 4.42]

As you can see, no big difference...

Now look at the relevant parts of the profiling sessions

# Profiling output

# original
  3322 ./client.jl; _start; line: 378 
  ... 
  stuff
  ...
   2086 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
   1    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2208
   1    ....4/JLD/src/JLD.jl; close; line: 136
   3    ....4/JLD/src/JLD.jl; close; line: 138
    1 dict.jl; skip_deleted; line: 793
   4    ....4/JLD/src/JLD.jl; close; line: 143
   1533 ....4/JLD/src/JLD.jl; close; line: 146
    1528 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2201
    5    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2205
   4    ....4/JLD/src/JLD.jl; close; line: 147
    4 .../HDF5/src/plain.jl; h5o_close; line: 2006
   537  ....4/JLD/src/JLD.jl; close; line: 151
    536 ...HDF5/src/plain.jl; h5f_close; line: 2006
    1    ....4/JLD/src/JLD.jl; close; line: 156

# patched
  3521 ./client.jl; _start; line: 378
  ...
  stuff
  ...
    2272 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
     2    ...0.4/JLD/src/JLD.jl; close; line: 143
     2269 ...0.4/JLD/src/JLD.jl; close; line: 151
      2266 ...HDF5/src/plain.jl; h5f_close; line: 2006

There is something strange going on here. Over a total of about 3300-3500 samples collected in the two cases, roughly the same number, ~2100-2200, go to jldopen at line 267, as you would expect from the similar timings. However, for the patched version, all of that time is spent at line 151, whereas for the original version, only about a quarter of the samples go to that line.

Either profiling the C code does something nasty, or there is a real shift in where time is spent when those three lines are commented out.

Edit: added version info.

Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

gasagna avatar Aug 19 '16 18:08 gasagna

I have tested the code from the original issue on a Linux machine with HDF5 1.10. The same behaviour occurs.

gasagna avatar Aug 30 '16 17:08 gasagna

I have tried using Python's h5py to read the same files, to see whether the issue is in the HDF5 library itself or in HDF5.jl/JLD.jl.

The Julia code to write and read the files is:

using JLD
using Iterators

function make_files(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
    println()
end

function read_files(dir, N)
    files = readdir(dir)
    s = 0.0
    for chunk in partition(files, N)
        t = @elapsed for file in chunk
            X = JLD.load(joinpath(dir, file), "X")::Matrix{Float64}
            s += sum(X)
        end
        @printf "%.5f\n" t
    end
    s
end

# make_files(10000, "jld-tmp")
println(read_files("jld-tmp", 500))

The Python code is:

import numpy as np
import os
import h5py
import time

def chunks(list, N):
    for i in xrange(0, len(list), N):
        yield list[i:i+N]

def read_files(N, dir):
    files = os.listdir(dir)
    s = 0.0
    for chunk in chunks(files, N):
        tic = time.time()
        for file in chunk:
            fh = h5py.File(os.path.join(dir, file), "r")
            X = fh["X"]
            s += np.sum(X)
        toc = time.time()
        print "%.5f" % (toc-tic)

    return s

print read_files(500, "jld-tmp")

Timings are as follows:

Julia

0.57577
0.13256
0.15173
0.17690
0.18536
0.18588
0.17711
0.19230
0.19699
0.19837
0.21121
0.21436
0.22501
0.22496
0.23954
0.24651
0.27409
0.26905
0.28692
0.30366

so the time to read each batch of files increases with the number of files already read.

Python

0.15729
0.15792
0.18449
0.19421
0.16328
0.15744
0.15732
0.15698
0.15875
0.15828
0.16598
0.16010
0.15951
0.15864
0.15954
0.16478
0.15948
0.16920
0.16462
0.16515

so the time stays constant, and is a bit higher than for Julia.

Both Python and Julia use the same version of HDF5, 1.8.16.

gasagna avatar Sep 11 '16 08:09 gasagna

(Friendly bump) It seems that h5py handles thousands of files quite well. This would suggest that the problem might be in the way we use the HDF5 library. Unfortunately, I do not have enough breadth to go into the details of HDF5.jl and attempt to fix this bug myself, but if there is anything you think I can try, please let me know.

gasagna avatar Sep 12 '16 11:09 gasagna

This makes me wonder if there isn't some state that we're inappropriately sharing across HDF5 files.

StefanKarpinski avatar Sep 12 '16 15:09 StefanKarpinski

I haven't seen any code where we explicitly share state.

However, the HDF5 library maintains caches. And in particular, we use the "delayed close" flag when opening files, since this seems like a good idea in the presence of garbage collection, as it allows performing certain clean-up actions automatically during garbage collection (RAII in C++ terms). This makes HDF5 much easier to use -- otherwise, one has to manually "close" each HDF5 object (dataset, dataspace, datatype, attribute, group, ...), which is a non-starter.

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

A better implementation might make use of a (future) Julia feature to finalize objects more eagerly. Another approach would be to manually keep back-pointers from files to objects that live in that file, and manually close all these when the file is closed.
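For what it's worth, the closest a user can get to eager cleanup today is the do-block form, which at least closes the file handle deterministically when the block exits (a sketch with a made-up file name; whether it sidesteps the slowdown is exactly the open question):

```julia
using JLD

# The JldFile is closed when the do-block returns, rather than whenever
# the garbage collector happens to run its finalizer.
X = jldopen("00001.jld", "r") do file
    read(file, "X")
end
```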

eschnett avatar Sep 12 '16 16:09 eschnett

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

This seems doubtful to me, as calling gc() explicitly does not seem to speed anything up.

Here's my code:

import JLD

function main(fn; enable_gc=false)
    @show enable_gc
    JLD.save(fn, "data", rand(256))
    for i in 1:10
        enable_gc && gc()
        @show i
        @time for j in 1:2000
            JLD.load(fn, "data")
        end
    end
end

main("/tmp/a.jld", enable_gc=true)

And my output (for both values of enable_gc):

enable_gc = false
i = 1
  1.028825 seconds (445.49 k allocations: 26.563 MB, 5.23% gc time)
i = 2
  1.047420 seconds (366.00 k allocations: 23.163 MB, 0.90% gc time)
i = 3
  1.989134 seconds (366.00 k allocations: 23.163 MB, 0.55% gc time)
i = 4
  3.006954 seconds (366.00 k allocations: 23.163 MB, 0.39% gc time)
i = 5
  3.987207 seconds (366.00 k allocations: 23.163 MB, 0.28% gc time)
i = 6
  4.786223 seconds (366.00 k allocations: 23.163 MB, 0.25% gc time)
i = 7
  5.564047 seconds (366.00 k allocations: 23.163 MB, 0.21% gc time)
i = 8
  6.341545 seconds (366.00 k allocations: 23.163 MB, 0.18% gc time)
i = 9
  7.257423 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  8.006855 seconds (366.00 k allocations: 23.163 MB, 0.15% gc time)

enable_gc = true
i = 1
  1.019273 seconds (445.49 k allocations: 26.563 MB, 0.73% gc time)
i = 2
  1.168291 seconds (366.00 k allocations: 23.163 MB, 0.75% gc time)
i = 3
  2.006945 seconds (366.00 k allocations: 23.163 MB, 0.44% gc time)
i = 4
  3.047479 seconds (366.00 k allocations: 23.163 MB, 0.32% gc time)
i = 5
  3.787296 seconds (366.00 k allocations: 23.163 MB, 0.27% gc time)
i = 6
  4.606675 seconds (366.00 k allocations: 23.163 MB, 0.22% gc time)
i = 7
  5.297919 seconds (366.00 k allocations: 23.163 MB, 0.19% gc time)
i = 8
  6.101852 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 9
  6.809118 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  7.578225 seconds (366.00 k allocations: 23.163 MB, 0.12% gc time)

Since I'm opening the same file repeatedly and still see a slowdown (at least in this test case), it seems that HDF5 "caches" are providing no benefit even in a use case where they might be expected to. I think it would be better if JLD.load fully closed the HDF5 file.
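One further way to narrow this down (I have not run it) would be to repeat the same loop through plain HDF5.jl with an explicit close, bypassing JLD's close path entirely; since .jld files are ordinary HDF5 files, a plain array dataset can be read directly:

```julia
using HDF5

# Same shape as the benchmark above, but each iteration opens, reads and
# explicitly closes the file through HDF5.jl instead of JLD.load.
function main_hdf5(fn)
    for i in 1:10
        @show i
        @time for j in 1:2000
            fid = h5open(fn, "r")
            read(fid, "data")
            close(fid)
        end
    end
end

main_hdf5("/tmp/a.jld")
```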

garrison avatar Dec 30 '16 19:12 garrison

See also: JuliaIO/HDF5.jl#349

garrison avatar Nov 22 '17 00:11 garrison