JLD.jl

load of Array{DateTime,1} getting slower inside loop

Open abieler opened this issue 9 years ago • 8 comments
jldTimings.zip

When loading an array of DateTimes with the load() function, the load time increases with every iteration. The same does not happen when loading an array of floats.

Attached are a Julia script and two data files to reproduce this behavior. Run the script with julia timeJLD.jl N, where N is the number of iterations.

myDates.jld holds the array of datetimes, named date. myArray.jld holds the array of floats, named yy.

I ran with N = 5k to 10k.

using HDF5
using JLD
using PyPlot

function myLoop(N, timings)
    for i = 1:N
        timings[i] = @elapsed tt = load(fileName, "date")
        #timings[i] = @elapsed tt = load(fileName, "yy")
    end
end

N = parse(Int, ARGS[1]) 
fileName = "myDates.jld"
#fileName = "myArray.jld"

timings = Array(Float64, N)

myLoop(N, timings)

figure()
semilogy(timings)
show()

In real life I load the content from different files of course... Cheers Andre

abieler avatar Apr 27 '16 17:04 abieler

I forgot to mention: I am on Linux and Julia v0.4.5.

abieler avatar Apr 27 '16 19:04 abieler

Sounds like this is the issue as well. Given how long it has been around, it seems they have marked it as "Do not fix" in JLD. Quite a shame, really. Sounds like you'll have to figure out how to use the plain HDF5 format instead, as well.

Skylion007 avatar May 05 '16 04:05 Skylion007

I now convert my dates to Unix time and save them as h5, then load them and convert back to dates with Dates.unix2datetime().

I attached timings for two versions of loading the data: first with h5read(), and second with opening the file for reading with fid = h5open() and then loading the data with read(fid, ...).

Not surprisingly, the last version is the fastest. For the first 1k loops the timing differences seem almost constant, but after ~10k iterations the JLD version is about 2 orders of magnitude slower. If I get to it, I'll do some profiling.
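Editor's sketch of the workaround described above, not the contents of newTimings.zip. It assumes Julia 1.x with HDF5.jl installed (on v0.4.5 Dates lives in Base and the dot-broadcast syntax is unavailable). The sample dates and the temporary file path are made up for illustration; only the dataset name "date" comes from the thread.

```julia
using Dates   # standard library on Julia 1.x
using HDF5    # assumes the HDF5.jl package is installed

# Hypothetical sample data standing in for the "date" array in myDates.jld.
dates = [DateTime(2016, 4, 27, 17) + Dates.Hour(i) for i in 0:9]
path = joinpath(mktempdir(), "myDates.h5")   # illustrative path

# Store the dates as Float64 seconds since the Unix epoch.
h5write(path, "date", Dates.datetime2unix.(dates))

# Version 1: h5read opens and closes the file on every call.
viaH5read = Dates.unix2datetime.(h5read(path, "date"))

# Version 2: open the file once and read through the handle.
viaHandle = h5open(path, "r") do fid
    Dates.unix2datetime.(read(fid, "date"))
end
```

DateTime has millisecond resolution, so the round trip through Float64 seconds should reproduce the original values for dates like these. Whether either version avoids the slowdown on a given setup would still have to be measured.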

newTimings.zip

abieler avatar May 05 '16 06:05 abieler

So most of the time is spent in h5f_get_obj_ids() in HDF5/src/plain.jl, at lines 2182 and 2186, which are ccalls to (:H5Fget_obj_count, libhdf5) and (:H5Fget_obj_ids, libhdf5) respectively.

So I am not sure anything can be done about this.

cheers andre

abieler avatar May 05 '16 07:05 abieler

Bless you, @abieler, for digging into this! So it's definitely the C library, not any of the julia code.

Try the trick in the last post of that issue, https://github.com/JuliaLang/HDF5.jl/issues/170#issuecomment-209399736?

timholy avatar May 05 '16 10:05 timholy

I am not sure it is the same problem. This one is about loading content from a small file many times; the other is about creating a file with lots of entries. I'll try it anyway, of course ;)

abieler avatar May 05 '16 10:05 abieler

Oh, I see (I didn't read carefully enough). You might consider using the "dictionary interface," https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#usage, so it doesn't waste time opening/closing the file frequently.
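Editor's sketch of what the suggestion could look like for the timing loop from the first post, assuming JLD.jl is installed. The file is created here with made-up data so the example is self-contained; the dataset name "yy" is taken from the thread, while the path and the iteration count are illustrative. It opens the file once with jldopen and reads through the handle instead of calling load() on every iteration.

```julia
using JLD   # assumes the JLD.jl package is installed

# Hypothetical stand-in for myArray.jld from the original report.
path = joinpath(mktempdir(), "myArray.jld")
save(path, "yy", rand(100))

N = 50                  # illustrative; the report used 5k to 10k
timings = zeros(N)

# Open once, read many times: the open/close cost is paid a single time.
jldopen(path, "r") do file
    for i in 1:N
        timings[i] = @elapsed read(file, "yy")
    end
end
```

This only removes the repeated open/close; it is not confirmed in the thread that it also removes the growth in load time for DateTime arrays, so the timings would need to be compared against the load() version.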

timholy avatar May 05 '16 11:05 timholy

Also appears similar to https://github.com/JuliaLang/julia/issues/17554

JeffBezanson avatar Jul 22 '16 16:07 JeffBezanson