JLD.jl
Possibly unreasonable storage size of Array{Tuple{Int64,Int64,Float64},1}:
Dearests,
I want to store an array x::Array{Tuple{Int64,Int64,Float64},1}. Here is the result:
julia> x
10000-element Array{Tuple{Int64,Int64,Float64},1}
now I @save "x.jld" x
and the size of the file is
-rw-r--r-- 1 pagnani staff 4096424 Oct 15 17:02 x.jld
If I now define xmat = [x[i][j] for i=1:length(x), j=1:3], which gives an Array{Any,2}, and run
writedlm("xmat.txt", xmat)
I get
260652 Oct 15 17:07 xmat.txt
In other words, the jld file is almost 16 times larger. Note that 10000 * 3 * 8 / 1024 ≈ 234 KB should be the theoretical lower bound, which is almost achieved by writedlm!
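The arithmetic behind these figures can be checked with a short sketch. The byte counts are copied from the listings above; the variable names are only illustrative, and the 8-bytes-per-field payload assumes the tuple fields are stored without padding.

```julia
# File sizes in bytes, copied from the directory listings in this report.
jld_bytes = 4096424      # x.jld
txt_bytes = 260652       # xmat.txt
n = 10000                # number of tuples in x

# Raw payload: each tuple holds two Int64 and one Float64, 8 bytes each.
raw_bytes = n * 3 * 8

println("raw payload (bytes): ", raw_bytes)
println("raw payload (KB):    ", raw_bytes / 1024)
println("jld / txt ratio:     ", jld_bytes / txt_bytes)
println("jld / raw ratio:     ", jld_bytes / raw_bytes)
println("jld bytes per tuple: ", jld_bytes / n)
```

The last figure is the most telling one: the file spends roughly 400 bytes on each tuple whose payload is 24 bytes.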
It might be a known issue with tuples, but I was not able to find it either here or on the mailing list. In case it is already known ... sorry for the noise.
Thanks a lot for your work
Andrea
... note also that, given xmat = [x[i][j] for i=1:length(x), j=1:3], running @save "xmat.jld" xmat gives me a scary
-rw-r--r-- 1 pagnani staff 11976712 Oct 15 17:21 xmat.jld
i.e. almost 46 times larger than the writedlm result.
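One possible way around the Array{Any,2} case is to give the matrix a concrete element type, so that it is a plain block of 8-byte values. The following is a minimal sketch, not something proposed in this thread: it uses made-up sample data in place of the real x, is written for a recent Julia, and assumes the integer fields are small enough (below 2^53 in magnitude) to be represented exactly as Float64.

```julia
# Made-up sample data with the same element type as x in the report.
x = [(i, 2i, 0.5i) for i in 1:10000]

# Typed comprehension: every entry is converted to Float64, so the result
# is a Matrix{Float64} rather than a matrix of boxed values.
xmat = Float64[x[i][j] for i in 1:length(x), j in 1:3]

println(eltype(xmat))   # element type of the matrix
println(sizeof(xmat))   # bytes occupied by the matrix data

# The integer columns can be recovered exactly under the assumption above.
first_col = [Int(v) for v in xmat[:, 1]]
println(first_col == [t[1] for t in x])
```

Whether this is acceptable depends on the data: mixing integer and floating-point columns in one Float64 matrix loses the distinction between them in the stored file.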
... and I forgot to say that I am on a
Julia Version 0.4.1-pre+1
Commit 70b42a6* (2015-10-08 06:21 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
....
- BinDeps 0.3.18
- CUDA 0.1.0
- Docile 0.5.19
- Gadfly 0.3.17
- Graphs 0.5.6
- Gtk 0.9.2+ master
- Gurobi 0.1.30
- HDF5 0.5.6
- Immerse 0.0.8 master
- JLD 0.5.5
- Jewel 1.0.7
- JuMP 0.10.2
- LsqFit 0.0.2
- MAT 0.2.13
- Markdown 0.3.0
- MathProgBase 0.3.18
- NLsolve 0.4.0+ master
- ProfileView 0.1.1
- ProgressMeter 0.2.1
- PyCall 1.1.2
- PyPlot 2.1.1
55 additional packages:
- ArrayViews 0.6.4
- Blosc 0.1.4
- Cairo 0.2.31
- Calculus 0.1.13
- Codecs 0.1.5
- ColorTypes 0.1.7
- Colors 0.5.4
- Compat 0.7.6
- Compose 0.3.17
- Conda 0.1.7
- Contour 0.0.8
- DataArrays 0.2.19
- DataFrames 0.6.10
- DataStructures 0.3.13
- Dates 0.4.4
- Distances 0.2.1
- Distributions 0.8.7
- DualNumbers 0.1.5
- FactCheck 0.4.1
- FastaIO 0.1.4
- FixedPointNumbers 0.0.12
- GZip 0.2.18
- GaussDCA 0.0.0- master (unregistered)
- Graphics 0.1.3
- Grid 0.4.0
- GtkUtilities 0.0.6
- Hexagons 0.0.4
- Homebrew 0.1.16+ master
- ImmutableArrays 0.0.11
- Iterators 0.1.9
- JSON 0.5.0
- JuliaParser 0.6.3
- KernelDensity 0.1.2
- LNR 0.0.1
- LaTeXStrings 0.1.6
- Lazy 0.10.0
- Loess 0.0.4
- MacroTools 0.2.0
- MacroUtils 0.0.0- master (unregistered)
- NLopt 0.2.3
- NaNMath 0.1.1
- Optim 0.4.3
- PDMats 0.3.6
- PlmDCA 0.0.0- master (unregistered)
- Reexport 0.0.3
- Requires 0.2.0
- ReverseDiffSparse 0.2.11
- SHA 0.1.2
- Showoff 0.0.6
- SortingAlgorithms 0.0.6
- StatsBase 0.7.4
- StatsFuns 0.1.4
- URIParser 0.1.1
- WoodburyMatrices 0.1.2
- Zlib 0.1.10
If you look in jld_types.jl, you'll see there's an INLINE_TUPLE option to save tuples in compressed format, but that's currently off. You might experiment with what happens when you turn it on?
Alternatively, see https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#custom-serialization
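A third option, which avoids touching JLD internals, is to split the array of tuples into one typed vector per field before saving and to zip the vectors back together after loading. The sketch below is an assumption-laden illustration rather than the custom-serialization mechanism from the linked document: the sample data and the names is, js, vs are invented, the code targets a recent Julia, and the JLD call is left as a comment because it is not run here.

```julia
# Made-up sample data with the same element type as in the report.
x = [(i, 2i, 0.5i) for i in 1:10000]

# Split into one concretely typed vector per tuple field.
is = [t[1] for t in x]   # Vector{Int64}
js = [t[2] for t in x]   # Vector{Int64}
vs = [t[3] for t in x]   # Vector{Float64}

# With JLD loaded, the three vectors could then be saved instead of x,
# for example (untested here): @save "x_cols.jld" is js vs

# After loading, the original array of tuples is rebuilt by zipping.
x_back = collect(zip(is, js, vs))
println(x_back == x)
```

Each vector is a plain numeric array, so its storage should not depend on how tuples are serialized.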
Hi Tim,
I am really sorry to keep bothering you every now and then with different problems. But, as usual, the solution you provided, i.e. setting the INLINE_TUPLE flag to true, makes a huge difference.
-rw-r--r-- 1 pagnani staff 243424 Oct 30 07:32 test_tuple_inline.jld
-rw-r--r-- 1 pagnani staff 4096424 Oct 30 07:33 test_tuple_noinline.jld
Is there any reason not to set it to true by default?
Thanks a lot.
Does it pass all the tests? I haven't tried.
I don't know if it passes all the tests, but you can't just flip the switch, since it will break reading older JLD files.
Indeed it does not pass the tests.
For the time being I'm ok with the flag solution. Thanks a lot as usual.
It might however be useful to leave this issue open for future reference if you decide to make the tuple inline a default.
In JLD2 (which I have to finish at some point; it is mostly done, just needs compression and maybe user-defined groups), this is already what happens. Also fields that aren't stored inline are orders of magnitude faster...
Is JLD2 going to supersede JLD? Or are the improvements going to be ported to JLD? I was under the impression that JLD would be a sort of standard format for Julia (whatever standard might mean). I'm asking because we are about to launch a fairly extensive bioinformatics project with heavy IO, and I'm wondering whether storing everything in .jld is a good idea right now.
The code in the JLD2 repository will eventually move here, but the JLD API won't change much, and we'll always maintain the ability to read older JLD files.
Good to know!
Completely off-topic: MATLAB has its own standard format for storing data (.mat), and so do other languages. Wouldn't it be a good idea to eventually move JLD into Julia Base (as a module, of course)? Has this already been discussed somewhere?
Thanks
JLD basically is the standard format for Julia.
The trend seems to be to move things from Base into packages, so that over time julia will become "just the language" and everything else is a package. This even includes things like LAPACK/matrix operations.