JLD.jl

Possibly unreasonable storage size of Array{Tuple{Int64,Int64,Float64},1}:

Open pagnani opened this issue 9 years ago • 12 comments

Dearests,

I want to store an array x::Array{Tuple{Int64,Int64,Float64},1}. Here it is:

julia> x
10000-element Array{Tuple{Int64,Int64,Float64},1}

now I @save "x.jld" x and the size of the file is

-rw-r--r-- 1 pagnani staff 4096424 Oct 15 17:02 x.jld

If I now define xmat = [x[i][j] for i=1:length(x), j=1:3], which gives an Array{Any,2}, and do a writedlm("xmat.txt", xmat), I get

260652 Oct 15 17:07 xmat.txt

In other words, the JLD file is almost 16 times larger. Note that 10000 * 3 * 8 / 1024 ≈ 234 KB should be the theoretical lower bound for binary storage, which is almost achieved by writedlm!!

It might be a known issue with tuples, but I was not able to find it either here or on the mailing list. In case it is already known ... sorry for the noise.

Thanks a lot for your work

Andrea

pagnani avatar Oct 15 '15 15:10 pagnani
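The size arithmetic in the report above can be checked with a short sketch. It assumes Julia 1.x, rebuilds an array of the same type and length with made-up contents, and takes the two file sizes from the listings in the post; nothing here reads or writes a JLD file.

```julia
# Stand-in for the reporter's array: same element type and length, made-up values.
x = Tuple{Int64,Int64,Float64}[(i, 2i, rand()) for i in 1:10_000]

raw_bytes = sizeof(x)      # in-memory payload: 10_000 tuples * 24 bytes each
jld_bytes = 4_096_424      # size of x.jld as reported above
txt_bytes = 260_652        # size of xmat.txt as reported above

println("raw payload:          ", raw_bytes, " bytes (", raw_bytes / 1024, " KiB)")
println("x.jld bytes per tuple: ", jld_bytes / length(x))
println("x.jld / xmat.txt:      ", round(jld_bytes / txt_bytes, digits=2))
```

By these numbers the file spends roughly 410 bytes on each 24-byte tuple, which is consistent with every tuple being stored as a separate object rather than as packed data.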

... note also that, given xmat = [x[i][j] for i=1:length(x), j=1:3], @save "xmat.jld" xmat gives me a scary

-rw-r--r--  1 pagnani staff 11976712 Oct 15 17:21 xmat.jld

i.e. almost 46 times larger than the writedlm result.

pagnani avatar Oct 15 '15 15:10 pagnani
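One thing that stands out in the xmat case is the element type: mixing Int64 and Float64 entries in a single comprehension yields an abstractly typed matrix, where every entry is a separately boxed value. The sketch below (Julia 1.x, made-up data) shows this, along with a homogeneous Float64 alternative. That the abstract element type is what inflates xmat.jld is an assumption, not something confirmed in this thread.

```julia
# Stand-in data with the same element type as in the report.
x = Tuple{Int64,Int64,Float64}[(i, 2i, i / 3) for i in 1:10_000]

# Mixed Int64/Float64 entries force an abstract element type
# (Array{Any,2} on Julia 0.4; a non-concrete type on Julia 1.x as well).
xmat = [x[i][j] for i in 1:length(x), j in 1:3]
println(eltype(xmat), " is concrete? ", isconcretetype(eltype(xmat)))

# Converting every entry to Float64 gives a dense, homogeneous matrix.
# Note: Int64 values above 2^53 would lose precision in this conversion.
xmat64 = [Float64(x[i][j]) for i in 1:length(x), j in 1:3]
println(eltype(xmat64), ", ", sizeof(xmat64), " bytes of payload")
```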

... and I forgot to say that I am on a

Julia Version 0.4.1-pre+1
Commit 70b42a6* (2015-10-08 06:21 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
....
 - BinDeps                       0.3.18
 - CUDA                          0.1.0
 - Docile                        0.5.19
 - Gadfly                        0.3.17
 - Graphs                        0.5.6
 - Gtk                           0.9.2+             master
 - Gurobi                        0.1.30
 - HDF5                          0.5.6
 - Immerse                       0.0.8              master
 - JLD                           0.5.5
 - Jewel                         1.0.7
 - JuMP                          0.10.2
 - LsqFit                        0.0.2
 - MAT                           0.2.13
 - Markdown                      0.3.0
 - MathProgBase                  0.3.18
 - NLsolve                       0.4.0+             master
 - ProfileView                   0.1.1
 - ProgressMeter                 0.2.1
 - PyCall                        1.1.2
 - PyPlot                        2.1.1
55 additional packages:
 - ArrayViews                    0.6.4
 - Blosc                         0.1.4
 - Cairo                         0.2.31
 - Calculus                      0.1.13
 - Codecs                        0.1.5
 - ColorTypes                    0.1.7
 - Colors                        0.5.4
 - Compat                        0.7.6
 - Compose                       0.3.17
 - Conda                         0.1.7
 - Contour                       0.0.8
 - DataArrays                    0.2.19
 - DataFrames                    0.6.10
 - DataStructures                0.3.13
 - Dates                         0.4.4
 - Distances                     0.2.1
 - Distributions                 0.8.7
 - DualNumbers                   0.1.5
 - FactCheck                     0.4.1
 - FastaIO                       0.1.4
 - FixedPointNumbers             0.0.12
 - GZip                          0.2.18
 - GaussDCA                      0.0.0-             master (unregistered)
 - Graphics                      0.1.3
 - Grid                          0.4.0
 - GtkUtilities                  0.0.6
 - Hexagons                      0.0.4
 - Homebrew                      0.1.16+            master
 - ImmutableArrays               0.0.11
 - Iterators                     0.1.9
 - JSON                          0.5.0
 - JuliaParser                   0.6.3
 - KernelDensity                 0.1.2
 - LNR                           0.0.1
 - LaTeXStrings                  0.1.6
 - Lazy                          0.10.0
 - Loess                         0.0.4
 - MacroTools                    0.2.0
 - MacroUtils                    0.0.0-             master (unregistered)
 - NLopt                         0.2.3
 - NaNMath                       0.1.1
 - Optim                         0.4.3
 - PDMats                        0.3.6
 - PlmDCA                        0.0.0-             master (unregistered)
 - Reexport                      0.0.3
 - Requires                      0.2.0
 - ReverseDiffSparse             0.2.11
 - SHA                           0.1.2
 - Showoff                       0.0.6
 - SortingAlgorithms             0.0.6
 - StatsBase                     0.7.4
 - StatsFuns                     0.1.4
 - URIParser                     0.1.1
 - WoodburyMatrices              0.1.2
 - Zlib                          0.1.10

pagnani avatar Oct 15 '15 15:10 pagnani

If you look in jld_types.jl, you'll see there's an INLINE_TUPLE option to save tuples in compressed format, but that's currently off. You might experiment with what happens when you turn it on?

Alternatively, see https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#custom-serialization

timholy avatar Oct 29 '15 10:10 timholy
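As a rough illustration of the custom-serialization route, the sketch below converts the tuple array into three homogeneous columns and back. The names TripletColumns, pack and unpack are hypothetical and not part of JLD; the sketch assumes Julia 1.x syntax and deliberately leaves out the step that registers the conversion with JLD, since that is described in the linked document.

```julia
# Hypothetical column-wise container; this type is NOT part of JLD.
struct TripletColumns
    a::Vector{Int64}
    b::Vector{Int64}
    c::Vector{Float64}
end

# Split a vector of (Int64, Int64, Float64) tuples into three typed columns.
pack(x::Vector{Tuple{Int64,Int64,Float64}}) =
    TripletColumns(Int64[t[1] for t in x],
                   Int64[t[2] for t in x],
                   Float64[t[3] for t in x])

# Rebuild the original vector of tuples from the columns.
unpack(s::TripletColumns) = collect(zip(s.a, s.b, s.c))

x = Tuple{Int64,Int64,Float64}[(i, 2i, i / 3) for i in 1:10_000]
s = pack(x)
println(unpack(s) == x)   # the round trip should reproduce the input
```

The design choice is that each column is a plain array of a primitive numeric type, which array-oriented file formats generally store as one contiguous block.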

Hi Tim,

I'm really sorry to keep bothering you every now and then with different problems. But, as usual, the solution you provided, i.e. setting the INLINE_TUPLE flag to true, makes a huge difference.

-rw-r--r-- 1 pagnani staff  243424 Oct 30 07:32 test_tuple_inline.jld
-rw-r--r-- 1 pagnani staff 4096424 Oct 30 07:33 test_tuple_noinline.jld

Is there any reason not to set it to true by default? Thanks a lot.

pagnani avatar Oct 30 '15 06:10 pagnani

Does it pass all the tests? I haven't tried.

timholy avatar Oct 30 '15 09:10 timholy

I don't know if it passes all the tests, but you can't just flip the switch, since it will break reading older JLD files.

simonster avatar Oct 30 '15 13:10 simonster

Indeed it does not pass the tests.

For the time being I'm ok with the flag solution. Thanks a lot as usual.

It might, however, be useful to leave this issue open for future reference in case you decide to make inline tuple storage the default.

pagnani avatar Oct 30 '15 15:10 pagnani

In JLD2 (which I have to finish at some point; it is mostly done, just needs compression and maybe user-defined groups), this is already what happens. Also fields that aren't stored inline are orders of magnitude faster...

simonster avatar Oct 30 '15 15:10 simonster

Is JLD2 going to supersede JLD? Or are the improvements going to be ported to JLD? I was under the impression that JLD would be a sort of standard format for Julia (whatever standard might mean). I'm asking because we are going to launch a pretty extensive project in bioinformatics with heavy I/O, and I'm wondering whether storing everything in .jld is a good idea right now.

pagnani avatar Oct 30 '15 15:10 pagnani

The code in the JLD2 repository will eventually move here, but the JLD API won't change much, and we'll always maintain the ability to read older JLD files.

simonster avatar Oct 30 '15 15:10 simonster

Good to know!

Completely off-topic: MATLAB has its own standard format to store data (.mat), and so do other languages. Wouldn't it be a good idea to eventually move JLD into Julia Base (as a module, of course)? Has this already been discussed somewhere?

Thanks

pagnani avatar Oct 30 '15 16:10 pagnani

JLD basically is the standard format for Julia.

The trend seems to be to move things from Base into packages, so that over time julia will become "just the language" and everything else is a package. This even includes things like LAPACK/matrix operations.

timholy avatar Oct 30 '15 16:10 timholy