JLD.jl

stored file size

Open aeroflux opened this issue 8 years ago • 7 comments

I have been trying to convert a 256 MB .dat text file into a .jld file holding a 2D array, but the result is a 19 GB file. The array stored in the text file is 73 x 600008. In the code snippet below the data is stored as Array{Any,2}, but I have also run a routine that separates and converts the data into String/Float32 1D/2D arrays, with the same result. MAT.jl saves the same data as a 1.5 GB file.

    using HDF5, JLD

    data = readdlm(location_in, '\t')
    jldopen(location_out, "w") do file
        write(file, "all_data", data)
    end

Is there anything I need to specify or watch out for?

aeroflux avatar Feb 05 '17 15:02 aeroflux

You should avoid storing Any with JLD; it takes significantly more space that way. If you convert your arrays to concrete element types, the file will be significantly more compact and will store and load faster.
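As a minimal sketch of that conversion (the small matrix here is a hypothetical stand-in for the Matrix{Any} that readdlm returns, and Float64 is an assumed target type), something like the following should work when every element is numeric:

```julia
# Hypothetical stand-in for an untyped table; the real data would come from readdlm.
data = Any[1.5 2.0; 3.25 4.0]

# Convert to a concrete element type. This throws if any element
# cannot be converted to Float64 (for example, a string).
data_f64 = convert(Matrix{Float64}, data)

# Alternative: let broadcasting narrow the element type to the
# tightest type that fits all elements.
data_narrow = identity.(data)
```

The converted array, rather than the original Any array, is what would then be passed to write. If the columns mix strings and numbers, a single concrete element type does not exist and the columns have to be split first.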

maximsch2 avatar Feb 07 '17 21:02 maximsch2

I also have problems with JLD producing files that are too large (I killed it with Ctrl-C when the file had grown to 19GB and was still growing; see below, the table itself is only 1.8MB in memory). Can you elaborate on how to cast from Any to something more compact, given that each row of the array consists of both Int and String values?

julia> sizeof(table_sub)
1806048
julia> map(typeof, table_sub[2,:])
36-element Array{Any,1}:
 SubString{String}
 Int64
 SubString{String}
 Int64
... (basically repeats with either Int64 or String for a total of 36)

I am confused because typeof apparently understands the actual type of each element, yet at the top level I see "36-element Array{Any,1}". What do I do here?

dkdog avatar Jun 04 '17 06:06 dkdog

There are two things you can do:

  • Put odd columns into strarray::Matrix{String} and even columns into intarray::Matrix{Int}.
  • Someone who cares about strings should improve JLD's treatment of SubString. It seems to save SubString literally, meaning it re-saves the entire "parent" string:
julia> using FileIO

julia> strs = split("If on a winter's night a traveler")
7-element Array{SubString{String},1}:
 "If"      
 "on"      
 "a"       
 "winter's"
 "night"   
 "a"       
 "traveler"

julia> save("/tmp/substr.jld", "strs", strs)
$ h5dump /tmp/substr.jld 
HDF5 "/tmp/substr.jld" {
GROUP "/" {
   GROUP "_creator" {
      DATASET "ENDIAN_BOM" {
         DATATYPE  H5T_STD_U32LE
         DATASPACE  SCALAR
         DATA {
         (0): 67305985
         }
      }
      DATASET "JULIA_MAJOR" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
      DATASET "JULIA_MINOR" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SCALAR
         DATA {
         (0): 6
         }
      }
      DATASET "JULIA_PATCH" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
      DATASET "JULIA_PRERELEASE" {
         DATATYPE  H5T_REFERENCE { H5T_STD_REF_OBJECT }
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
         (0): DATASET 3696 /_refs/00000001 , DATASET 4056 /_refs/00000002 
         }
         ATTRIBUTE "julia eltype" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "Core.Any"
            }
         }
      }
      DATASET "WORD_SIZE" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SCALAR
         DATA {
         (0): 64
         }
      }
   }
   GROUP "_refs" {
      DATASET "00000001" {
         DATATYPE  H5T_STRING {
            STRSIZE 3;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "rc2"
         }
      }
      DATASET "00000002" {
         DATATYPE  H5T_STD_I64LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
      DATASET "00000003" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               0,
               2
            }
         }
      }
      DATASET "00000004" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               3,
               2
            }
         }
      }
      DATASET "00000005" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               6,
               1
            }
         }
      }
      DATASET "00000006" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               8,
               8
            }
         }
      }
      DATASET "00000007" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               17,
               5
            }
         }
      }
      DATASET "00000008" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               23,
               1
            }
         }
      }
      DATASET "00000009" {
         DATATYPE  "/_types/00000001"
         DATASPACE  SCALAR
         DATA {
         (0): {
               "If on a winter's night a traveler",
               25,
               8
            }
         }
      }
   }
   GROUP "_types" {
      DATATYPE "00000001" H5T_COMPOUND {
         H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         } "string_";
         H5T_STD_I64LE "offset_";
         H5T_STD_I64LE "endof_";
      }
         ATTRIBUTE "julia type" {
            DATATYPE  H5T_STRING {
               STRSIZE 27;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "Base.SubString{Core.String}"
            }
         }
   }
   DATASET "strs" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_OBJECT }
      DATASPACE  SIMPLE { ( 7 ) / ( 7 ) }
      DATA {
      (0): DATASET 6752 /_refs/00000003 , DATASET 11216 /_refs/00000004 ,
      (2): DATASET 11584 /_refs/00000005 , DATASET 11952 /_refs/00000006 ,
      (4): DATASET 12320 /_refs/00000007 , DATASET 12688 /_refs/00000008 ,
      (6): DATASET 13056 /_refs/00000009 
      }
      ATTRIBUTE "julia eltype" {
         DATATYPE  H5T_STRING {
            STRSIZE 27;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "Base.SubString{Core.String}"
         }
      }
   }
}
}
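The first suggestion above could be sketched roughly as follows. The table contents and the names words and table are made up for illustration; only strarray and intarray come from the suggestion itself. Converting each SubString with String(...) also sidesteps the parent-string duplication visible in the dump.

```julia
# Hypothetical 2x4 table with alternating string and integer columns.
words = split("chr1 chr2")                 # Vector{SubString{String}}
table = Any[words[1] 10 words[1] 20;
            words[2] 30 words[2] 40]

# Odd columns: copy each SubString into a standalone String,
# which drops the reference to the parent string.
strarray = String[String(s) for s in table[:, 1:2:end]]

# Even columns: collect into a concretely typed integer matrix.
intarray = Int[x for x in table[:, 2:2:end]]
```

Both results are concretely typed matrices (Matrix{String} and Matrix{Int}) that can be written as two separate datasets.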

timholy avatar Jun 04 '17 08:06 timholy

I wasn't able to get to the bottom of this during the JuliaCon 2017 hackathon. My hunch at the moment (please tell me if I am wrong) is that JLD.jl is quite thin. I could not find any logic in JLD.jl that specifically mentions individual types. It is broken down into general categories (basic, array, references, ...) and the rest seems to be handled by HDF5.

For SubArray, each SubArray is stored as an H5T_COMPOUND that consists of two references: one to the original Array (i.e. it has an H5T_REFERENCE to the parent) and a second one to a pair of offsets. That is just enough information to represent a view into the original Array.

For SubString, each SubString is stored as an H5T_COMPOUND that simply stores an H5T_STRING (this is what causes the whole original string to be copied verbatim for every element) plus the two offsets.
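A small sketch of why a field-by-field writer ends up with the whole parent: every SubString carries its parent in a field. Note that `string` is an internal field name, used here only for illustration, and that parent_str, first_parent, and standalone are names introduced for this example.

```julia
parent_str = "If on a winter's night a traveler"
strs = split(parent_str)          # Vector{SubString{String}}

# Each SubString holds a reference to the full parent string
# (internal field, shown for illustration only).
first_parent = strs[1].string

# String(...) copies only the characters the view covers,
# so the result no longer refers to the parent.
standalone = String.(strs)
```

Here sizeof(standalone[1]) is 2 bytes ("If"), while the parent reachable from strs[1] is the full sentence.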

But even in HDF5.jl, I couldn't find specific mentions of either SubArray or SubString. There must be something more general--more abstracted--that's detecting parents ... (interesting ... I will keep investigating this route).

If someone knows any tips, please let me know.

dkdog avatar Jun 24 '17 23:06 dkdog

OK. My earlier comment was wrong, so I deleted it. The definitions below show that both SubArray and SubString hold a reference to their parent.

from base/subarray.jl:

    struct SubArray{T,N,P,I,L} <: AbstractArray{T,N}
        parent::P
        indexes::I
        offset1::Int  # for linear indexing and pointer, only valid when L==true
        stride1::Int  # used only for linear indexing

from base/strings/types.jl:

    struct SubString{T<:AbstractString} <: AbstractString
        string::T
        offset::Int
        endof::Int

I guess it is up to me to find out where the saving logic really happens.

dkdog avatar Jun 24 '17 23:06 dkdog

@dkdog an option would be to write custom serializers for types you are interested in. E.g. https://gist.github.com/maximsch2/4257a23911b7fe71e5ec519fc23082ff

maximsch2 avatar Jun 24 '17 23:06 maximsch2

@maximsch2 has the right plan here. Last I checked there's a little bit of this in JLD itself (see how Associatives are handled), but you're right that JLD emphasizes generic methods that should work for "anything" rather than customizing the logic for different types. Of course, that means they'll be inefficient for some things, and views are a great example.

Julia's base serializer handles SubArrays by trimming them. We could do that in JLD, or we could take your suggestion and write the full array once and then use a reference for the rest.
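To illustrate what "trimming" means here, the sketch below uses plain Julia only. It is not JLD's or the serializer's actual code, and the names A, v, full, and trimmed are made up for the example.

```julia
A = reshape(collect(1:12), 3, 4)   # 3x4 parent array
v = view(A, 2:3, 1:2)              # SubArray that references all of A

# A generic field-by-field writer would follow this reference
# and store all 12 elements of the parent.
full = parent(v)

# Trimming: materialize only the viewed elements into a fresh Array,
# so just 4 elements would need to be written.
trimmed = copy(v)
```

The alternative mentioned above, writing the parent once and storing references to it, would keep the view semantics on load at the cost of always storing the full parent.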

timholy avatar Jun 25 '17 08:06 timholy