JLD.jl
stored file size
I have been trying to convert a 256 MB .dat text file into a .jld file holding a 2D array, but I end up with a 19GB file. The array stored in the text file is 73 x 600008. In the code snippet below the data is stored as Array{Any,2}, but I have also run a routine that separates the data and converts it into String/Float32 1D/2D arrays, with the same result. MAT.jl saves the same data as a 1.5GB file.
using HDF5, JLD
data = readdlm(location_in,'\t')
jldopen(location_out, "w") do file
    write(file, "all_data", data)
end
Is there anything I need to specify or watch out for?
You should avoid storing Any using JLD; it takes significantly more space that way. If you cast your arrays to concrete types, the file will be significantly more compact and will store/load faster.
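As a minimal sketch of that advice, assuming every element really is numeric: the small matrix below is a made-up stand-in for what readdlm returns, and Float32 is only one possible target type. The conversion itself needs nothing beyond Base.

```julia
# Toy stand-in (assumption) for a readdlm result whose element type is Any.
data = Any[1.0 2.5; 3.0 4.5]

# Convert to a concrete element type. This throws an error if any element
# cannot be converted to Float32 (for example, a stray string).
data32 = convert(Matrix{Float32}, data)

println(eltype(data), " => ", eltype(data32))
```

With JLD loaded, data32 would then be passed to write in place of data.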
I too have issues with JLD storing files that are too large (I killed it with Ctrl-C when the file had grown to 19GB and was still growing; see below, the table itself is only 1.8MB in memory). Can you elaborate on how to cast from Any to something more space-saving, given that each row of the array consists of both Int and String?
julia> sizeof(table_sub)
1806048
julia> map(typeof, table_sub[2,:])
36-element Array{Any,1}:
SubString{String}
Int64
SubString{String}
Int64
... (basically repeats with either Int64 or String for a total of 36)
I am confused because typeof apparently understands the actual type of each element, but at the top level I see "36-element Array{Any,1}". What do I do here?
There are two things you can do:
- Put odd columns into strarray::Matrix{String} and even columns into intarray::Matrix{Int}.
- Someone who cares about strings should improve JLD's treatment of SubString. It seems to save SubString literally, meaning it re-saves the entire "parent" string:
julia> using FileIO
julia> strs = split("If on a winter's night a traveler")
7-element Array{SubString{String},1}:
"If"
"on"
"a"
"winter's"
"night"
"a"
"traveler"
julia> save("/tmp/substr.jld", "strs", strs)
$ h5dump /tmp/substr.jld
HDF5 "/tmp/substr.jld" {
GROUP "/" {
GROUP "_creator" {
DATASET "ENDIAN_BOM" {
DATATYPE H5T_STD_U32LE
DATASPACE SCALAR
DATA {
(0): 67305985
}
}
DATASET "JULIA_MAJOR" {
DATATYPE H5T_STD_I64LE
DATASPACE SCALAR
DATA {
(0): 0
}
}
DATASET "JULIA_MINOR" {
DATATYPE H5T_STD_I64LE
DATASPACE SCALAR
DATA {
(0): 6
}
}
DATASET "JULIA_PATCH" {
DATATYPE H5T_STD_I64LE
DATASPACE SCALAR
DATA {
(0): 0
}
}
DATASET "JULIA_PRERELEASE" {
DATATYPE H5T_REFERENCE { H5T_STD_REF_OBJECT }
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): DATASET 3696 /_refs/00000001 , DATASET 4056 /_refs/00000002
}
ATTRIBUTE "julia eltype" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "Core.Any"
}
}
}
DATASET "WORD_SIZE" {
DATATYPE H5T_STD_I64LE
DATASPACE SCALAR
DATA {
(0): 64
}
}
}
GROUP "_refs" {
DATASET "00000001" {
DATATYPE H5T_STRING {
STRSIZE 3;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "rc2"
}
}
DATASET "00000002" {
DATATYPE H5T_STD_I64LE
DATASPACE SCALAR
DATA {
(0): 0
}
}
DATASET "00000003" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
0,
2
}
}
}
DATASET "00000004" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
3,
2
}
}
}
DATASET "00000005" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
6,
1
}
}
}
DATASET "00000006" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
8,
8
}
}
}
DATASET "00000007" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
17,
5
}
}
}
DATASET "00000008" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
23,
1
}
}
}
DATASET "00000009" {
DATATYPE "/_types/00000001"
DATASPACE SCALAR
DATA {
(0): {
"If on a winter's night a traveler",
25,
8
}
}
}
}
GROUP "_types" {
DATATYPE "00000001" H5T_COMPOUND {
H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
} "string_";
H5T_STD_I64LE "offset_";
H5T_STD_I64LE "endof_";
}
ATTRIBUTE "julia type" {
DATATYPE H5T_STRING {
STRSIZE 27;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "Base.SubString{Core.String}"
}
}
}
DATASET "strs" {
DATATYPE H5T_REFERENCE { H5T_STD_REF_OBJECT }
DATASPACE SIMPLE { ( 7 ) / ( 7 ) }
DATA {
(0): DATASET 6752 /_refs/00000003 , DATASET 11216 /_refs/00000004 ,
(2): DATASET 11584 /_refs/00000005 , DATASET 11952 /_refs/00000006 ,
(4): DATASET 12320 /_refs/00000007 , DATASET 12688 /_refs/00000008 ,
(6): DATASET 13056 /_refs/00000009
}
ATTRIBUTE "julia eltype" {
DATATYPE H5T_STRING {
STRSIZE 27;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "Base.SubString{Core.String}"
}
}
}
}
}
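The first suggestion above (splitting the table by column) might look roughly like the following. This is a sketch under assumptions: the 2 x 4 table is a made-up stand-in for table_sub, and it assumes odd columns hold only strings and even columns only integers. Converting with String also drops each SubString's link to its parent string.

```julia
# Toy stand-in (assumption) for table_sub: odd columns are SubString{String},
# even columns are Int64, and the container's element type is Any.
fields = split("a b c d")
table = Any[fields[1] 1 fields[2] 2;
            fields[3] 3 fields[4] 4]

# Odd columns become plain Strings, which no longer reference a parent string.
strarray = String.(table[:, 1:2:end])

# Even columns become a concretely typed integer matrix.
intarray = Int.(table[:, 2:2:end])

println(typeof(strarray), " and ", typeof(intarray))
```

The two matrices can then be saved under separate names instead of the single Any-typed table.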
So, I wasn't able to get to the bottom of this during the JuliaCon 2017 hackathon. However, my hunch at the moment (please tell me if I am wrong) is that JLD.jl is quite thin. I could not find any logic in JLD.jl that specifically mentions these types. It is broken down into general categories (basic, array, references ...) and the rest seems to be handled by HDF5.
For SubArray, each SubArray is stored as an H5T_COMPOUND that consists of two references: one to the original Array (i.e. it has an H5T_REFERENCE to the parent) and a second one to a pair of offsets; that's just enough information to represent a view into the original Array.
For SubString, each SubString is stored as an H5T_COMPOUND that simply stores an H5T_STRING (this is what causes the whole original string to be copied verbatim for every element) and the two offsets.
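To get a feel for the cost of that layout, here is a rough sketch using the same sentence as the h5dump example. It reads the string field of SubString directly, which is an internal detail taken from the struct definition rather than a public API, and the byte counts only approximate what ends up in the file.

```julia
whole = "If on a winter's night a traveler"
strs = split(whole)                # 7 SubStrings that all share `whole`

# Bytes of string data if every element carries its full parent string,
# as the dump suggests JLD does. `s.string` is an internal field.
with_parents = sum(sizeof(s.string) for s in strs)

# Bytes of string data after copying each piece into its own String.
plain = String.(strs)
without_parents = sum(sizeof, plain)

println(with_parents, " bytes vs ", without_parents, " bytes")
```

The gap grows with the length of the parent string, which is why a table split out of long lines can blow up so badly.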
But even in HDF5.jl, I couldn't find specific mentions of either SubArray or SubString. There must be something more general--more abstracted--that's detecting parents ... (interesting ... I will keep investigating this route).
If someone knows any tips, please let me know.
OK. My earlier comment was wrong, so I deleted it. It should have said that both SubArray and SubString contain references to the parent:
from base/subarray.jl:
struct SubArray{T,N,P,I,L} <: AbstractArray{T,N}
    parent::P
    indexes::I
    offset1::Int  # for linear indexing and pointer, only valid when L==true
    stride1::Int  # used only for linear indexing
from base/strings/types.jl:
struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    endof::Int
I guess it is up to me to find out where the saving logic really happens.
@dkdog an option would be to write custom serializers for types you are interested in. E.g. https://gist.github.com/maximsch2/4257a23911b7fe71e5ec519fc23082ff
@maximsch2 has the right plan here. Last I checked there's a little bit of this in JLD itself (see how Associatives are handled), but you're right that JLD emphasizes generic methods that should work for "anything" rather than customizing the logic for different types. Of course, that means they'll be inefficient for some things, and views are a great example.
Julia's base serializer handles SubArrays by trimming them. We could do that in JLD, or we could take your suggestion and write the full array once and then use a reference for the rest.
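Until something like that exists in JLD, the trimming can be done by hand before saving. A minimal sketch, with a made-up array and index range:

```julia
big = collect(1:1_000_000)   # hypothetical large parent array
v = view(big, 10:12)         # SubArray that keeps a reference to all of `big`

# copy materializes only the viewed elements into a fresh Vector,
# so nothing from the rest of `big` travels with it.
trimmed = copy(v)

println(length(parent(v)), " parent elements, ", length(trimmed), " kept")
```

Saving trimmed rather than v avoids writing the parent at all, at the cost of losing the link back to the original array.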