HDF5.jl
No Intuitive Way to Write Compound Data Types
While it is incredibly easy to read HDF5 files and access the data within (no matter the complexity of the datatypes and structure of the files), after hours of looking at documentation and reading the source code I still do not see a straightforward way to write compound data types to a file.
A simple task that demonstrates where the API is failing me is the following:
Take two hdf5 datasets with exactly the same structure, and 'concatenate' the two maintaining the original format.
dset1 = | Col1 (Int32) | Col2 (String) | # This syntax is just supposed to visualize the compound datatype
dset2 = | Col1 (Int32) | Col2 (String) | # Rather than to show actual output of Julia REPL
I can simply read in the data with
using HDF5
filename1, filename2, datasetName, newFilename = # Strings
h5_file1, h5_file2 = h5open(filename1), h5open(filename2) # load files
h5_dict1, h5_dict2 = read(h5_file1), read(h5_file2) # Turn HDF5.Files into Dictionaries
dset1, dset2 = h5_dict1[datasetName ], h5_dict2[datasetName] # Returns a Vector{NamedTuple}
# with names specifying the Col
# Try to concatenate datasets
h5open(newFilename, "w") do wf
    create_dataset(wf,
                   datasetName,
                   datatype(h5_file1[datasetName]),   # obtain the compound datatype for dataset
                   (length(dset1) + length(dset2),))  # pre-allocate space for the combined datasets
    for (i, datum) in enumerate(dset1)
        wf[datasetName][i] = datum                    # this causes an error
    end
    for (i, datum) in enumerate(dset2)
        wf[datasetName][i + length(dset1)] = datum    # this causes an error
    end
end
Here is the error message
ERROR: MethodError: no method matching setindex!(::HDF5.Dataset, ::NamedTuple{(:Col1, :Col2), Tuple{Int32, String}}, ::Int64)
I believe that there is a way to write the compound datasets, but I cannot find it anywhere. Could someone enlighten me?
Thanks for the issue.
Does this example help at all?
using HDF5
struct Foo
    a::Int32
    b::String
end
struct Foo_hdf5
    a::Int32
    b::Cstring
end
function Base.unsafe_convert(::Type{Foo_hdf5}, x::Foo)
    Foo_hdf5(x.a, Base.unsafe_convert(Cstring, x.b))
end
function HDF5.datatype(::Type{Foo_hdf5})
    dtype = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(Foo_hdf5))
    HDF5.h5t_insert(dtype, "a", fieldoffset(Foo_hdf5, 1), datatype(Int32))
    vlenstr_dtype = HDF5.h5t_copy(HDF5.H5T_C_S1)
    HDF5.h5t_set_size(vlenstr_dtype, HDF5.H5T_VARIABLE)
    HDF5.h5t_set_cset(vlenstr_dtype, HDF5.H5T_CSET_UTF8)
    HDF5.h5t_insert(dtype, "b", fieldoffset(Foo_hdf5, 2), vlenstr_dtype)
    HDF5.Datatype(dtype)
end
# for convenience
using Random
Base.rand(rng::AbstractRNG, ::Type{Foo}) = Foo(rand(rng, Int32),randstring(rng))
N = 4
v = [rand(Foo) for i in 1:N]
v_write = Base.unsafe_convert.(Foo_hdf5, v)
fn = tempname()
h5f = h5open(fn, "w")
dtype = datatype(Foo_hdf5)
dspace = dataspace(v_write)
dset = create_dataset(h5f, "foo", dtype, dspace)
write_dataset(dset, dtype, v_write)
close(h5f) # flush the data and close the file
I agree that we should aim to simplify writing compound datasets. I think the main issue is selecting the default String type in a compound dataset.
We are certainly lacking examples on writing compound data types in the documentation.
We recently overhauled and greatly improved reading compound data types. I agree that writing them is currently cumbersome and we should improve the API.
@musm Thank you for the quick reply. This did solve my issue. I appreciate your help.
Leaving the issue open so we can track improving the documentation in this regard.
@jmert Would you be opposed or have any thoughts on adding something along the lines of ('pseudo-code'):
function HDF5.datatype(::Type{T}) where T
    dtype = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(T))
    for i in 1:fieldcount(T)
        # member name as a String, byte offset in the struct, HDF5 datatype of the field
        HDF5.h5t_insert(dtype, string(fieldname(T, i)), fieldoffset(T, i), datatype(fieldtype(T, i)))
    end
    HDF5.Datatype(dtype)
end
We'll have to specify a default method for string types, but I think H5T_VARIABLE would be a sensible choice.
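To make the reflection calls in that pseudo-code concrete, here is a minimal sketch using only Base Julia (no HDF5 calls) that lists what a generic `datatype` method would have to pass to `h5t_insert` for each field. The `field_layout` helper is a hypothetical name introduced for illustration; `Foo_hdf5` is the struct from the earlier example.

```julia
# Hypothetical helper (not part of HDF5.jl): collect the name, byte offset,
# and Julia type of every field of a struct type T.
field_layout(::Type{T}) where {T} =
    [(name = string(fieldname(T, i)),
      offset = Int(fieldoffset(T, i)),
      type = fieldtype(T, i)) for i in 1:fieldcount(T)]

# Same layout as the Foo_hdf5 struct in the example above.
struct Foo_hdf5
    a::Int32
    b::Cstring
end

# Print one line per field of the struct.
for f in field_layout(Foo_hdf5)
    println(f.name, " at byte offset ", f.offset, " :: ", f.type)
end
```

Note that `fieldcount(T)` counts the fields of the type `T`, whereas `nfields` applied to a type inspects the `DataType` object itself.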
Yeah, I think that kind of idea is the right way to go. Being able to build the type spec for simple structs more conveniently is a big convenience item we could add. We might want to think about/prototype how that interacts with the generic NamedTuple read and whether we could consolidate the behavior a bit, but this kind of type-translation is what I've had in the back of my mind to work towards.
Cross-referencing a post of mine on the Julia Discourse.
tl;dr: isn't there currently at least one fairly straightforward way (involving NamedTuples) of reading and writing compound data types that would be helpful to document?
I thought we addressed this. Let me see.
To summarize my comment on the Discourse thread, we would welcome a pull request to add this documentation.
My intention was to finish integration of StaticStrings.jl, but this took longer to develop than I imagined. Integrating StaticStrings.jl into HDF5.jl will probably need to be a breaking change as well and thus require a bump in the minor version per semantic versioning.
Feel free to document what you need.
On another note, consider defining a new struct rather than using a NamedTuple. This allows you more control over conversion without piracy.
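A minimal sketch of that suggestion, assuming the two-column layout from the original question. `Row` and its constructor are hypothetical names, the sample data are stand-ins, and only Base Julia is used. The conversion is defined on a type you own, so no methods are added for types owned by Base or HDF5.jl.

```julia
# Hypothetical struct mirroring the compound layout | Col1 (Int32) | Col2 (String) |
struct Row
    Col1::Int32
    Col2::String
end

# The conversion lives on our own type, so this is not type piracy.
Row(nt::NamedTuple) = Row(nt.Col1, nt.Col2)

# Stand-ins for the Vector{NamedTuple} values returned by reading the two files.
dset1 = [(Col1 = Int32(1), Col2 = "a"), (Col1 = Int32(2), Col2 = "b")]
dset2 = [(Col1 = Int32(3), Col2 = "c")]

# Concatenate in memory, then convert every element to the struct.
rows = Row.(vcat(dset1, dset2))
```

From here, the rows could be converted to a C-compatible struct and written in the same way as the `Foo` / `Foo_hdf5` example earlier in the thread.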