No Intuitive Way to Write Compound Data Types

Open nicrummel opened this issue 4 years ago • 9 comments

It is incredibly easy to read HDF5 files and access the data within, no matter how complex the datatypes and file structure are. However, after hours of looking at the documentation and reading the source code, I still do not see a straightforward way to write compound data types to a file.

A simple task that demonstrates where the API is failing me is the following:

Take two hdf5 datasets with exactly the same structure, and 'concatenate' the two maintaining the original format.

dset1 = | Col1 (Int32) | Col2 (String) | # This syntax is just supposed to visualize the compound datatype
dset2 = | Col1 (Int32) | Col2 (String) | # Rather than to show actual output of Julia REPL

I can simply read in the data with

using HDF5
filename1, filename2, datasetName, newFilename = # Strings
h5_file1, h5_file2 = h5open(filename1), h5open(filename2) # load files 
h5_dict1,  h5_dict2 = read(h5_file1), read(h5_file2) # Turn HDF5.Files into Dictionaries
dset1, dset2 = h5_dict1[datasetName ], h5_dict2[datasetName] # Returns a Vector{NamedTuple} 
                                                             # with names specifying the Col
# Try to concatenate datasets
h5open(newFilename,"w") do wf
    create_dataset(wf, 
                   datasetName , 
                   datatype(h5_file1[datasetName]), # obtain the compound datatype for dataset
                   (length(dset1) + length(dset2),)) # pre-allocate space for the combined datasets 
    for (i, datum) in enumerate(dset1) 
        wf[datasetName][i] = datum # this causes an error
    end
    for (i, datum) in enumerate(dset2) 
        wf[datasetName][i + length(dset1)] = datum # this causes an error
    end
end

Here is the error message

ERROR: MethodError: no method matching setindex!(::HDF5.Dataset, ::NamedTuple{(:Col1, :Col2), Tuple{Int32, String}}, ::Int64)

I believe that there is a way to write the compound datasets, but I cannot find it anywhere. Could someone enlighten me?

nicrummel • Feb 16 '21 17:02

Thanks for the issue.

Does this example help at all?



using HDF5

struct Foo
    a::Int32
    b::String
end

struct Foo_hdf5
    a::Int32
    b::Cstring
end

function Base.unsafe_convert(::Type{Foo_hdf5}, x::Foo)
    Foo_hdf5(x.a, Base.unsafe_convert(Cstring, x.b))
end

function HDF5.datatype(::Type{Foo_hdf5})
    dtype = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(Foo_hdf5))
    HDF5.h5t_insert(dtype, "a", fieldoffset(Foo_hdf5, 1), datatype(Int32))

    vlenstr_dtype = HDF5.h5t_copy(HDF5.H5T_C_S1)
    HDF5.h5t_set_size(vlenstr_dtype, HDF5.H5T_VARIABLE)
    HDF5.h5t_set_cset(vlenstr_dtype, HDF5.H5T_CSET_UTF8)
    HDF5.h5t_insert(dtype, "b", fieldoffset(Foo_hdf5, 2), vlenstr_dtype)

    HDF5.Datatype(dtype)
end

# for convenience 
using Random
Base.rand(rng::AbstractRNG, ::Type{Foo}) = Foo(rand(rng, Int32),randstring(rng))
N = 4
v = [rand(Foo) for i in 1:N]

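# Each Cstring field points into the corresponding String in `v`,
# so `v` must stay alive for as long as `v_write` is in use.
v_write = Base.unsafe_convert.(Foo_hdf5, v)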

fn = tempname()
h5f = h5open(fn, "w")

dtype = datatype(Foo_hdf5)
dspace = dataspace(v_write)
dset = create_dataset(h5f, "foo", dtype, dspace)
write_dataset(dset, dtype, v_write)
close(h5f)  # flush the data and release the file handle


I agree that we should aim to simplify writing compound datasets. I think the main issue is selecting the default String type in a compound dataset.
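To see what the `fieldoffset` and `sizeof` arguments in the example above describe, here is a minimal sketch in plain Julia (no HDF5 calls) that prints the in-memory layout of `Foo_hdf5`. The `layout` variable is introduced here only for illustration.

```julia
struct Foo_hdf5
    a::Int32
    b::Cstring
end

# (name, byte offset, Julia type) for every field, in declaration order
layout = [(fieldname(Foo_hdf5, i), Int(fieldoffset(Foo_hdf5, i)), fieldtype(Foo_hdf5, i))
          for i in 1:fieldcount(Foo_hdf5)]

for (name, offset, T) in layout
    println(rpad(string(name), 4), " offset=", offset, " size=", sizeof(T), " type=", T)
end
println("total size = ", sizeof(Foo_hdf5))
```

On a typical 64-bit platform the second field starts at offset 8 rather than 4, because the pointer-sized `Cstring` is aligned to 8 bytes. The compound datatype has to be given these offsets, padding included, so that it matches how Julia stores the struct in memory.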

musm • Feb 16 '21 19:02

We are certainly lacking examples on writing compound data types in the documentation.

We recently overhauled and greatly improved reading compound data types. I agree that writing them is currently cumbersome and we should improve the API.

musm • Feb 16 '21 19:02

@musm Thank you for the quick reply. This did solve my issue. I appreciate your help.

nicrummel • Feb 16 '21 19:02

Leaving the issue open so we can track improving the documentation in this regard.

musm • Feb 16 '21 19:02

@jmert Would you be opposed or have any thoughts on adding something along the lines of ('pseudo-code'):

function HDF5.datatype(::Type{T}) where T
    dtype = HDF5.h5t_create(HDF5.H5T_COMPOUND, sizeof(T))
    for i in 1:fieldcount(T)
        # member name as a String, byte offset in the struct, HDF5 datatype of the field
        HDF5.h5t_insert(dtype, string(fieldname(T, i)), fieldoffset(T, i), datatype(fieldtype(T, i)))
    end
    HDF5.Datatype(dtype)
end

We'll have to specify a default method for string types, but I think H5T_VARIABLE would be a sensible choice.
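One way to picture such a default for string fields is a small dispatch table that maps each Julia field type to the type that would be stored, with `String` mapped to `Cstring` as in the earlier `Foo` / `Foo_hdf5` pair. The sketch below is plain Julia. `storage_type` and `storage_fields` are hypothetical names and are not part of HDF5.jl.

```julia
# Hypothetical helpers, not part of HDF5.jl.
# By default a field is stored as its own type ...
storage_type(::Type{T}) where {T} = T
# ... while String fields default to a C string pointer (variable-length string).
storage_type(::Type{String}) = Cstring

struct Foo
    a::Int32
    b::String
end

# Field name => storage type, in declaration order
storage_fields(::Type{T}) where {T} =
    [fieldname(T, i) => storage_type(fieldtype(T, i)) for i in 1:fieldcount(T)]

fields = storage_fields(Foo)
println(fields)
```

Choosing a different string default, for example a fixed-length string, would then only require changing the `storage_type(::Type{String})` method.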

musm • Feb 24 '21 02:02

@jmert Would you be opposed or have any thoughts on adding something along the lines of ('pseudo-code'):

Yeah, I think that kind of idea is the right way to go — being able to more conveniently build the type spec for simple structs is a big convenience item we could add. We might want to think about/prototype how that interacts with the generic NamedTuple read and if we could consolidate the behavior a bit, but this kind of type-translation is what I've had in the back of my mind to work towards.

jmert • Feb 25 '21 03:02

Cross-referencing a post of mine on the Julia Discourse.

tl;dr isn't there currently at least one fairly straightforward way (involving NamedTuples) of reading and writing compound data types that would be helpful to document?

pbouffard • Apr 15 '23 17:04

I thought we addressed this. Let me see.

mkitti • Apr 15 '23 23:04

To summarize my comment on the Discourse thread, we would welcome a pull request to add this documentation.

My intention was to finish integration of StaticStrings.jl, but this took longer to develop than I imagined. Integrating StaticStrings.jl into HDF5.jl will probably need to be a breaking change as well and thus require a bump in the minor version per semantic versioning.

Feel free to document what you need.

On another note, consider defining a new struct rather than using a NamedTuple. This allows you more control over conversion without piracy.
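As a rough illustration of that suggestion, the sketch below wraps the `Col1` / `Col2` rows from the original question in a struct owned by the user's code. `Record` and `to_namedtuple` are hypothetical names. The point is that any conversion or `HDF5.datatype` method can then be attached to `Record`, instead of adding methods for `NamedTuple`, a type owned by Base.

```julia
# Hypothetical record type mirroring the Col1/Col2 layout from the question
struct Record
    Col1::Int32
    Col2::String
end

# Build a Record from a matching NamedTuple (for example, a row read from a file)
Record(nt::NamedTuple{(:Col1, :Col2)}) = Record(nt.Col1, nt.Col2)

# Convert back when a NamedTuple is needed
to_namedtuple(r::Record) = (Col1 = r.Col1, Col2 = r.Col2)

rows = [(Col1 = Int32(1), Col2 = "a"), (Col1 = Int32(2), Col2 = "b")]
records = Record.(rows)           # Vector{Record}
roundtrip = to_namedtuple.(records)
```

Because every method here dispatches on `Record`, nothing in this sketch changes how other code sees `NamedTuple` values.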

mkitti • Apr 16 '23 04:04