HDF5.jl
Nested Dictionaries
My mental model of an HDF5 file is a folder structure, where related data is grouped together and nested hierarchically. Currently the read functions deliver a flat dictionary, so the hierarchy is encoded in the key strings rather than in the structure. The alternative that matches my mental model is to read an HDF5 file in as a nested dictionary, where the value of a key is the data if the key refers to a dataset, and a dictionary if the key refers to a group.
So for an HDF5 like:
📂 h5file
├─ 🔢 B
└─ 📂 groupA
   ├─ 🔢 A1
   └─ 🔢 A2
The current read generates:
Dict(
    "B" => Bval,
    "groupA/A1" => A1val,
    "groupA/A2" => A2val,
)
And I'd prefer an option to read_nested as:
Dict(
    "B" => Bval,
    "groupA" => Dict(
        "A1" => A1val,
        "A2" => A2val,
    ),
)
I've written this code locally (plus a corresponding write_nested). Would it be reasonable to include it here?
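The relationship between the two layouts above can be sketched in plain Julia, without touching a file. This is only an illustration: the helper name `nest` is hypothetical (it is not part of HDF5.jl or FileIO.jl), and it assumes that `/` is the only path separator and that no name is used both as a dataset and as a group.

```julia
# Hypothetical helper: turn a flat Dict with "/"-separated keys into a nested Dict.
function nest(flat::AbstractDict)
    nested = Dict{String,Any}()
    for (path, value) in flat
        parts = split(path, '/')
        node = nested
        # Walk down the intermediate group names, creating sub-dictionaries as needed.
        for name in parts[1:end-1]
            node = get!(() -> Dict{String,Any}(), node, String(name))
        end
        # The last path component is the dataset name.
        node[String(parts[end])] = value
    end
    return nested
end

flat = Dict("B" => 1, "groupA/A1" => 2, "groupA/A2" => 3)
nested = nest(flat)
nested["groupA"]["A1"]  # same value as flat["groupA/A1"]
```

The integers stand in for the placeholder values Bval, A1val, and A2val.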
I have several questions:
- What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.
- Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.
- Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?
One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.
The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface.
julia> h5f = h5open("test.h5")
🗂️ HDF5.File: (read-only) test.h5
├─ 🔢 B
└─ 📂 groupA
   ├─ 🔢 A1
   └─ 🔢 A2
julia> h5f["groupA"]["A1"][]
2×2 Matrix{Float64}:
0.43893 0.583493
0.546226 0.652598
The only difference here is the final [] to actually access the data. Perhaps with your eager Dict the way to access the contents of A1 would be just h5f["groupA"]["A1"]?
How would you then also deal with attributes?
Thanks for the feedback! Here are my replies:
- What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.
I wrote a custom read function which uses h5open. The reader code is as follows (the writing code is similar). It's not too different from the code in FileIOExt.jl, just with a different output:
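using HDF5
using OrderedCollections: OrderedDict

# Read an entire HDF5 file into a nested OrderedDict.
function load_nested(filename)
    h5open(filename) do fid
        read_group(fid)
    end
end

# Recursively read every child of a file or group.
function read_group(parent)
    d = OrderedDict{String,Any}()
    for key in keys(parent)
        d[key] = read_dataset(parent[key])
    end
    return d
end

# Groups recurse into a sub-dictionary; datasets are read eagerly.
read_dataset(val::HDF5.Group) = read_group(val)
read_dataset(val::HDF5.Dataset) = read(val)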
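The post says the writing code is similar but does not show it. The following is a minimal sketch of what such a write_nested might look like, not the author's actual code. It assumes that sub-dictionaries map to groups and that every other value is something HDF5.jl can write as a dataset; the helper name `write_group` is hypothetical.

```julia
using HDF5

# Hypothetical counterpart to load_nested: write a nested dictionary to a new file.
function write_nested(filename, data::AbstractDict)
    h5open(filename, "w") do fid
        write_group(fid, data)
    end
    return nothing
end

# Mirror the dictionary structure: sub-dictionaries become groups,
# every other value is written as a dataset.
function write_group(parent, data::AbstractDict)
    for (key, value) in data
        if value isa AbstractDict
            write_group(create_group(parent, key), value)
        else
            parent[key] = value
        end
    end
end
```

Opening with mode "w" replaces any existing file at that path, so a real implementation would probably want an option for appending instead.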
- Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.
I would prefer not to create a new function. My new functions are just my local solution to avoid type piracy. So load_nested(file) could be load(file; nested=true) or similar. "hierarchical = true" is something I'd probably mis-spell...
- Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?
I actually wrote it for OrderedDict, but it could mirror the current load function's sink / type flag:
julia> load("track_order.h5"; dict=OrderedDict())
One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.
Probably best practice, especially if data volumes are large. In my case, I want to mutate the structure without risking modifying the file contents. Based on the example you provided, the file is opened read-only. In the past (in other languages and data types) I got into the habit of not leaving files "open".
The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface. The only difference here is the final [] to actually access the data. Perhaps with your eager Dict the way to access the contents of A1 would be just h5f["groupA"]["A1"]?
That's interesting. So h5f["groupA"]["A1"][] is equivalent to read(h5f["groupA"], "A1") (but only for datasets, not groups)... I didn't find that method in the documentation. I've gotten used to something similar working with observables in Makie, but it's still a little strange to me. I don't know what it means; I just know it's how I access that type. Maybe it's a concept from pointers?
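On the empty brackets: in Julia, `x[]` is ordinary indexing with zero indices, that is, a call to `getindex(x)`. `Base.Ref` uses it to read or store the single value it wraps, which is similar in spirit to dereferencing a reference. The sketch below shows that, followed by a hypothetical `LazyValue` type (not part of any package) that is only loosely analogous to a dataset handle: holding the handle does no work, and `[]` triggers the load.

```julia
# Base.Ref wraps a single value; empty brackets read or write it.
r = Ref(41)
r[] += 1          # shorthand for r[] = r[] + 1
value = r[]

# Hypothetical lazy handle: nothing is loaded until it is indexed with [].
struct LazyValue{F}
    loader::F
end

# Zero-index getindex is what makes `handle[]` work.
Base.getindex(lv::LazyValue) = lv.loader()

handle = LazyValue(() -> [0.1 0.2; 0.3 0.4])
data = handle[]   # the loader runs here
```

Any type can opt into this syntax by defining a zero-argument `getindex` method, which is why the same pattern shows up in unrelated packages.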
So the intent of this would be a different structure for the FileIO.load "high level method".
# current:
h5f = load(file) # creates a flat dict
A1 = h5f["groupA/A1"]
# new
h5f = load(file; nested = true) # creates a nested dict
A1 = h5f["groupA"]["A1"]
How would you then also deal with attributes?
I'm not sure. Does the current load function handle attributes?