HDF5.jl

Nested Dictionaries

Open DanDeepPhase opened this issue 10 months ago • 3 comments

My mental model of an HDF5 file is a folder structure, where related data is grouped together and buried in a nested, hierarchical format. Currently the read functions deliver a flat dictionary, so the hierarchy is held in the key strings rather than in the structure. The alternative that matches my mental model is to read an HDF5 file as a nested dictionary, where the value of a key is the data if the key refers to a dataset, and a dictionary if the key refers to a group.

So for an HDF5 like:

📂 h5file
├─ 🔢 B
└─ 📂 groupA
   ├─ 🔢 A1
   └─ 🔢 A2

The current read generates:

Dict(
   "B" => Bval,
   "groupA/A1" => A1val,
   "groupA/A2" => A2val,
)

And I'd prefer an option to read_nested as:

Dict(
   "B" => Bval,
   "groupA" => Dict(
      "A1" => A1val,
      "A2" => A2val,
   ),
)

I've written this code locally (plus a corresponding write_nested). Would it be reasonable to include it here?
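For illustration only, the relationship between the two layouts can be sketched in plain Julia without touching a file. The `nest` and `flatten` helpers below are hypothetical names, not part of HDF5.jl or FileIO, and the sketch assumes that no key is both a dataset and a prefix of another key:

```julia
# Hypothetical helper: turn a flat Dict with "/"-separated keys into nested Dicts.
function nest(flat::AbstractDict)
    nested = Dict{String,Any}()
    for (path, value) in flat
        parts = split(path, '/'; keepempty=false)
        node = nested
        # Walk (and create on demand) one Dict per intermediate group name.
        for name in parts[1:end-1]
            node = get!(() -> Dict{String,Any}(), node, String(name))
        end
        node[String(parts[end])] = value
    end
    nested
end

# Hypothetical inverse: collapse nested Dicts back into "/"-separated keys.
function flatten(nested::AbstractDict, prefix="")
    flat = Dict{String,Any}()
    for (name, value) in nested
        path = isempty(prefix) ? name : prefix * "/" * name
        if value isa AbstractDict
            merge!(flat, flatten(value, path))
        else
            flat[path] = value
        end
    end
    flat
end

flat = Dict("B" => 1, "groupA/A1" => 2, "groupA/A2" => 3)
nested = nest(flat)
```

With these definitions, `nested["groupA"]["A1"]` and `flat["groupA/A1"]` refer to the same value, and `flatten(nested)` recovers the flat form.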

DanDeepPhase avatar Jan 26 '25 18:01 DanDeepPhase

I have several questions:

  1. What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.
  2. Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.
  3. Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?

mkitti avatar Jan 28 '25 05:01 mkitti

One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.

The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface.

julia> h5f = h5open("test.h5")
🗂️ HDF5.File: (read-only) test.h5
├─ 🔢 B
└─ 📂 groupA
   ├─ 🔢 A1
   └─ 🔢 A2

julia> h5f["groupA"]["A1"][]
2×2 Matrix{Float64}:
 0.43893   0.583493
 0.546226  0.652598

The only difference here is the final [] to actually access the data. Perhaps with your eager Dict, the way to access the contents of A1 would be just h5f["groupA"]["A1"]?

How would you then deal with attributes?
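As a rough sketch of what the lazy interface gives you, the snippet below builds a small file shaped like the tree above and reads one dataset three ways. The temporary path and the data values are made up for the example:

```julia
using HDF5

# Build a small file matching the tree above (temporary path, illustrative data).
path = tempname() * ".h5"
h5open(path, "w") do f
    f["B"] = [1, 2, 3]
    g = create_group(f, "groupA")
    g["A1"] = [1.0 2.0; 3.0 4.0]
    g["A2"] = [5.0, 6.0]
end

whole, same, row = h5open(path, "r") do f
    dset = f["groupA"]["A1"]          # lazy handle: nothing has been read yet
    (dset[], read(dset), dset[1, :])  # full read, equivalent full read, first row only
end
```

Here `whole` and `same` hold the full matrix, while `row` was read from disk as a slice without loading the rest of the dataset.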

mkitti avatar Jan 28 '25 05:01 mkitti

Thanks for the feedback! Here are my replies:

  1. What is the function that you are using to "read" the file which creates the Dict{String, Any}? It appears you might be using FileIO.load.

I wrote a custom read function which uses h5open. The reader code is as follows (the writing code is similar). It's not too different from the code in FileIOExt.jl, just a different output:

using HDF5, OrderedCollections

# Read the whole file eagerly into a nested OrderedDict.
function load_nested(filename)
    h5open(filename) do fid
        read_group(fid)
    end
end

# Recursively read a file or group: subgroups become nested dictionaries.
function read_group(parent)
    d = OrderedDict{String,Any}()
    for key in keys(parent)
        d[key] = read_dataset(parent[key])
    end
    d
end

read_dataset(val::HDF5.Group) = read_group(val)
read_dataset(val::HDF5.Dataset) = read(val)

  2. Is there a reason we need to create new functions? An alternative would be to add keyword arguments to FileIO.load.

I would prefer not to create a new function. My new functions are just my local solution to avoid type piracy. So load_nested(file) could be load(file; nested=true) or similar. "hierarchical = true" is something I'd probably mis-spell...

  3. Would the result also be a Dict{String, Any}? Would you have options for an OrderedDict from OrderedCollections.jl or a Dictionary from Dictionaries.jl?

I actually wrote it for OrderedDict, but it could mirror the current load function's sink / typeflag: julia> load("track_order.h5"; dict=OrderedDict())

One of my reservations about this is that it seems to encourage the user to read the entire file at once, and I would tend to encourage the user to use lazy interfaces for this purpose.

Probably best practice, especially if data volumes are large. In my case, I want to mutate the structure without risking modifying the file contents. Based on the example you provided, it is read-only. In the past (in other languages and data types) I got into a habit of not leaving files "open".

The nested dictionary behavior that you describe seems to closely resemble the current dictionary-like interface. The only difference here is the final [] to actually access the data. Perhaps with your eager Dict, the way to access the contents of A1 would be just h5f["groupA"]["A1"]?

That's interesting. So h5f["groupA"]["A1"][] is equivalent to read(h5f["groupA"], "A1") (but only for datasets, not groups)... I didn't find that method in the documentation. I've gotten used to something similar working with observables in Makie, but it's still a little strange to me. I don't know what it means, I just know it's how I access that type. Maybe it's a concept from pointers?
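For what it's worth, the trailing [] is ordinary Julia syntax: x[] is getindex(x) called with no indices, and Ref uses the same form for dereferencing. A minimal sketch follows; the `LazyValue` type is a hypothetical stand-in for a lazy handle, not anything defined by HDF5.jl:

```julia
r = Ref(42)
first_value = r[]       # same as getindex(r)
r[] = 7                 # same as setindex!(r, 7)

# Hypothetical lazy handle: the data is only produced when `[]` is applied.
struct LazyValue
    load::Function
end
Base.getindex(v::LazyValue) = v.load()

lazy = LazyValue(() -> [1.0 2.0; 3.0 4.0])
data = lazy[]           # triggers the load
```

So the pointer intuition is close: the handle stands for the data, and [] asks for the data itself.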

So the intent of this would be a different structure for the FileIO.load "high level method".

# current:
h5f = load(file)                 # creates a flat dict
A1 = h5f["groupA/A1"]

# new 
h5f = load(file; nested = true)  # creates a nested dict
A1 = h5f["groupA"]["A1"]

How would you then deal with attributes?

I'm not sure. Does the current load function handle attributes?
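One possible approach, sketched here under assumptions rather than proposed as the design: store each group's attributes under a reserved key in the nested dictionary. The key name "__attrs__" and the function names are hypothetical, dataset attributes are ignored, and a dataset that happened to be named "__attrs__" would collide with the reserved key:

```julia
using HDF5

const ATTR_KEY = "__attrs__"   # hypothetical reserved key, not an HDF5.jl convention

# Collect all attributes of a file or group into a plain Dict.
read_attrs(obj) = Dict{String,Any}(
    name => read_attribute(obj, name) for name in keys(attributes(obj))
)

# Like a nested read, but each group's Dict also carries its attributes.
function read_group_with_attrs(parent)
    d = Dict{String,Any}(ATTR_KEY => read_attrs(parent))
    for key in keys(parent)
        child = parent[key]
        d[key] = child isa HDF5.Group ? read_group_with_attrs(child) : read(child)
    end
    d
end

# Illustrative file: one group with one dataset and one attribute.
path = tempname() * ".h5"
h5open(path, "w") do f
    g = create_group(f, "groupA")
    g["A1"] = [1.0, 2.0]
    write_attribute(g, "units", "mm")
end

nested = h5open(read_group_with_attrs, path, "r")
```

With this layout, the attribute would be reachable as nested["groupA"]["__attrs__"]["units"], next to the data at nested["groupA"]["A1"].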

DanDeepPhase avatar Jan 28 '25 21:01 DanDeepPhase