HDF5.jl

Support for DiskArrays

Open meggart opened this issue 4 years ago • 6 comments

I want to start a discussion about whether implementing the DiskArrays interface would make sense for this package. Making HDF5Dataset a subtype of AbstractDiskArray would have the benefit of outsourcing Base-conformant indexing rules and providing nice features like views, reductions, lazy broadcasting, etc.

However, it would probably break some old code that relies on the old behavior. As an alternative, users can simply wrap HDF5 datasets in a DiskArray themselves:

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
  ds::HDF5Dataset
  cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)
# No chunk information (cs === nothing) means the dataset is treated as unchunked
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v
function HDF5DiskArray(ds::HDF5Dataset)
    cs = try
        GridChunks(ds, get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds,cs)
end

Now you can wrap an HDF5Dataset in a DiskArray:

f = h5open("chunk_test.h5","w")

A = rand(100,100)
f["A", "chunk", (5,5)] = A
d = HDF5DiskArray(f["A"])

and the following will operate chunk by chunk and be much more efficient than using the AbstractArray interface:

using Statistics
#Reducing over datasets is done chunk by chunk
mean(d)

#Broadcasting respects chunks
d .= 1

#Reductions over dimensions are ok as well
mean(d, dims=2)
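
To illustrate what "chunk by chunk" means here, the following is a plain-Julia sketch of a chunk-wise mean over an in-memory matrix. It is not how DiskArrays is implemented; `chunk_ranges` and `chunked_mean` are hypothetical helper names made up for this example, and the in-memory `A` only stands in for an on-disk dataset.

```julia
# Illustrative sketch only: `chunk_ranges` and `chunked_mean` are hypothetical
# helpers, not part of DiskArrays or HDF5.

# Split 1:n into consecutive ranges of length c (the last one may be shorter).
chunk_ranges(n::Int, c::Int) = [i:min(i + c - 1, n) for i in 1:c:n]

function chunked_mean(A::AbstractMatrix, chunksize::NTuple{2,Int})
    total = 0.0
    for cols in chunk_ranges(size(A, 2), chunksize[2]),
        rows in chunk_ranges(size(A, 1), chunksize[1])
        # For an on-disk dataset, this indexing would be a single read
        # that covers exactly one chunk.
        block = A[rows, cols]
        total += sum(block)
    end
    return total / length(A)
end

A = rand(100, 100)
m = chunked_mean(A, (5, 5))
```

Each block is read once and only one block is held in memory at a time, which is the access pattern a chunked dataset favors.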

So maybe a better option would be to create HDF5Plus.jl, where we define this wrapper, and users can choose which package to use. What do the maintainers of HDF5 think?

meggart avatar Apr 02 '20 13:04 meggart

See HDF5Utils.jl

AStupidBear avatar Apr 09 '20 17:04 AStupidBear

Thanks @AStupidBear. If you have any questions or stumble over issues with DiskArrays, feel free to open an issue.

meggart avatar Apr 09 '20 20:04 meggart

@AStupidBear do you think we should try to merge the two repos?

musm avatar Apr 15 '20 17:04 musm

Overall this seems like a great idea. We should think about how to integrate everything for a better end-user experience.

musm avatar Apr 15 '20 17:04 musm

@musm I'd like to merge HDF5Utils into HDF5.

HDF5Utils contains the following features:

  • loading and saving Virtual Dataset and a VirtualLayout for construction
  • suppressing error messages, setting alignment
  • utility functions to concatenate hdf5 files into a single one or into a virtual dataset efficiently
  • supporting DiskArrays and a naive caching mechanism for fast sequential scalar getindex
  • loading and saving MaxLenString{N} and Array{<:MaxLenString{N}}
  • loading and saving HDF5Compound as NamedTuple and arrays of NamedTuple

The first four features are good candidates for merging into HDF5.jl, but the last two are controversial. Does the HDF5.jl team support this design (NamedTuple)?

In my use cases, I need to first write a dataset of HDF5Compound type and read it back in another process by readmmap. The last two features are designed for this.

AStupidBear avatar Apr 16 '20 06:04 AStupidBear

Good idea. I think it's best to proceed with very minimal updates first, esp. non-breaking ones. We can probably merge those quickly.

musm avatar Apr 17 '20 15:04 musm