HDF5.jl
Support for DiskArrays
I want to start a discussion about whether implementing the DiskArrays
interface would make sense for this package. Making `HDF5Dataset`
a subtype of `AbstractDiskArray`
would have the benefit of outsourcing Base-conformant indexing rules and providing nice features like views, reductions, lazy broadcasting, etc.
However, it would probably break some old code that relies on the current behavior. As an alternative, users can simply wrap HDF5 datasets into a DiskArray themselves:
```julia
using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
    ds::HDF5Dataset
    cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)

# Report chunking only when the wrapped dataset actually has a chunk layout
haschunks(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = Chunked()
haschunks(x::HDF5DiskArray) = Unchunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs

readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v

function HDF5DiskArray(ds::HDF5Dataset)
    # get_chunk throws for contiguous (unchunked) datasets, so fall back to `nothing`
    cs = try
        GridChunks(ds, get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds, cs)
end
```
Now you can wrap an `HDF5Dataset` into a DiskArray:
```julia
f = h5open("chunk_test.h5", "w")
A = rand(100, 100)
f["A", "chunk", (5, 5)] = A
d = HDF5DiskArray(f["A"])
```
and the following will operate chunk by chunk and be much more efficient than using the AbstractArray interface:
```julia
using Statistics

# Reducing over datasets is done chunk by chunk
mean(d)

# Broadcasting respects chunks
d .= 1

# Reductions over dimensions are ok as well
mean(d, dims=2)
```
So maybe a better option would be to create an HDF5Plus.jl package where we define this wrapper, and users can decide which package to use. What do the maintainers of HDF5 think?
See HDF5Utils.jl
Thanks @AStupidBear! If you have any questions or stumble over issues with DiskArrays, feel free to open an issue.
@AStupidBear do you think we should try to merge the two repos?
Overall this seems like a great idea. We should think about how best to integrate everything for a better end-user experience.
@musm I'd like to merge HDF5Utils into HDF5.
HDF5Utils contains the following features:
- loading and saving virtual datasets, and a `VirtualLayout` for constructing them
- suppressing error messages, setting alignment
- utility functions to concatenate HDF5 files into a single file or into a virtual dataset efficiently
- supporting DiskArrays, plus a naive caching mechanism for fast sequential scalar `getindex` (a rough sketch of the idea follows this list)
- loading and saving `MaxLenString{N}` and `Array{<:MaxLenString{N}}`
- loading and saving `HDF5Compound` as `NamedTuple` and arrays of `NamedTuple`
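For reference, here is a minimal sketch of the kind of naive caching the fourth item refers to. It is only an illustration of the idea, not HDF5Utils' actual implementation; the `CachedScalarReader` name and the fixed block size are made up for the example. Scalar reads are served from a block that was fetched in one bulk read, so sequential element-by-element access does not hit the file once per element.

```julia
# Illustrative sketch (hypothetical, not HDF5Utils' code): cache one block of the
# wrapped array and serve scalar getindex calls from it.
mutable struct CachedScalarReader{T,N,D<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::D                            # e.g. an HDF5DiskArray from the snippet above
    cache::Array{T,N}                  # currently cached block
    range::NTuple{N,UnitRange{Int}}    # indices covered by `cache`
    blocksize::Int                     # edge length of a cached block
end

CachedScalarReader(data::AbstractArray{T,N}; blocksize=64) where {T,N} =
    CachedScalarReader(data, Array{T,N}(undef, ntuple(_ -> 0, N)),
                       ntuple(_ -> 1:0, N), blocksize)

Base.size(c::CachedScalarReader) = size(c.data)

function Base.getindex(c::CachedScalarReader{T,N}, I::Vararg{Int,N}) where {T,N}
    if !all(ntuple(d -> I[d] in c.range[d], N))
        # Cache miss: fetch the whole block containing index I with one bulk read
        c.range = ntuple(N) do d
            lo = (I[d] - 1) ÷ c.blocksize * c.blocksize + 1
            lo:min(lo + c.blocksize - 1, size(c.data, d))
        end
        c.cache = c.data[c.range...]
    end
    return c.cache[ntuple(d -> I[d] - first(c.range[d]) + 1, N)...]
end
```

With `c = CachedScalarReader(d; blocksize=5)` (matching the 5×5 chunks from the earlier example), a sequential loop over `c[i, j]` touches the file roughly once per block rather than once per element.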
The first four features would be good to merge into HDF5.jl, but the last two are controversial. Does the HDF5.jl team support this design (NamedTuple)?
In my use cases, I need to first write a dataset of `HDF5Compound` type and read it back in another process with `readmmap`. The last two features are designed for this.
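To make that use case concrete, here is a hedged sketch of the intended round trip. The file name, dataset name, and field layout are invented for the example; writing a `Vector` of `NamedTuple`s as a compound dataset is the HDF5Utils capability described above, not something plain HDF5.jl promises here, and it is assumed that `readmmap` can map the compound dataset back as an array of `NamedTuple`s, which is exactly the behaviour under discussion.

```julia
using HDF5   # plus the NamedTuple <-> HDF5Compound support described above (assumed)

# Process 1: write an array of isbits NamedTuples as a compound dataset.
rows = [(price = rand(Float32), volume = rand(Int32)) for _ in 1:10^6]
h5open("ticks.h5", "w") do f
    f["ticks"] = rows                 # assumed NamedTuple-to-compound writing
end

# Process 2 (possibly a different Julia process): memory-map the dataset back
# as an array of NamedTuples instead of copying it into RAM.
h5open("ticks.h5", "r") do f
    ticks = readmmap(f["ticks"])      # assumed to yield an mmapped array of NamedTuples
    sum(t.volume for t in ticks)
end
```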
Good idea. I think it's best to proceed with very minimal updates first, esp. non-breaking ones. We can probably merge those quickly.