HDF5.jl
Loading (parts of) HDF5 datasets into existing arrays?
From what I can see, there's currently no easy way to load an HDF5 dataset (or parts of one) into an existing array (to avoid memory-allocation/GC costs, especially in multi-threaded applications). We could provide methods for setindex! and view to support something like
target[:] = ds
and (to read a fragment of a dataset into a fragment of an array)
target[a:b, ...] = view(ds, c:d, ...)
Something like
similar(::Type{Array}, ds::HDF5Dataset)
might also come in handy in that context.
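For concreteness, here is a rough sketch of the proposed pattern, using a stand-in `MockDataset` type (purely illustrative, not HDF5.jl API) so the example runs without an actual HDF5 file; with HDF5.jl, the indexed reads would issue hyperslab selections instead of touching an in-memory array:

```julia
# Stand-in for an on-disk dataset (illustrative only).
struct MockDataset{T,N}
    data::Array{T,N}
end
Base.eltype(::MockDataset{T}) where {T} = T
Base.size(d::MockDataset) = size(d.data)

# The proposed `similar` method: allocate a compatible buffer once, up front.
Base.similar(::Type{Array}, d::MockDataset) = Array{eltype(d)}(undef, size(d))

# Read a fragment of the dataset into a fragment of a pre-allocated array
# without allocating a temporary -- the `target[a:b, ...] = view(ds, c:d, ...)`
# pattern from above, spelled out as a function.
function read_fragment!(target, tinds, d::MockDataset, dinds)
    copyto!(view(target, tinds...), view(d.data, dinds...))
    return target
end

ds  = MockDataset(collect(reshape(1.0:16.0, 4, 4)))
buf = similar(Array, ds)                 # allocated once, reused thereafter
read_fragment!(buf, (1:2, 1:2), ds, (3:4, 3:4))
```

The point of the function form is that `buf` can be reused across many reads, so the per-chunk cost is a copy, not an allocation.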
Many data sets (e.g. unchunked, uncompressed ones) will be memory-mapped, in which case the OS already provides a view of the data on disk.
Still, something to cover the general case would be awesome.
Many data sets (e.g. unchunked, uncompressed ones) will be memory-mapped
Sure - but I'll often have to deal with large files containing chunked and compressed datasets (and therefore have to process them out-of-core, in chunks). If people are fine with the proposal above, I can implement it and open a PR; I just wanted to gauge acceptance first.
It seems very reasonable to me. Perhaps an AbstractArray type which is the full view of the dataset, using e.g. SubArray for subsets/views thereof?
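A minimal sketch of that wrapper idea (all names are illustrative; `getindex` delegates to a plain in-memory array here, where a real implementation would issue a hyperslab read against the dataset):

```julia
# Lazy "full view" of a dataset as an AbstractArray; indexing reads on demand.
struct DatasetView{T,N,D} <: AbstractArray{T,N}
    ds::D
end
DatasetView(ds::AbstractArray{T,N}) where {T,N} = DatasetView{T,N,typeof(ds)}(ds)

Base.size(v::DatasetView) = size(v.ds)
# Scalar indexing; for a real dataset this would be a (possibly cached) disk read.
Base.getindex(v::DatasetView{T,N}, I::Vararg{Int,N}) where {T,N} = v.ds[I...]

v  = DatasetView(collect(reshape(1:9, 3, 3)))
sv = view(v, 1:2, 2:3)    # a SubArray "for free", via the AbstractArray interface
```

Because `DatasetView <: AbstractArray`, generic `view`, iteration, and broadcasting fall out of the standard array interface without further methods.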
If you need to keep track of which chunk you're working on, but still read only small chunks (i.e., never have the "full view" available), then an OffsetArray
is handy:
using OffsetArrays

function read_chunk!(buf, ds, inds...)
    # Fill the pre-allocated buffer from the selected region of the dataset.
    buf[:] = view(ds, inds...)
    # Wrap the buffer so it is indexed by the dataset's own coordinates.
    OffsetArray(buf, inds)
end
... then an OffsetArray is handy
Thanks, Tim, that could indeed come in very useful. To what extent are arrays with non-one-based indexing currently supported by Base and common packages?
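For reference, OffsetArrays.jl provides such arrays, and generic code supports them by querying `axes`/`eachindex` instead of assuming `1:length(a)`. A small illustration (assuming OffsetArrays.jl is installed):

```julia
using OffsetArrays

buf = zeros(3)
oa  = OffsetArray(buf, 5:7)   # same memory as `buf`, but indexed by 5:7
oa[5] = 1.0                   # writes through to buf[1]

# Generic code should iterate the array's own axes, not 1:length(oa):
for i in eachindex(oa)
    oa[i] += 1
end
```

Code written against `axes`/`eachindex`/`firstindex` works unchanged for both 1-based and offset arrays, which is what makes the `read_chunk!` pattern above composable with Base.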