HDF5.jl

Loading (parts of) HDF5 datasets into existing arrays?

Open oschulz opened this issue 7 years ago • 6 comments

From what I can see, there's currently no easy way to load an HDF5 dataset (or parts of it) into an existing array (to avoid memory allocation and GC costs, especially in multi-threaded applications). We could provide methods for setindex! and view to support something like

target[:] = ds

and (to read a fragment of a DS into a fragment of an array)

target[a:b, ...] = view(ds, c:d, ...)
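
To make the intended semantics concrete, here is a minimal sketch that uses a plain Array as a stand-in for the dataset. The name `src` and the index ranges are made up for illustration; the `setindex!`/`view` methods for HDF5 datasets are only a proposal here, so this shows the behaviour one would want rather than anything HDF5.jl is known to provide.

```julia
# Stand-in for an on-disk dataset (assumption: a real dataset would behave analogously).
src = collect(1.0:20.0)

# Preallocated target that is reused across reads, so no new array is created per read.
target = zeros(10)

# Copy a fragment of the "dataset" into a fragment of the existing array.
# `view` avoids materialising src[11:14] as a temporary copy.
target[3:6] = view(src, 11:14)

# Whole-array form: overwrite every element of `target` in place.
target[:] = view(src, 1:10)
```

With a real dataset, the right-hand side would be the proposed lazy view, and the assignment would trigger the actual read into `target`.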

oschulz avatar May 23 '17 08:05 oschulz

Something like

similar(::Type{Array}, ds::HDF5Dataset)

might also come in handy in that context.

oschulz avatar May 23 '17 08:05 oschulz

Many datasets (e.g. unchunked, uncompressed ones) will be memory mapped, in which case the OS already provides a view of the data on disk.

Still, something to cover the general case would be awesome.

andyferris avatar May 23 '17 08:05 andyferris

Many datasets (e.g. unchunked, uncompressed ones) will be memory mapped

Sure - but I'll have to deal with large files with chunked and compressed datasets quite often (and therefore, have to process them out-of-core, in chunks). If people are fine with the proposal above, I can implement it and do a PR. I just wanted to gauge acceptance, first.

oschulz avatar May 23 '17 09:05 oschulz

It seems very reasonable to me. Perhaps an AbstractArray type which is the full view of the dataset, and use e.g. SubArray for subsets/views thereof?

andyferris avatar May 23 '17 10:05 andyferris

If you need to keep track of which chunk you're working on, but still read only small chunks (i.e., never have the "full view" available), then an OffsetArray is handy:

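using OffsetArrays

# Fill the preallocated `buf` with the region `inds` of `ds` (relies on the proposed `view`),
# then wrap `buf` so that it is indexed with the dataset's own indices.
function read_chunk!(buf, ds, inds...)
    buf[:] = view(ds, inds...)
    OffsetArray(buf, inds)
end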
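
A usage sketch of the function above, assuming OffsetArrays.jl is installed. A plain Vector (`data`, a made-up name) stands in for the dataset, since `view(ds, ...)` on an HDF5 dataset is the proposed API, not an existing one.

```julia
using OffsetArrays

# Same function as above, repeated so the sketch is self-contained.
function read_chunk!(buf, ds, inds...)
    buf[:] = view(ds, inds...)   # copy the requested region into the reusable buffer
    OffsetArray(buf, inds)       # wrap the buffer; no copy is made
end

data = collect(1:100)            # stand-in for an on-disk dataset (assumption)
buf = Vector{Int}(undef, 10)     # allocated once, reused for every chunk

chunk = read_chunk!(buf, data, 41:50)

# The chunk is addressed with the dataset's indices, not 1:10.
first_value = chunk[41]
last_value = chunk[50]
```

Because the OffsetArray only wraps `buf`, reading the next chunk into the same buffer also changes what an earlier wrapper shows, so each chunk should be fully processed before the next read.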

timholy avatar May 23 '17 10:05 timholy

... then an OffsetArray is handy

Thanks, Tim, that could indeed come in very useful. To what extent are arrays with non-one-based indexing currently supported by Base and common packages?

oschulz avatar May 23 '17 14:05 oschulz