DiskArrays.jl
Create an additional interface for defining chunks for asynchronous IO
I recently started thinking about implementing v3 of the Zarr specs, which includes sharding as an extension: several chunks are stored in a single storage unit. One problem that arises immediately is that when shards are present, chunks in the same shard cannot be written from different threads/tasks without risking data corruption, so ideally downstream applications should be able to query the sharding structure of an `AbstractDiskArray`. Zarr files are not the only situation in which chunks do not align with storage units. For example, it is quite common to concatenate a list of NetCDF/TIFF files using `ConcatDiskArray`. When writing to such an array, parallel writes are safe as long as they end up in different files, but these storage units differ from the actual chunks of the array, which rather represent compression units.
So my suggestion would be to add something like a `hasioblocks` and `eachioblock` function pair to DiskArrays, which DiskArray implementers can optionally extend. The point of these functions would be to return an iterator of blocks of chunk indices that belong to the same storage unit, where members of different storage blocks can be safely mutated in parallel.
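To make the idea concrete, here is a rough sketch of what such an interface could look like. Everything here is hypothetical: the trait types, the function names, and `FakeShardedArray` are illustrations only, not existing DiskArrays.jl API.

```julia
# Hypothetical trait pair mirroring haschunks/eachchunk (names not final).
abstract type IOBlocking end
struct HasIOBlocks <: IOBlocking end   # array exposes storage-unit structure
struct NoIOBlocks <: IOBlocking end    # no information: treat as one storage unit

# Default: implementers that don't opt in report no IO-block structure.
hasioblocks(a) = NoIOBlocks()

# Stand-in for a sharded DiskArray; in a real implementation the chunk
# grid would come from eachchunk(a) and the shard layout from the format.
struct FakeShardedArray
    chunkgridsize::Tuple{Int,Int}   # number of chunks along each axis
    shardsize::Tuple{Int,Int}       # chunks per shard along each axis
end

hasioblocks(::FakeShardedArray) = HasIOBlocks()

# Return an iterator of blocks of chunk indices; chunks inside one block
# share a storage unit, different blocks can be written in parallel.
function eachioblock(a::FakeShardedArray)
    nx, ny = a.chunkgridsize
    sx, sy = a.shardsize
    (CartesianIndices((i:min(i + sx - 1, nx), j:min(j + sy - 1, ny)))
     for i in 1:sx:nx, j in 1:sy:ny)
end
```

For a 4x4 chunk grid sharded into 2x2 groups, this sketch would yield four blocks of four chunk indices each.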
This would be orthogonal to the `haschunks`/`eachchunk` pair. For example, a single NetCDF or HDF5 file may have chunks internally but consist of only a single storage unit, so `eachioblock` would return a length-1 iterator. For traditional Zarr arrays, where every chunk is a separate file, `eachioblock` would have the same length as `eachchunk`, and of course there would be mixed situations for sharded arrays.
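A downstream consumer could then parallelize writes per storage unit rather than per chunk. The sketch below assumes such an interface existed; `demo_ioblocks` is a hypothetical stand-in that splits a flat list of chunk ids into two storage units (e.g. two files behind a `ConcatDiskArray`).

```julia
using Base.Threads

# Hypothetical stand-in for eachioblock: split the chunk ids into two
# storage units, as if the array were a concatenation of two files.
demo_ioblocks(chunkids) = Iterators.partition(chunkids, cld(length(chunkids), 2))

# One task per IO block: chunks within a block are written serially by
# that task, while different blocks may be written concurrently.
function parallel_write!(chunkids, writechunk!)
    @sync for block in demo_ioblocks(chunkids)
        Threads.@spawn for c in block
            writechunk!(c)   # safe: no other task touches this block
        end
    end
    return nothing
end
```

The key property is that no two tasks ever touch chunks from the same storage unit, which is exactly the guarantee `eachioblock` would be meant to expose.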
I would be very happy about comments, naming suggestions, or alternative ideas here. @mkitti had some ideas on the sharding topic as well; I would very much appreciate your opinion.