
Sharding array chunks across hashed sub-directories

shoyer opened this issue 3 years ago · 4 comments

Consider the case where we want to concurrently store and read many array chunks (e.g., millions). Most distributed storage systems can handle this level of concurrency in principle, but Zarr's default chunk keys of the form {array_name}/{i}.{j}.{k} work against it:

  • Distributed filesystems like Lustre recommend against storing more than thousands of files in a single directory: https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html
  • AWS S3 suggests splitting requests across multiple "prefixes" (aka directories): https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
  • Google Cloud Storage recommends avoiding sequential file names: https://cloud.google.com/storage/docs/request-rate

If we store a 10 TB array as a million 10 MB files in a single directory with sequential names, it would violate all of these guidelines!

It seems like a better strategy would be to store array chunks (and possibly other Zarr keys) across multiple sub-directories, where the sub-directory name is some apparently random but deterministic function of the key, e.g., of the form {array_name}/{hash_value}/{i}.{j}.{k} or {hash_value}/{array_name}/{i}.{j}.{k}. Here hash_value would be produced by applying any reasonable hash function to the original key {array_name}/{i}.{j}.{k}.
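To make the idea concrete, here is a minimal sketch (plain Python, all names hypothetical); the choice of MD5 and 256 buckets is just an assumption for illustration, not something defined by any Zarr spec:

```python
import hashlib

# Hypothetical sketch, not part of any Zarr spec: derive a hashed
# sub-directory prefix for a chunk so that chunks spread across a
# fixed number of "buckets" (prefixes).
def hashed_chunk_key(array_name, chunk_index, num_buckets=256):
    """Return a key of the form {array_name}/{hash_value}/{i}.{j}.{k}."""
    index_part = ".".join(str(i) for i in chunk_index)
    original_key = f"{array_name}/{index_part}"
    # Any stable, well-mixed hash works; MD5 is used here only for illustration.
    digest = int(hashlib.md5(original_key.encode()).hexdigest(), 16)
    hash_value = format(digest % num_buckets, "02x")
    return f"{array_name}/{hash_value}/{index_part}"

# e.g. hashed_chunk_key("temperature", (12, 3, 7))
# -> "temperature/<two-hex-digit bucket>/12.3.7"
```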

The right number of hash buckets would depend on the performance characteristics of the underlying storage system. But the key feature is that the random prefixes/directory names make it easier to shard load and avoid the typical performance bottleneck of reading/writing many nearby keys at the same time.

Ideally the specific hashing function / naming scheme (including the number of buckets) would be stored as part of the Zarr metadata in a standard way, so as to facilitate reading/writing data with different implementations.
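As a purely illustrative sketch of what that metadata could record (none of these field names exist in any Zarr spec):

```python
# Hypothetical metadata describing the key-hashing scheme, so that other
# implementations could reconstruct the same keys. Field names are invented
# for illustration and are not defined by any Zarr spec.
key_transform_metadata = {
    "key_transform": "hashed_prefix",       # name of the hypothetical scheme
    "hash_function": "md5",                 # hash applied to the original key
    "num_buckets": 256,                     # number of sub-directory prefixes
    "prefix_position": "after_array_name",  # {array_name}/{hash}/{i}.{j}.{k}
}
```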

Any thoughts? Has anyone considered this sort of solution, or encountered these scaling challenges in the wild? I don't have a strong need for this feature yet, but I imagine I may soon.

shoyer · May 09 '21 21:05