zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Added Store.getsize

Open TomAugspurger opened this issue 1 year ago • 5 comments

Closes https://github.com/zarr-developers/zarr-python/issues/2420

One difference from Zarr v2, its getsize seemed to return -1 if the concrete backend didn't provide a getsize method. I think returning a "bad" integer like from a function that returns integers is dangerous. I've implemented a slow but correct default that just reads the object and calls len on the bytes.

[Description of PR]

TODO:

  • [x] Add unit tests and/or doctests in docstrings
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] New/modified features documented in docs/tutorial.rst
  • [ ] Changes documented in docs/release.rst
  • [ ] GitHub Actions have all passed
  • [ ] Test coverage is 100% (Codecov passes)

TomAugspurger avatar Oct 21 '24 12:10 TomAugspurger

I might be confusing myself, but I think this implementation might not be what we want... I think what users want (like us in https://github.com/zarr-developers/zarr-python/pull/2400) is the size of an Array in storage, not the size of a particular key. I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

So maybe we do need this, since the store is knows (or can figure out) what bytes are actually stored for a given array. But we also need a bit on top of it to bring it to the array level.

TomAugspurger avatar Oct 23 '24 20:10 TomAugspurger

I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

In case you want to go this direction, this method is designed for exactly such a use case

d-v-b avatar Oct 23 '24 20:10 d-v-b

I'll throw an idea into the mix. We probably want two things:

  • Store.get_size(key: str) -> int
  • Store.get_size_dir(path: str) -> int

Of course, list_dir + get_size would be the same as get_size_dir but some stores will be able to provide a fast path for the dir size.

jhamman avatar Oct 23 '24 21:10 jhamman

Thanks. Looking at how Icechunk would implement getsize is what prompted my question so I can se how a getsize_dir makes sense there.

Would you expect the size of metadata documents to show up in total for getsize_dir?

TomAugspurger avatar Oct 24 '24 12:10 TomAugspurger

I've pushed an update that adds a getsize_prefix, but am having second thoughts about whether this is worth adding to the API. It's not clear to me that a Store will always have a well-defined size for an array (things like references or store-level sharding complicate things), and so maybe it doesn't make sense to add it to the store interface.

TomAugspurger avatar Oct 24 '24 16:10 TomAugspurger

am having second thoughts about whether this is worth adding to the API.

Something like this interface

Store.get_size(key: str) -> int

would be very useful for virtualizarr, as then we can easily and efficiently learn the byte range lengths of all objects in a store, in order to ingest existing zarr as virtual zarr.

EDIT: xref https://github.com/zarr-developers/VirtualiZarr/issues/262#issuecomment-2432737281

TomNicholas avatar Nov 05 '24 17:11 TomNicholas

ah, we are getting some test failures after bringing in the latest changes from main

d-v-b avatar Nov 14 '24 15:11 d-v-b

Should be all set now.

TomAugspurger avatar Nov 14 '24 16:11 TomAugspurger