Added Store.getsize
Closes https://github.com/zarr-developers/zarr-python/issues/2420
One difference from Zarr v2: its getsize seemed to return -1 if the concrete backend didn't provide a getsize method. I think returning a "bad" integer like -1 from a function that returns integers is dangerous. I've implemented a slow but correct default that just reads the object and calls len on the bytes.
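Roughly, the default amounts to this (a sketch of the approach; the exact buffer-prototype plumbing in the base class may differ):

```python
from zarr.core.buffer import default_buffer_prototype


# Sketch of the Store base-class fallback: correct for any store, but slow,
# since it fetches the whole value just to measure it. Concrete stores that
# can stat objects cheaply should override getsize.
async def getsize(self, key: str) -> int:
    value = await self.get(key, prototype=default_buffer_prototype())
    if value is None:
        raise FileNotFoundError(key)
    return len(value)
```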
TODO:
- [x] Add unit tests and/or doctests in docstrings
- [x] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] GitHub Actions have all passed
- [ ] Test coverage is 100% (Codecov passes)
I might be confusing myself, but I think this implementation might not be what we want... I think what users want (like us in https://github.com/zarr-developers/zarr-python/pull/2400) is the size of an Array in storage, not the size of a particular key. I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...
So maybe we do need this, since the store knows (or can figure out) what bytes are actually stored for a given array. But we also need a bit on top of it to bring it to the array level.
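For illustration, an array-level helper built on top of the store could look roughly like this (array_nbytes_stored is a hypothetical name; it assumes the store's list_prefix yields every key under the array's path):

```python
async def array_nbytes_stored(store, array_path: str) -> int:
    # Hypothetical helper: walk every key under the array's prefix
    # (chunk keys plus metadata documents) and add up their stored sizes.
    total = 0
    async for key in store.list_prefix(array_path):
        total += await store.getsize(key)
    return total
```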
> I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...
In case you want to go this direction, this method is designed for exactly such a use case.
I'll throw an idea into the mix. We probably want two things:
- Store.get_size(key: str) -> int
- Store.get_size_dir(path: str) -> int
Of course, list_dir + get_size would be the same as get_size_dir, but some stores will be able to provide a fast path for the dir size.
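Spelled out, the proposal is roughly this pair of methods on the store (hypothetical names from this comment, not necessarily what lands):

```python
# Sketched methods on the Store ABC:

async def get_size(self, key: str) -> int:
    """Size in bytes of the value stored at ``key``."""
    ...

async def get_size_dir(self, path: str) -> int:
    """Total stored size in bytes of everything under ``path``.

    Default: list the keys under ``path`` and sum get_size over them;
    stores whose listing already reports sizes can override with a fast path.
    """
    ...
```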
Thanks. Looking at how Icechunk would implement getsize is what prompted my question, so I can see how a getsize_dir makes sense there.
Would you expect the size of metadata documents to show up in total for getsize_dir?
I've pushed an update that adds a getsize_prefix, but am having second thoughts about whether this is worth adding to the API. It's not clear to me that a Store will always have a well-defined size for an array (things like references or store-level sharding complicate things), and so maybe it doesn't make sense to add it to the store interface.
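For reference, a generic default for getsize_prefix can be composed from list_prefix and getsize, something like this sketch (the actual default in the PR may batch or limit concurrency differently):

```python
import asyncio


# Sketch of a generic Store.getsize_prefix default: list every key under the
# prefix and sum the per-key sizes concurrently. Stores with cheap listing
# metadata (or their own notion of an array's stored size) can override it.
async def getsize_prefix(self, prefix: str) -> int:
    keys = [key async for key in self.list_prefix(prefix)]
    sizes = await asyncio.gather(*(self.getsize(key) for key in keys))
    return sum(sizes)
```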
> am having second thoughts about whether this is worth adding to the API.
Something like this interface:
Store.get_size(key: str) -> int
would be very useful for VirtualiZarr, as then we can easily and efficiently learn the byte-range lengths of all objects in a store, in order to ingest existing Zarr as virtual Zarr.
EDIT: xref https://github.com/zarr-developers/VirtualiZarr/issues/262#issuecomment-2432737281
Ah, we are getting some test failures after bringing in the latest changes from main.
Should be all set now.