zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Zarr DirectoryStore: getsize is wrong

Open Kriechi opened this issue 7 years ago • 12 comments

https://github.com/zarr-developers/zarr/blob/3c8e9291e96e090e20caf6950da69a3cc180652f/zarr/storage.py#L864-L871

This code does not account for subfolders created by this example:

store = zarr.TempStore(dir='/tmpfs')
for name, values in data.items():
    zarr.array(values, store=store, path=name, compressor=compressor, filters=filter)
total_size = store.getsize()

total_size is 26, (24+2 for .zattrs and .zgroup), which is wrong. Subfolders are not accounted for.

My TempStore has the following structure:

.zattrs
.group
name1/
  1
  2
  3
name2/
  1
  2
  3

The file sizes of name1 and name2 are unaccounted for.

Tested with Zarr 2.2.0.

Kriechi avatar Apr 04 '18 14:04 Kriechi

Thanks Thomas, currently on leave but will follow up next week.

On Wed, 4 Apr 2018, 15:35 Thomas Kriechbaumer, [email protected] wrote:

https://github.com/zarr-developers/zarr/blob/3c8e9291e96e090e20caf6950da69a3cc180652f/zarr/storage.py#L864-L871

This code does not account for subfolders created by this example:

store = zarr.TempStore(dir='/tmpfs')for name, values in data.items(): zarr.array(values, store=store, path=name, compressor=compressor, filters=filter) total_size = store.getsize()

total_size is 26, (24+2 for .zattrs and .zgroup), which is wrong. Subfolders are not accounted for.

My TempStore has the following structure:

.zattrs .group name1/ 1 2 3 name2/ 1 2 3

The file sizes of name1 and name2 are unaccounted for.

Tested with Zarr 2.2.0.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zarr-developers/zarr/issues/253, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QlwB9pNbyPcTsylV21CTnmH_XefSks5tlNo8gaJpZM4TG6oy .

alimanfoo avatar Apr 04 '18 18:04 alimanfoo

Just to say the getsize() method was originally added to provide information about stored size of a single array, and so recursing into subdirectories was not needed. However I think it would be better in general if it did report recursive size. I don't have time to work on this right now but a PR would be welcome.

Also if this was changed to recursive size in DirectoryStore then might want to also review implementation in NestedDirectoryStore and ZipStore classes.

alimanfoo avatar Apr 09 '18 13:04 alimanfoo

So I guess NestedDirectoryStore would technically be wrong too or does that already recurse when determining array size.

jakirkham avatar May 10 '18 02:05 jakirkham

Yes NestedDirectoryStore is wrong, only counts immediate children of a directory (actually doesn't override getsize() so uses implementation from parent class DirectoryStore).

alimanfoo avatar May 10 '18 12:05 alimanfoo

It would be pretty simple to change this to os.walk.

If we want to get fancier, we use scandir, which includes the size while walking the tree. There's a backport package for Python 2 as well.

ref: https://stackoverflow.com/a/1392549

jakirkham avatar Dec 04 '18 07:12 jakirkham

PR ( https://github.com/zarr-developers/zarr/pull/362 ) should fix this. Just used os.walk for simplicity given we need to go through the directory recursively and scandir behaves a bit more like listdir, which would complicate this a bit.

jakirkham avatar Dec 10 '18 19:12 jakirkham

When looking into this further, it seems that getsize's behavior is currently expected to ignore nested children. This is tested here (note this should be 15 instead of 6 if recursion is used) as well as on the following lines. This test is run for all stores (e.g. DictStore also has this behavior). While I agree it probably should change to be recursive, that's a behavior change. So it would be good to make sure we are not missing something before changing this.

jakirkham avatar Dec 10 '18 20:12 jakirkham

@alimanfoo, do you have any thoughts on the comment above? Basically the current behavior seems intentional (as it is tested) so am trying to understand how we should proceed here.

jakirkham avatar Feb 14 '19 07:02 jakirkham

Sorry for slow follow up. FWIW I think it would be reasonable to change the behaviour and make it recursive. Should not affect results for reporting stored size of an array in a DirectoryStore, as there shouldn't be any sub-directories there anyway. (E.g., I often call .info on an array to see how big it is on disk.) I'm happy for the tests to be modified.

alimanfoo avatar Feb 19 '19 17:02 alimanfoo

Sorry I missed your comment, @alimanfoo. Thanks for the clarification. Happy to break the current tests as well. Just wanted to make sure there wasn't something I was missing. Will try to update PR ( https://github.com/zarr-developers/zarr/pull/362 ) when I have a chance.

jakirkham avatar Feb 26 '19 08:02 jakirkham

Is the only way to get the current estimated size by parsing the returned value of info_items in our own dictionary then indexing No. bytes stored?

hmaarrfk avatar Jul 14 '19 18:07 hmaarrfk

Dear all,

Are there any updates on this functionality (getting size of subfolders and arrays recursively)?

Thank you.

aliaksei-chareshneu avatar Feb 15 '24 13:02 aliaksei-chareshneu