Optimization for LRUStoreCache.__contains__
Closes https://github.com/zarr-developers/zarr/issues/295.
As mentioned in https://github.com/zarr-developers/zarr/issues/295, LRUStoreCache.__contains__ made a call to self._keys, which can take a long time: if, for example, we have an array with terabytes of data and millions of keys stored in the cloud, the call to self._keys will first list all of those keys and cache them. This PR removes the self._keys call from __contains__. The call to self._store.keys() still exists in the keys method, so keys will still be cached there, just not when the __contains__ method is called.
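The idea can be sketched roughly as follows (a simplified illustration, not zarr's actual implementation; the class and attribute names such as `_contains_cache` are hypothetical):

```python
# Sketch: cache per-key membership answers instead of listing every key.
class ContainsCachingStore:
    """Wraps a store and caches membership checks without enumerating keys."""

    def __init__(self, store):
        self._store = store
        self._contains_cache = {}  # key -> bool

    def __contains__(self, key):
        # Before the change: `key in self._keys()` listed *every* key in the
        # store. Here we ask the underlying store about this one key only,
        # and remember the answer.
        if key not in self._contains_cache:
            self._contains_cache[key] = key in self._store
        return self._contains_cache[key]

    def _invalidate(self):
        # Any mutation may change membership, so drop the cached answers.
        self._contains_cache.clear()

    def __setitem__(self, key, value):
        self._store[key] = value
        self._invalidate()

    def __delitem__(self, key):
        del self._store[key]
        self._invalidate()
```

The first `key in cache` check hits the underlying store; repeated checks for the same key are answered from the cache until a write or delete invalidates it.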
TODO:
- [ ] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] Docs build locally (e.g., run tox -e docs)
- [ ] AppVeyor and Travis CI passes
- [ ] Test coverage is 100% (Coveralls passes)
My immediate thought here: what happens if the underlying store has changed since the last caching of the keys?
If you look at the __delitem__ and __setitem__ methods of LRUStoreCache, the cache is invalidated whenever they are called. Are you talking about the case where some other user of the underlying store changes it since the last caching of the keys? In that case, I think the problem existed even before this PR, right?
Or is there some special case that I have missed?
OK, OK, I got what this is doing - sorry it didn't immediately click - this is for calls to __contains__ on the target store, where you haven't explicitly listed the keys.
Is there a case for having the contains_cache be size-limited, since we are thinking there may be a very large number of keys?
I am OK with the implementation, but I haven't looked at the tests yet.
(the point about the target storage changing from some other process does stand, but is general to this caching mechanism, so not relevant to this particular change)
I initially did think of limiting the size of contains_cache, but note that in LRUStoreCache the keys are still being cached, just through the keys method rather than the contains method: if a call to _keys is made, ALL the keys will be cached. So I was wondering whether it makes sense to put a size limit on the contains cache when there is no size limit on the keys cache. I guess we could:
- put a size limit on both keys cache and contains cache(the default size for both being unlimited to maintain backwards compatibility)
- put a size limit on contains cache and completely remove keys caching
I like the 2nd one, but it would break backwards compatibility, so I'm not sure which of the two options we should go with.
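For reference, option 1 (a size-limited contains cache) could look something like this sketch, with LRU eviction and `max_size=None` preserving the current unlimited behaviour for backwards compatibility (class and attribute names here are illustrative only):

```python
from collections import OrderedDict

class BoundedContainsCache:
    """Membership cache with optional LRU eviction (hypothetical sketch)."""

    def __init__(self, store, max_size=None):
        self._store = store
        self._max_size = max_size  # None means unlimited, as today
        self._cache = OrderedDict()  # key -> bool, ordered by recency

    def __contains__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        result = key in self._store
        self._cache[key] = result
        if self._max_size is not None and len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return result
```

With `max_size=None` this behaves like an unbounded cache; with a limit set, the oldest membership answers are dropped first, bounding memory even for stores with millions of keys.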
Hi @shikharsg. Just going through open issues for a review and found this linked from #295. Any thoughts from your side?
This has unfortunately gone stale. Seems like it was quite close. Will it be revived or should we close this?
Closing this out as stale. We'll keep the feature request open.