Optimization for LRUStoreCache.__contains__
Closes https://github.com/zarr-developers/zarr/issues/295.
As mentioned in https://github.com/zarr-developers/zarr/issues/295, LRUStoreCache.__contains__ made a call to self._keys, which can take a long time: if, for example, we have an array with terabytes of data and millions of keys stored in the cloud, the call to self._keys will first list all of those keys and cache them. This PR removes the self._keys call from __contains__. The call to self._store.keys() still exists in the keys method, so keys will still be cached there, just not when the __contains__ method is called.
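The idea can be sketched roughly as follows (a simplified illustration, not zarr's actual implementation; the class and attribute names such as `_contains_cache` are hypothetical):

```python
# Sketch: cache per-key membership answers instead of listing every key.
class ContainsCachingStore:
    """Wraps a store and caches membership checks without enumerating keys."""

    def __init__(self, store):
        self._store = store
        self._contains_cache = {}  # key -> bool

    def __contains__(self, key):
        # Before the change: `key in self._keys()` listed *every* key in the
        # store. Here we ask the underlying store about this one key only,
        # and remember the answer.
        if key not in self._contains_cache:
            self._contains_cache[key] = key in self._store
        return self._contains_cache[key]

    def _invalidate(self):
        # Any mutation may change membership, so drop the cached answers.
        self._contains_cache.clear()

    def __setitem__(self, key, value):
        self._store[key] = value
        self._invalidate()

    def __delitem__(self, key):
        del self._store[key]
        self._invalidate()
```

The first `key in cache` check hits the underlying store; repeated checks for the same key are answered from the cache until a write or delete invalidates it.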
TODO:
- [ ] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] Docs build locally (e.g., run tox -e docs)
- [ ] AppVeyor and Travis CI passes
- [ ] Test coverage is 100% (Coveralls passes)
My immediate thought here: what happens if the underlying store has changed since the last caching of the keys?
If you look at the __delitem__ and __setitem__ methods of LRUStoreCache, the cache is invalidated whenever they are called. Are you talking about the case where some other user of the underlying store changes it since the last caching of the keys? In that case, I think the problem existed even before this PR, right?
Or is there some special case that I have missed?
OK, OK, I got what this is doing - sorry it didn't immediately click - this is for calls to __contains__ on the target store, where you haven't explicitly listed the keys.
Is there a case for having the contains_cache be size-limited, since we are thinking there may be a very large number of keys?
I am OK with the implementation, but I haven't looked at the tests yet.
(the point about the target storage changing from some other process does stand, but is general to this caching mechanism, so not relevant to this particular change)
I initially did think of limiting the size of contains_cache, but note that in LRUStoreCache the keys are still being cached, just through the keys method rather than the contains method: if a call to _keys is made, ALL the keys will be cached. So I was wondering whether it makes sense to put a size limit on the contains cache when there is no size limit on the keys cache. I guess we could:
- put a size limit on both keys cache and contains cache(the default size for both being unlimited to maintain backwards compatibility)
- put a size limit on contains cache and completely remove keys caching
I like the 2nd one, but it would break backwards compatibility, so I'm not sure which of the two options we should go with.
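For reference, option 1 (a size-limited contains cache) could look something like this sketch, with LRU eviction and `max_size=None` preserving the current unlimited behaviour for backwards compatibility (class and attribute names here are illustrative only):

```python
from collections import OrderedDict

class BoundedContainsCache:
    """Membership cache with optional LRU eviction (hypothetical sketch)."""

    def __init__(self, store, max_size=None):
        self._store = store
        self._max_size = max_size  # None means unlimited, as today
        self._cache = OrderedDict()  # key -> bool, ordered by recency

    def __contains__(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        result = key in self._store
        self._cache[key] = result
        if self._max_size is not None and len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return result
```

With `max_size=None` this behaves like an unbounded cache; with a limit set, the oldest membership answers are dropped first, bounding memory even for stores with millions of keys.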
Hi @shikharsg. Just going through open issues for a review and found this linked from #295. Any thoughts from your side?
This has unfortunately gone stale. Seems like it was quite close. Will it be revived or should we close this?
Closing this out as stale. We'll keep the feature request open.