[WIP] Attempt to continue LRU cache for decoded chunks
Continuation attempt of #306 - do not merge yet - ~~very likely still not correct behaviour when chunks get deleted from the store by #738~~.
I still have very limited knowledge of the code base - it will require me to dig a bit deeper to gain a bit more understanding.
TODO:
- [ ] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] GitHub Actions have all passed
- [ ] Test coverage is 100% (Codecov passes)
~~Hmh, ok - the diff of the merge commit still looks a bit insane - some code from other PRs somehow leaked in/got duplicated - or~~ I am just bad at interpreting the diff. ~~Will investigate tomorrow what happened there.~~
Kicked off the GHA workflows.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.99%. Comparing base (f361631) to head (16793f5). Report is 720 commits behind head on main.
Additional details and impacted files
```diff
@@            Coverage Diff            @@
##              main    #1214    +/-   ##
==========================================
  Coverage    99.99%   99.99%
==========================================
  Files           35       35
  Lines        14136    14366   +230
==========================================
+ Hits         14135    14365   +230
  Misses           1        1
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| zarr/__init__.py | 100.00% <ø> (ø) | |
| zarr/core.py | 100.00% <100.00%> (ø) | |
| zarr/creation.py | 100.00% <ø> (ø) | |
| zarr/hierarchy.py | 99.78% <100.00%> (+<0.01%) | :arrow_up: |
| zarr/storage.py | 100.00% <100.00%> (ø) | |
| zarr/tests/test_core.py | 100.00% <100.00%> (ø) | |
| zarr/tests/test_storage.py | 100.00% <100.00%> (ø) | |
Green. Are there any other commits expected from your side, @croth1?
I am still not entirely sure whether this is the final form, or whether I would rather re-implement it as a caching store on top of other stores, just like the LRUStoreCache - that feels a bit more intuitive to me. I also still want to check whether I can do write caching; if I remember correctly, this implementation is write-through. Not sure whether cached writes can be achieved in combination with LRUStoreCache. Need to do more research, but I'm a bit busy right now.
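For reference, a minimal sketch of the existing zarr-python v2 `LRUStoreCache` pattern this comment compares against (the store path and `max_size` are illustrative). It caches *encoded* chunks only, which is why a decoded-chunk cache like this PR would still save the per-read decompression cost:

```python
import zarr

# Wrap a base store with the existing LRU cache for *encoded* chunks
# (illustrative path and max_size; not part of this PR).
base_store = zarr.DirectoryStore('data/example.zarr')
cached_store = zarr.LRUStoreCache(base_store, max_size=2**28)  # ~256 MiB
z = zarr.open(cached_store, mode='r')

# Repeated reads of the same chunk hit the in-memory cache, but each read
# still decompresses the chunk, since only the encoded bytes are cached.
sample = z[0:100]
```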
Hi, any news on when this feature will get pushed to main? It seems like a very useful feature! I want to reduce the overhead of sampling repeatedly from the same chunk, which is expected to be in RAM. If I'm not mistaken, right now, every time you pass an index to the zarr array, it has to decompress the chunk containing that index. I want an option to keep the entire chunk decompressed in the cache until it's discarded. Specifically, I'm using a large zarr array in my pytorch dataset module and need to randomly iterate through samples in the same chunk before randomly picking the next chunk. Avoiding the decompression overhead would be quite useful, and unfortunately I can't leave the data fully decompressed on disk, as it would be too large.
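A hedged sketch of how the chunk-wise sampling pattern described above can be approximated today without this PR: read one chunk into a NumPy array, sample within it, then move on. The array path and the assumption that samples lie along axis 0 are illustrative, not taken from the PR:

```python
import numpy as np
import zarr

z = zarr.open('data/example.zarr', mode='r')  # illustrative path
chunk_len = z.chunks[0]                       # assumes samples along axis 0
n_chunks = (z.shape[0] + chunk_len - 1) // chunk_len

for chunk_idx in np.random.permutation(n_chunks):
    start = chunk_idx * chunk_len
    stop = min(start + chunk_len, z.shape[0])
    chunk = z[start:stop]                     # chunk decompressed once into RAM
    for i in np.random.permutation(len(chunk)):
        sample = chunk[i]                     # no per-sample decompression
```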
@FarzanT @joshmoore, I have in the meantime changed my approach and no longer need this myself - hence the long silence. If it's useful for other people, we can try finishing it up as is. It would need some rebasing and performance testing, though.
IIRC, last time I checked all tests were green, although this would need a very thorough review, because I frequently wasn't 100% sure what I was doing during the fairly substantial rebase.
I'll leave @FarzanT and others to say how pressing their need is. Happy to help however I can, @croth1, if you'd like to pursue this.
Thank you @croth1 and @joshmoore, my primary use case has also been addressed by https://github.com/zarr-developers/zarr-python/issues/278#issuecomment-906527765, so it's not a pressing issue for me at the moment. But I'd say it would be much nicer to just flip a switch and have zarr handle this internally. If this pull request can address this, then it shouldn't be abandoned IMO!
I'm going to close this as stale. Folks should feel free to reopen if there is interest in continuing this work.