
LRU cache for decoded chunks

shikharsg opened this issue 7 years ago · 29 comments

Issue: #278

The ChunkCache implemented here is an almost identical copy of the LRUStoreCache class, but with the keys cache functionality removed (the same applies to its tests).

I was hesitant to implement it as a superclass of LRUStoreCache, as the two are meant to be used in different places. While LRUStoreCache is a storage class like any other in zarr and can be wrapped around any other store, ChunkCache is specifically intended to hold decoded chunks and to be passed to an Array.
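For context, intended usage looks roughly like this (a sketch only; the import location, constructor signature and chunk_cache keyword follow this PR's design and may change):

```python
import zarr
from zarr.storage import DirectoryStore

# the ChunkCache import location, its max_size argument and the chunk_cache
# keyword are assumptions based on this PR's design and may differ
from zarr import ChunkCache

store = DirectoryStore('data/example.zarr')
chunk_cache = ChunkCache(max_size=2**28)  # ~256 MiB of decoded chunks
z = zarr.open(store, mode='a', shape=(10000, 10000), chunks=(1000, 1000),
              dtype='f8', chunk_cache=chunk_cache)
z[:1000, :1000]  # first read decodes the chunks and fills the cache
z[:1000, :1000]  # repeat read can be served from the cache, skipping decode
```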

I have implemented a read-write cache, i.e. in addition to caching chunks when reading from the array, a chunk that is written to is cached in the ChunkCache object rather than invalidated (which is what a read-only cache would do). There is a problem with this approach, as the three tests below are failing. Without a ChunkCache, data that is written to a zarr array and then read back always passes through the encode phase on write and the decode phase on read. With the write cache as implemented here, a write followed by a read may skip the encode-decode round trip entirely. For the following three tests to pass, it is imperative that the data go through encode-decode when written and subsequently read:

Tracebacks can be seen in Travis CI

I wonder if there is a way to get around this while keeping the write cache, which makes me question whether I should implement a write cache at all. The above three tests will pass if we implement only a read cache.
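For clarity, the difference between the two policies in the write path, reduced to a rough sketch (illustrative names, not the actual PR code):

```python
import numpy as np

def write_with_read_only_cache(store, cache, ckey, chunk, encode):
    # store the encoded chunk, then invalidate: the next read re-decodes from the store
    store[ckey] = encode(chunk)
    cache.pop(ckey, None)

def write_with_read_write_cache(store, cache, ckey, chunk, encode):
    # store the encoded chunk, but also keep the decoded chunk:
    # the next read may be served from the cache and skip encode-decode entirely
    store[ckey] = encode(chunk)
    cache[ckey] = np.copy(chunk)
```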

Finally, the chunk cache logic as implemented in the _chunk_getitem function makes that function much harder to read than it already was. I can refactor for better readability if desired.

TODO:

  • [x] Add unit tests and/or doctests in docstrings
  • [x] Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or pytest -v --doctest-modules zarr)
  • [x] Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or pytest -v zarr)
  • [x] PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [x] New/modified features documented in docs/tutorial.rst
  • [x] Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • [ ] Changes documented in docs/release.rst
  • [x] Docs build locally (e.g., run tox -e docs)
  • [x] AppVeyor and Travis CI passes
  • [x] Test coverage to 100% (Coveralls passes)

shikharsg avatar Oct 11 '18 11:10 shikharsg

Regarding the msgpack failure, maybe that test needs to be skipped for the array tests with chunk cache; the behaviour with msgpack is broken, and the behaviour when using a chunk cache is in fact correct.

Regarding the test failure about mismatching dtypes, I don't know what's causing that. Maybe worth revising the implementation of _chunk_getitem() as suggested above and seeing if it persists.

Seems worth pursuing the write cache implementation; there should be a way to deal with these test issues.

alimanfoo avatar Oct 11 '18 16:10 alimanfoo

Thanks for the review. Very helpful.

For fixing TestArrayWithChunkCache.test_dtypes, I have made a special case in the buffer_size function to detect 'Mm' dtypes, as you suggested.
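Roughly, the special case looks like this (a sketch only; the actual buffer_size lives in zarr.util and may differ):

```python
import numpy as np

def buffer_size(v):
    # datetime64/timedelta64 arrays (dtype kinds 'M' and 'm') cannot be exposed
    # via the buffer protocol, so report their size directly
    if isinstance(v, np.ndarray) and v.dtype.kind in 'Mm':
        return v.nbytes
    return memoryview(v).nbytes
```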

The reason for the TestArrayWithChunkCache.test_object_arrays_vlen_array failure is slightly subtle. In that test we intend to apply this filter: VLenArray('<u4'). As you can see in the snippet below (taken from here), the dtype of the variable-length arrays does not change until an encode-decode round trip happens:

>>> import numcodecs
>>> import numpy as np
>>> x = np.array([[1, 3, 5], [4], [7, 9]], dtype='object')
>>> codec = numcodecs.VLenArray('<i4')
>>> codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([4], dtype=int32),
       array([7, 9], dtype=int32)], dtype=object)

When caching during _chunk_setitem, the VLenArray('<u4') filter is not applied and encode-decode does not happen, hence the test failure. The only workaround I can think of is to apply encode then decode to the chunk when write-caching and the filter is VLenArray. See my comment above in the _chunk_setitem_nosync function.

Will make the relevant changes to the docs and class naming shortly.

shikharsg avatar Oct 13 '18 10:10 shikharsg

Thanks @shikharsg. The test_object_arrays_vlen_array situation is tricky. It exposes a potential difficulty in general with object arrays and caching on write. I can see a few possible options at the moment...

Option 1a. Don't support caching on write at all.

Option 1b. Don't support caching on write for object arrays.

Option 2. Relax the test.

Option 3. Round-trip the chunk through encode/decode before caching.

FWIW I would be fine with option 1a or 1b if you want to get the read caching in now and leave write caching for later. The primary use cases Zarr targets are write-once-read-many, so read caching is the priority.

I think I'd also be OK with option 2. However, there is another issue that I just realised would need to be addressed: during write, the chunk is saved to the cache too early, so there is a possibility that the chunk gets cached but a failure subsequently occurs during encoding or storing. This could arise, for example, if an object array is passed that cannot be encoded (e.g., as a vlen array). This could be addressed by moving the call that sets the chunk in the cache to be the very last statement inside the method, i.e., the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            self._chunk_cache[ckey] = np.copy(chunk)

Option 3 also seems OK for object arrays. There is a way this could be done that would handle a wider range of scenarios for object arrays, if the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            if self._dtype == object:
                # ensure cached chunk has been round-tripped through encode/decode
                chunk = self._decode_chunk(cdata)
            self._chunk_cache[ckey] = np.copy(chunk)

Note that this still would mean that using a chunk cache would slow down writing of object arrays, because it would incur an extra decode as chunks are written.

alimanfoo avatar Oct 14 '18 18:10 alimanfoo

I have moved the write-cache call to the very last statement inside the method. I have also implemented caching for 0-d arrays.

About the test_object_arrays_vlen_array problem, option 3 (but just for dtype=object arrays) seems best to me, and I have implemented it as you suggested. We might want to document the slowdown for write-cached object arrays somewhere. Which part of the docs should I put this in?

shikharsg avatar Oct 15 '18 08:10 shikharsg

After hitting a few bugs in LRUChunkCache and seeing that it didn't have enough tests, I figured that some of the tests from the base class StoreTests could be used for LRUChunkCache, but not all of them. So I separated the methods of StoreTests into two classes, MutableMappingStoreTests and StoreTests, the latter inheriting from the former. StoreTests, in addition to the methods of MutableMappingStoreTests, has test_hierarchy and the test_init* methods, and the LRUChunkCache tests inherit from MutableMappingStoreTests; a sketch of the split follows below. Doing this also let me easily fix a few bugs in LRUChunkCache. I hope this is not too big a change; if you would like to shift any methods from StoreTests to MutableMappingStoreTests or vice versa, or revert this change entirely, do let me know.
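In outline, the split looks something like this (a sketch with illustrative test bodies, not the actual test code):

```python
class MutableMappingStoreTests:
    """Tests applicable to any MutableMapping-like store, e.g. LRUChunkCache."""

    def create_store(self):
        raise NotImplementedError  # each concrete test class provides its store

    def test_get_set_del_contains(self):
        store = self.create_store()
        store['foo'] = b'bar'
        assert 'foo' in store
        assert store['foo'] == b'bar'
        del store['foo']
        assert 'foo' not in store


class StoreTests(MutableMappingStoreTests):
    """Adds the zarr-specific tests (test_hierarchy, the test_init_* methods)."""
```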

I would like to work on the _chunk_getitem() simplification for a little more time, if that's okay. Will get back on this soon.

I would certainly go with documenting this feature as experimental, which would allow us to make API changes if needed. Do let me know where I could document this.

shikharsg avatar Oct 20 '18 10:10 shikharsg

Thanks for pushing forward on this. I'm out of radio contact for two weeks now, but I think this is moving in a good direction, so please feel free to continue as you see fit. It will be great if we can keep the code as simple and easy to understand as possible, with good separation of concerns and minimal code duplication.


alimanfoo avatar Oct 20 '18 10:10 alimanfoo

Maybe we should be looking to pull out common code from the two LRU classes into a common base class e.g. LRUMappingCache. Might also help with some refactoring of tests.
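Something along these lines, perhaps (a rough sketch only, with illustrative names and simplified size accounting):

```python
from collections import OrderedDict
from collections.abc import MutableMapping
from threading import RLock


class LRUMappingCache(MutableMapping):
    """Possible common base for LRUStoreCache and LRUChunkCache (sketch)."""

    def __init__(self, max_size=None):
        self._max_size = max_size
        self._current_size = 0
        self._cache = OrderedDict()
        self._mutex = RLock()

    def _value_size(self, value):
        # subclasses override: encoded bytes for the store cache,
        # ndarray.nbytes for the decoded-chunk cache
        return len(value)

    def __getitem__(self, key):
        with self._mutex:
            value = self._cache[key]
            self._cache.move_to_end(key)  # mark as most recently used
            return value

    def __setitem__(self, key, value):
        with self._mutex:
            if key in self._cache:
                del self[key]
            size = self._value_size(value)
            if self._max_size is not None:
                while self._cache and self._current_size + size > self._max_size:
                    _, evicted = self._cache.popitem(last=False)  # evict LRU entry
                    self._current_size -= self._value_size(evicted)
            self._cache[key] = value
            self._current_size += size

    def __delitem__(self, key):
        with self._mutex:
            value = self._cache.pop(key)
            self._current_size -= self._value_size(value)

    def __iter__(self):
        return iter(self._cache)

    def __len__(self):
        return len(self._cache)
```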

alimanfoo avatar Oct 21 '18 14:10 alimanfoo

Apologies for the late reply. I have pulled out some common code from the two classes into LRUMappingCache. You can see it in a PR here: https://github.com/shikharsg/zarr/pull/3/files. I haven't committed it to the PR branch yet, as I'm not sure if that's what you're asking for. So LRUMappingCache would be like an abstract base class, not to be used anywhere except as a superclass? If so, is what I have done in the link above correct?

About the refactoring of the tests (https://github.com/zarr-developers/zarr/pull/306/commits/6fac2adc4ab20289e84c254a3cb259a8717b4a14), so that some of the StoreTests can be used for LRUChunkCache: is that OK? Or should I revert it and implement a separate class?

shikharsg avatar Nov 10 '18 11:11 shikharsg

@alimanfoo, I have slightly simplified the _chunk_getitem function and excluded the chunk cache case from the optimization block.

Apart from refactoring of the store tests, is there anything left to do? Otherwise I think this would be good to go in?

shikharsg avatar Nov 27 '18 06:11 shikharsg

Hello @shikharsg! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-09-23 14:52:22 UTC

pep8speaks avatar Feb 10 '19 12:02 pep8speaks

Hi all. Sorry for the massive delays on this.

I have refactored the LRUStoreCache and LRUChunkCache to pull out common code between them. I have also done the same for the tests of LRUStoreCache and LRUChunkCache.

@alimanfoo could you give a review to see if your concerns have been addressed?

Other thoughts, @zarr-developers/core-devs?

shikharsg avatar May 04 '19 10:05 shikharsg

It looks like the write_direct path in _chunk_getitem prevents caching in many cases. Maybe disable write_direct if both a cache and a compressor are attached.
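In condition form, the suggestion is roughly this (a sketch with illustrative names; the real check lives inside Array._chunk_getitem):

```python
def should_write_direct(is_whole_chunk_read, has_filters, is_object_dtype,
                        has_compressor, has_chunk_cache):
    # the existing fast path decodes straight into the destination buffer,
    # so no standalone decoded chunk exists that could be cached
    if not is_whole_chunk_read or has_filters or is_object_dtype:
        return False
    # suggested change: fall back to the normal path whenever a chunk cache
    # wants to hold the decoded chunk and a compressor is attached
    if has_compressor and has_chunk_cache:
        return False
    return True
```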

jobh avatar Aug 22 '19 12:08 jobh

It looks like the write_direct path in _chunk_getitem prevents caching in many cases. Maybe disable write_direct if both a cache and a compressor are attached.

At the moment, this is the behaviour, yes. I guess it's not being picked up by tests because there are no tests for this case. Will try adding some tests for this soon.

shikharsg avatar Aug 31 '19 11:08 shikharsg

Am curious what the state of things are here. Is more work needed? Does it need a review? 🙂

jakirkham avatar Jan 06 '20 12:01 jakirkham

@jakirkham most of the code in this PR has been reviewed and approved by @alimanfoo.

There are some concerns about refactoring the LRUStoreCache and LRUChunkCache code (and the related tests) to pull out the common logic from both. I have done that in the last few commits, but it hasn't been reviewed yet. It would be great if this could be reviewed.

shikharsg avatar Jan 07 '20 12:01 shikharsg

Do we happen to know why the tests are failing? Or is that expected at this stage?

jakirkham avatar Sep 30 '20 20:09 jakirkham

Do we happen to know why the tests are failing? Or is that expected at this stage?

Sorry, that was just because of one failing doctest which I have now fixed.

shikharsg avatar Sep 30 '20 21:09 shikharsg

Hi @shikharsg, I tried my hand at merging origin/master into your branch. The diff looks sensible but the new code path at https://github.com/zarr-developers/zarr-python/pull/306/commits/8c8a69f002e20c6a0af35afaae83cb48e7286d53#diff-f640a0c570765203632323d71a9d1145ec6121c90db73f56e1271ab0efd5b8d8R1632 might need looking into. All the best, ~Josh

joshmoore avatar Feb 19 '21 08:02 joshmoore

Codecov Report

Merging #306 (06c899b) into main (cbe371f) will increase coverage by 0.00%. Report is 388 commits behind head on main. The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #306    +/-   ##
========================================
  Coverage   99.94%   99.94%            
========================================
  Files          31       31            
  Lines       10936    11135   +199     
========================================
+ Hits        10930    11129   +199     
  Misses          6        6            
Files Coverage Δ
zarr/__init__.py 100.00% <ø> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/creation.py 100.00% <ø> (ø)
zarr/hierarchy.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)

codecov[bot] avatar Feb 20 '21 20:02 codecov[bot]

Thanks @joshmoore. Turned out be an easy fix actually.

shikharsg avatar Feb 20 '21 21:02 shikharsg

What's the next step here?

joshmoore avatar Mar 25 '21 14:03 joshmoore

In terms of functionality, it's all working. But there were some concerns regarding refactoring the code. So this PR needs a detailed review. Otherwise it's good to go in.

shikharsg avatar Apr 05 '21 16:04 shikharsg

Quick question: does this layer respect get/setitems for concurrent access with fsspec?

I ask because fsspec, of course, has its own idea of caching. That is file-oriented rather than chunk-wise, but with ReferenceFileSystem, chunks at other remote URLs can be regarded as standalone files too. It would be nice to have fsspec's caching layer also respect LRU semantics (i.e., number of entries, or size of cache). That is orthogonal and shouldn't delay this PR any further.
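For reference, fsspec's file-oriented caching wraps a store roughly like this (a sketch; the bucket and paths are hypothetical):

```python
import fsspec
import zarr

# file-level caching on the fsspec side, orthogonal to the chunk cache in this PR
store = fsspec.get_mapper(
    "simplecache::s3://example-bucket/data.zarr",        # hypothetical dataset
    s3={"anon": True},                                    # options for the remote layer
    simplecache={"cache_storage": "/tmp/fsspec-cache"},   # local cache directory
)
z = zarr.open(store, mode="r")  # reads pull encoded chunks through the local cache
```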

martindurant avatar Apr 05 '21 20:04 martindurant

Any chance this can be reviewed? After quite some profiling I realized that LRUStoreCache does not offer the performance benefits I expected, because of this decoding issue.

If I can do anything to move this PR forward, let me know.

niowniow avatar Jun 28 '21 14:06 niowniow

I think the issue is mainly that the Array class is responsible for decompressing. Is there a reason the de/compressor is implemented this way? I haven't checked through the code in detail yet. Please bear with me if the following is nonsense:

My idea basically is to move all functionality of the compressor out of the Array class. As an alternative, one could implement a compressor that can be wrapped around a store, similar to how LRUStoreCache is wrapped around one. Then we could write e.g. store = LRUStoreCache(CompressorWrapper(DirectoryStore(...))), and the LRUChunkCache wouldn't be necessary. Or is there another benefit to using the LRUChunkCache?

In the Array class we could remove __compressor and instead wrap the store with CompressorWrapper if there is no CompressorWrapper in the store chain already. I guess the default would be to wrap the store unless the user explicitly disables it because they have wrapped it themselves. All parameters given to Array could just be propagated through to its CompressorWrapper instance.
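Roughly, such a wrapper could look like this (hypothetical sketch, not part of zarr; it ignores the question of which keys are chunks vs. metadata, see the edit below):

```python
from collections.abc import MutableMapping


class CompressorWrapper(MutableMapping):
    """Hypothetical store wrapper that encodes on write and decodes on read."""

    def __init__(self, store, compressor):
        self._store = store
        self._compressor = compressor  # a numcodecs codec, e.g. Blosc()

    def __getitem__(self, key):
        # note: metadata keys such as '.zarray' would need to bypass decoding
        return self._compressor.decode(self._store[key])

    def __setitem__(self, key, value):
        self._store[key] = self._compressor.encode(value)

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)
```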

Edit: This would then probably require to distinguish chunk keys from other data.

niowniow avatar Jul 05 '21 15:07 niowniow

I had another look at this PR. There is one corner case not covered: if write_direct (here) is true, i.e. we request one whole chunk, it will not be cached. I don't know if that was intended.

niowniow avatar Jul 16 '21 12:07 niowniow

Merged v2.9.3 into this branch.

joshmoore avatar Aug 27 '21 12:08 joshmoore

@ericpre: from https://github.com/hyperspy/hyperspy/pull/2798#issuecomment-916060062, do I understand correctly that you've tested this and it's working for you?

joshmoore avatar Sep 23 '21 14:09 joshmoore

@ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?

Yes, I have tested this branch and I can confirm the speed improvement. If you would like more details, I would need to check again, but I won't have time to do so for a week or so.

ericpre avatar Sep 23 '21 16:09 ericpre

I would be very interested in this functionality. Anything I can do that would help finish this up?

croth1 avatar Oct 20 '22 13:10 croth1