zarr-python
Getitems: support `meta_array`
`BaseStore` now implements `getitems()`, which takes a `meta_array` argument. Besides simplifying the code, this enables stores to read directly into GPU memory: https://github.com/rapidsai/kvikio/pull/131
TODO:
- [x] Add unit tests and/or doctests in docstrings
- [x] Add docstrings and API docs for any new/modified user-facing classes and functions
- [x] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [x] GitHub Actions have all passed
- [x] Test coverage is 100% (Codecov passes)
This pull request introduces 1 alert when merging 26b7d2d49f6daa073cc4dd59d7903f936cb72e41 into b677db3409d5d8f7b4a5f2e28df5d33e66c5098a - view on LGTM.com
new alerts:
- 1 for Signature mismatch in overriding method
Codecov Report

Merging #1131 (a1d3520) into main (4b0705c) will not change coverage. The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##              main    #1131  +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files           36       37    +1
  Lines        14799    14855   +56
========================================
+ Hits         14799    14855   +56
```

Impacted Files | Coverage Δ |
---|---|
zarr/_storage/store.py | 100.00% <100.00%> (ø) |
zarr/context.py | 100.00% <100.00%> (ø) |
zarr/core.py | 100.00% <100.00%> (ø) |
zarr/storage.py | 100.00% <100.00%> (ø) |
zarr/tests/test_storage.py | 100.00% <100.00%> (ø) |
zarr/tests/test_storage_v3.py | 100.00% <100.00%> (ø) |
zarr/tests/test_util.py | 100.00% <100.00%> (ø) |
zarr/tests/util.py | 100.00% <100.00%> (ø) |
zarr/util.py | 100.00% <100.00%> (ø) |
Is this the last PR needed by kvik for full Zarr support?

> Is this the last PR needed by kvik for full Zarr support?

Yes, with this PR and https://github.com/rapidsai/kvikio/pull/131, all building blocks should be in place to use Zarr backed by CuPy arrays!

> Is this the last PR needed by kvik for full Zarr support?
>
> Yes, with this PR and https://github.com/rapidsai/kvikio/pull/131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Just to clarify, this is a quite useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?

> Is this the last PR needed by kvik for full Zarr support?
>
> Yes, with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays!
>
> Just to clarify, this is a quite useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?
It is required in the sense that, without this PR, a backend doesn’t know the memory destination of a read. In KvikIO’s backend, we must rely on key names to infer whether the data goes to host or device memory. E.g. the following code checks for keys that should always go to host memory:

```python
if os.path.basename(fn) in [
    zarr.storage.array_meta_key,
    zarr.storage.group_meta_key,
    zarr.storage.attrs_key,
]:
```

This is a problem since a `Store` can define its own special keys with arbitrary names.
Thanks for clarifying. Do we need `setitems` as well?

> Thanks for clarifying. Do we need `setitems` as well?

I don't think it is as important as `getitems`, since a backend can always probe the input data and determine if it is host or device memory. However, `setitems` would make it possible to implement concurrent writes, which could benefit backends such as KvikIO significantly.
@joshmoore do you have thoughts on this one? 🙂
Based on @martindurant's idea, I have replaced the `meta_array` argument with the more generic `contexts` argument.

When `_chunk_getitems()` calls `getitems()`, it now provides the `meta_array` (if non-default) to the store using the new `contexts` argument.

Thoughts?
Where should we go from here, @joshmoore @jakirkham?
Sorry for dragging my feet. Something about this is making me nervous, but I haven't been able to put my finger on it. Perhaps it's the lack of an interface.
Agree, the lack of an interface/protocol is a concern. Since `contexts` doesn’t enforce anything, we are effectively saying that the client and store should negotiate their own protocol. This might become hard to maintain if we do not clearly define a protocol for each context.

To mitigate this, I have added `Context`, which defines the context as a `TypedDict`. All components that want to use `contexts` to exchange information must define said information in `Context` or in a subclass. Furthermore, all items are optional, so components must implement a default behavior for the case where an item isn’t found in the context.
```python
class Context(TypedDict, total=False):
    """A context for component specific information

    All keys are optional. Any component reading the context must provide
    a default implementation in the case a key cannot be found.

    Items
    -----
    meta_array : array-like, optional
        An array-like instance to use for determining the preferred output
        array type.
    """

    meta_array: NDArrayLike
```
Thoughts? Do you think this will suffice?
Suggested things I would add to the context, suiting my purposes and perhaps explaining why I wanted this in the first place. All of these are already known before calling getitem(s) or decoding a chunk; they will never be None.
- key string
- compressor/filters list
- whole final array shape and type
- selected slice
- slice within this chunk (this is actually calculated when processing the chunk, so I'm not sure it's immediately available without moving things around)
I think you can see how this information would allow for partial load from any storage layer that supports it, in combination with a compressor that supports it (uncompressed, blosc, zstd, maybe others); it also allows for a decoding step which is dependent on the position in the array (if it can be passed to codec.encode/decode too).
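For concreteness, the fields proposed above might be sketched as an extension of the `Context` `TypedDict`. All field names and types here are illustrative assumptions, not part of this PR:

```python
from typing import Any, List, Tuple, TypedDict


class ChunkContext(TypedDict, total=False):
    """Hypothetical per-chunk context; every key is optional."""

    meta_array: Any                      # array-like output prototype
    key: str                             # storage key of this chunk
    compressor: Any                      # compressor from the array metadata
    filters: List[Any]                   # filter list from the array metadata
    array_shape: Tuple[int, ...]         # shape of the whole final array
    array_dtype: Any                     # dtype of the whole final array
    selection: Tuple[slice, ...]         # selected slice of the final array
    chunk_selection: Tuple[slice, ...]   # slice within this chunk
```

A store or codec receiving such a context would still have to treat every key as possibly absent, per the `Context` contract above.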
Thanks for the list, @martindurant! Digging in a little bit to make sure I understand you:
- key string: (obvious)
- compressor/filters list: so this would allow different compressors or it would allow you to remove them from processing to get back more similar to what is actually stored?
- whole final array shape and type: as in the static value for the current array? in which case we could likely do this for the user, no?
- selected slice and slice within this chunk: These I'm less sure about. So when the implementation is calculating which key has been asked for, if it knows that less than a full chunk is needed, it passes that along?
All of the information in the context is supposed to be static, to allow decisions to be taken in the loader or codec. The meta-array is the only one so far that the user can have a say in (but it is fixed throughout the call to zarr).
> compressor/filters list: so this would allow different compressors or it would allow you to remove them from processing to get back more similar to what is actually stored?

No, this is just the list from the array metadata, read-only. It allows the equivalent of all the `if compressor == "blosc"` stuff currently peppered around the code to allow block-partial reading.
> selected slice and slice within this chunk: These I'm less sure about. So when the implementation is calculating which key has been asked for, if it knows that less than a full chunk is needed, it passes that along?

Yes, in `zarr.Array._chunk_getitems` we have the `chunk_select` (which part of the logical chunk we need) and `out_select` (where it goes in the final array), but they are not passed to the loader or codecs as things stand. See the gnarly code around blosc partials in the same function.
Share Josh's concern about a protocol.
If we go with the context route, think Mads' proposal might be a reasonable solution (need to think more about this). Though as there will inevitably be changes there, would suggest we add a version key-value pair as well. This way we can confirm what is supported.
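The version suggestion could look something like this. This is a hypothetical sketch only; no `version` key exists in the PR, and the function name is made up:

```python
from typing import Any, TypedDict


class Context(TypedDict, total=False):
    version: int      # hypothetical context-protocol version, e.g. 1
    meta_array: Any   # array-like output prototype


def context_is_supported(ctx: Context, max_supported: int = 1) -> bool:
    """Return True if this component understands the context's protocol."""
    # A missing version is treated as version 1, the original protocol.
    return ctx.get("version", 1) <= max_supported
```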
It would be good if we could get some progress here. Do we want:

- a `Context` argument as defined in https://github.com/zarr-developers/zarr-python/pull/1131#issuecomment-1303174124 or similar,
- a `meta_array` argument that corresponds to the `meta_array` argument used throughout the code, or
- something else?
Obviously I have my opinion, so I don't get to vote, but perhaps @joshmoore @jakirkham , can you adjudicate the way forward here?
Chatted with @jakirkham, @jbms, and @MSanKeys963 about the context here this evening. We were unsure of the (multiple? overlapping?) use cases. Would it be possible to get some details on them here or discuss at the next community meeting?
- What features are specific to the GPU use case?
- What is fsspec specific?
- Sharding specific?
- Does it need to touch zarr core?
- Do the codecs need to be "aware" and if so, why?
- Would/could @jstriebel's transformers in #1096 provide a way to "capture" the multiple gets and serialize them into a single get?
- If so, does the transformer extension need any additional metadata or features?
Basically, who needs to know what where when and why? 🙂
I will try...
> What features are specific to the GPU use case?

The array_meta itself: what kind of instance will the incoming data be put into. This was originally a standalone kwarg, but could be in the context.
> What is fsspec specific?

As far as existing FSStore workflows are concerned, nothing. However, passing a context would allow codec-level and storage-level per-chunk operations, something kerchunk is interested in, since it often is making a logical dataset out of inputs which are not homogeneous in nature.
> Sharding specific?

Passing the part of a desired chunk needed to the loader means that only required shards need be loaded. This would isolate the logic much more cleanly in the IO machinery than having the specific shard- and blosc blocks of code in the higher levels as we have now.
> Does it need to touch zarr core?

It is needed in the implementation. It is not a spec issue; the contract between zarr and the storage classes (and indeed codecs) is, I think, only described in code.
> Do the codecs need to be "aware" and if so, why?

In order for the codecs to be able to make use of the context (e.g., position-dependent parameters), it needs to be passed to them. I imagine this as an optional kwarg. All codecs should ideally have this in their encode/decode signatures, but currently existing ones would just ignore it. We can use `inspect` to determine if the kwarg can be passed, for compatibility.
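The `inspect`-based compatibility check might be sketched like this. The `context` keyword name, the helper, and the codec classes are illustrative assumptions, not actual zarr or numcodecs APIs:

```python
import inspect


def decode_with_optional_context(codec, buf, context):
    """Pass the context to codec.decode() only if its signature accepts it."""
    params = inspect.signature(codec.decode).parameters
    accepts_context = "context" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_context:
        return codec.decode(buf, context=context)
    # Legacy codec: silently fall back to the old calling convention.
    return codec.decode(buf)


class LegacyCodec:
    """A codec that predates the context argument."""

    def decode(self, buf):
        return buf


class ContextAwareCodec:
    """A codec that can use position-dependent parameters."""

    def decode(self, buf, context=None):
        return (buf, context)
```

Existing codecs keep working unchanged, while new ones can opt in simply by adding the keyword to their signature.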
> Would/could @jstriebel's transformers in https://github.com/zarr-developers/zarr-python/pull/1096 provide a way to "capture" the multiple gets and serialize them into a single get? If so, does the transformer extension need any additional metadata or features?

I don't know. In some ways, the context is a simpler way to do partial get/sets, but nothing here approaches a general transform. Zarr being the flexible and simple thing it is, it would be possible to achieve some things I am after (chunk-specific ops, irregular chunks, shards) in the storage layer alone and not touch the spec/impl for those. The exception is partial chunks, since the codec and storage have no way of knowing, when a key is requested, whether all of that key is needed or not.
We need to get this moving. If the context approach is too generic/complex, I think we should go with the original approach and add the `meta_array` argument, which is used throughout the code already.
I am working on a new cuNumeric backend, which also requires this functionality.
Resolved conflict.
I suggest we go with this approach for now and then unify `get_partial_values()` and `getitems()` later.
@joshmoore & @jakirkham - any objections to merging this as-is and deferring the broader discussions for future work?
Thanks for the ping, @rabernat. I have to say I'm liking the typed `Context` better. I will read through again ASAP.
Notice the link to FITS reading: another example of per-chunk parameters that could be fixed by passing around a context, although it's not the only way to do it. I'd say that there are a number of efforts (some more and some less well developed) waiting to see what zarr will do.
@joshmoore, how can I help progress here? Maybe an online meeting?
Sorry, @madsbk. My last comment was shortly before leaving on several weeks of travel and I lost track of this. It wasn’t my intention to hold you up, and big :+1: for the ping. A few thoughts (from smallest to biggest), though I get the feeling that you are being ping-ponged back and forth between design issues:
- If we're less than confident on this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else and deprecation/renaming would be more straightforward.
- Unlike some of the context that was going to get passed, I have the feeling that the `meta_array` is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects all with the same `meta_array` instance wrapped in a new `Context` object. As I started to think about "wildcard" keys in the `context` argument or an additional `default_context` argument, I realized that perhaps we've not gotten this abstraction right yet.
- Finally, I've still not got a clear picture of what other fields on the `Context` object are going to help with the likes of partial read. Obviously, that's less to do with your PR, but just to call back to the original discussions about whether or not we knew what the API should look like.
> If we're less than confident on this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else and deprecation/renaming would be more straightforward.

Agree, fixed.
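As a sketch, a keyword-only `contexts` argument could look like the following. The classes and method bodies here are toy stand-ins, not the actual zarr implementation:

```python
class BaseStoreSketch:
    """Toy stand-in for BaseStore; only the signature shape matters here."""

    def getitems(self, keys, *, contexts):
        # The bare `*` makes `contexts` keyword-only, leaving the second
        # positional slot free for future use. Naive default: ignore the
        # contexts and fall back to __getitem__.
        return {k: self[k] for k in keys if k in self}


class DictStore(BaseStoreSketch):
    """Minimal dict-backed store for demonstration."""

    def __init__(self):
        self._d = {}

    def __getitem__(self, key):
        return self._d[key]

    def __setitem__(self, key, value):
        self._d[key] = value

    def __contains__(self, key):
        return key in self._d
```

Calling `store.getitems(keys, {})` positionally would raise a `TypeError`, which is exactly what makes later renaming or deprecation straightforward.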
> Unlike some of the context that was going to get passed, I have the feeling that the meta_array is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects all with the same meta_array instance wrapped in a new Context object. As I started to think about "wildcard" keys in the context argument or an additional default_context argument, I realized that perhaps we've not gotten this abstraction right yet.
I have added `ConstantMap`, which is a read-only map that maps all keys to the same constant value. This fixes the issue of having to create a `Context` for each key without complicating the semantics of `getitems()`.
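For illustration, the core idea of `ConstantMap` can be sketched as a read-only `Mapping` that returns one constant for every key. This is a minimal sketch under my reading of the PR, not the actual implementation in zarr:

```python
from collections.abc import Mapping


class ConstantMap(Mapping):
    """Read-only mapping from a fixed key set to a single constant value.

    Lets getitems() receive a per-key context mapping without allocating
    N identical dicts when every chunk shares the same context.
    """

    def __init__(self, keys, constant):
        self._keys = frozenset(keys)
        self._constant = constant

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return self._constant

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)
```

Because it implements the `Mapping` ABC, a store can treat it exactly like a plain `dict` of contexts.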
> Finally, I've still not got a clear picture of what other fields on the Context object are going to help with the likes of partial read. Obviously, that's less to do with your PR, but just to call back to the original discussions about whether or not we knew what the API should look like.
I don't think any of us have a clear answer to this. Therefore, I suggest that we merge this PR and let people explore its possibilities. As long as `Context` is optional and we force `Store`s to implement a fallback if it doesn't exist, I think we are fine.
Thanks all!