
Getitems: support `meta_array`

madsbk opened this pull request 2 years ago • 10 comments

BaseStore now implements getitems(), which takes a meta_array argument. Besides simplifying the code, this enables stores to read directly into GPU memory: https://github.com/rapidsai/kvikio/pull/131

TODO:

  • [x] Add unit tests and/or doctests in docstrings
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [x] New/modified features documented in docs/tutorial.rst
  • [ ] Changes documented in docs/release.rst
  • [x] GitHub Actions have all passed
  • [x] Test coverage is 100% (Codecov passes)

madsbk avatar Sep 12 '22 12:09 madsbk

This pull request introduces 1 alert when merging 26b7d2d49f6daa073cc4dd59d7903f936cb72e41 into b677db3409d5d8f7b4a5f2e28df5d33e66c5098a - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

lgtm-com[bot] avatar Sep 12 '22 13:09 lgtm-com[bot]

Codecov Report

Merging #1131 (a1d3520) into main (4b0705c) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #1131   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        37    +1     
  Lines        14799     14855   +56     
=========================================
+ Hits         14799     14855   +56     
Impacted Files Coverage Δ
zarr/_storage/store.py 100.00% <100.00%> (ø)
zarr/context.py 100.00% <100.00%> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/tests/test_storage_v3.py 100.00% <100.00%> (ø)
zarr/tests/test_util.py 100.00% <100.00%> (ø)
zarr/tests/util.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

codecov[bot] avatar Sep 12 '22 13:09 codecov[bot]

Is this the last PR needed by kvikio for full Zarr support?

quasiben avatar Sep 21 '22 13:09 quasiben

Is this the last PR needed by kvikio for full Zarr support?

Yes with this PR and https://github.com/rapidsai/kvikio/pull/131, all building blocks should be in place to use Zarr backed by CuPy arrays!
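A rough sketch of the intended usage, assuming KvikIO's GDSStore and zarr's meta_array keyword behave as described in the linked PRs (the path and exact keyword placement are illustrative, not taken from the PR):

import cupy
import zarr
import kvikio.zarr

# Open an on-disk Zarr array through KvikIO's GPU-aware store and ask zarr
# to allocate output chunks like the given CuPy meta array.
store = kvikio.zarr.GDSStore("/path/to/array.zarr")
z = zarr.open_array(store=store, mode="r", meta_array=cupy.empty(()))

chunk = z[:]  # with getitems() + meta_array, this can land in a CuPy array
print(type(chunk))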

madsbk avatar Sep 21 '22 16:09 madsbk

Is this the last PR needed by kvikio for full Zarr support?

Yes with this PR and https://github.com/rapidsai/kvikio/pull/131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Just to clarify, this is quite a useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?

jakirkham avatar Sep 23 '22 18:09 jakirkham

Is this the last PR needed by kvikio for full Zarr support?

Yes with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays!

Just to clarify, this is quite a useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)?

It is required in the sense that, without this PR, a backend doesn’t know the memory destination of a read. In KvikIO’s backend, we must rely on key names to infer whether the data goes to host or device memory. E.g. the following code always routes well-known metadata keys to host memory:

  if os.path.basename(fn) in [
      zarr.storage.array_meta_key,
      zarr.storage.group_meta_key,
      zarr.storage.attrs_key,
  ]:
      ...  # read this key into host (NumPy) memory

This is a problem since a Store can define its own special keys with arbitrary names.

madsbk avatar Sep 26 '22 07:09 madsbk

Thanks for clarifying. Do we need setitems as well?

jakirkham avatar Sep 26 '22 08:09 jakirkham

Thanks for clarifying. Do we need setitems as well?

I don't think it is as important as getitems, since a backend can always probe the input data and determine whether it is host or device memory. However, setitems would make it possible to implement concurrent writes, which could benefit backends such as KvikIO significantly.

madsbk avatar Sep 26 '22 08:09 madsbk

@joshmoore do you have thoughts on this one? 🙂

jakirkham avatar Oct 06 '22 07:10 jakirkham

Based on @martindurant's idea, I have replaced the meta_array argument with the more generic contexts argument.
When _chunk_getitems() calls getitems(), it now provides the meta_array (if non-default) to the store using the new contexts argument.
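A minimal sketch of how a store might consume such a contexts argument; the store class and its internals are illustrative only, not code from the PR:

import numpy as np

class ToyStore:
    """Toy in-memory store illustrating the contexts idea."""

    def __init__(self):
        self._data = {}  # key -> bytes

    def __setitem__(self, key, value):
        self._data[key] = value

    def getitems(self, keys, contexts):
        out = {}
        for key in keys:
            # Each key may (or may not) come with a context; fall back to
            # host (NumPy) memory when no meta_array is provided, e.g. for
            # metadata keys.
            meta_array = contexts.get(key, {}).get("meta_array", np.empty(()))
            raw = self._data[key]
            # Allocate the output "like" the meta array; NumPy dispatches
            # empty_like to the meta array's own library, so a CuPy meta
            # array would yield device memory here.
            buf = np.empty_like(meta_array, shape=len(raw), dtype="u1")
            buf[:] = np.frombuffer(raw, dtype="u1")
            out[key] = buf
        return out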

Thoughts?

madsbk avatar Oct 12 '22 09:10 madsbk

Where should we go from here, @joshmoore @jakirkham?

martindurant avatar Nov 01 '22 13:11 martindurant

Sorry for dragging my feet. Something about this is making me nervous, but I haven't been able to put my finger on it. Perhaps it's the lack of an interface.

joshmoore avatar Nov 03 '22 20:11 joshmoore

Agree, the lack of an interface/protocol is a concern. Since contexts doesn’t enforce anything, we are effectively saying that the client and store should negotiate their own protocol. This might become hard to maintain if we do not clearly define a protocol for each context.

To mitigate this, I have added Context, which defines the context as a TypedDict. All components that want to use contexts to exchange information must define said information in Context or in a subclass. Furthermore, all items are optional, so components must provide a default implementation in case an item isn’t found in the context.

class Context(TypedDict, total=False):
    """ A context for component specific information

    All keys are optional. Any component reading the context must provide
    a default implementation in the case a key cannot be found.

    Items
    -----
    meta_array : array-like, optional
        An array-like instance to use for determining the preferred output
        array type.
    """
    meta_array: NDArrayLike

Thoughts, do you think this will suffice?

madsbk avatar Nov 04 '22 09:11 madsbk

Here are the things I would suggest adding to the context, suiting my purposes and perhaps explaining why I wanted this in the first place. All of these are already known before calling getitem(s) or decoding a chunk; they will never be None.

  • key string
  • compressor/filters list
  • whole final array shape and type
  • selected slice
  • slice within this chunk (this is actually calculated when processing the chunk, so I'm not sure it's immediately available without moving things around)

I think you can see how this information would allow for partial load from any storage layer that supports it, in combination with a compressor that supports it (uncompressed, blosc, zstd, maybe others); it also allows for a decoding step which is dependent on the position in the array (if it can be passed to codec.encode/decode too).
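A purely hypothetical sketch of what such a richer per-chunk context could look like; none of these field names exist in zarr-python, they only restate the list above:

from typing import Any, List, Optional, Tuple, TypedDict

class ChunkContext(TypedDict, total=False):
    key: str                            # store key being requested
    compressor: Any                     # codec object from the array metadata
    filters: Optional[List[Any]]        # filter list from the array metadata
    array_shape: Tuple[int, ...]        # shape of the whole final array
    array_dtype: Any                    # dtype of the whole final array
    chunk_selection: Tuple[slice, ...]  # slice needed within this chunk
    out_selection: Tuple[slice, ...]    # where the data lands in the output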

martindurant avatar Nov 06 '22 14:11 martindurant

Thanks for the list, @martindurant! Digging in a little bit to make sure I understand you:

  • key string: (obvious)
  • compressor/filters list: so this would allow different compressors or it would allow you to remove them from processing to get back more similar to what is actually stored?
  • whole final array shape and type: as in the static value for the current array? in which case we could likely do this for the user, no?
  • selected slice and slice within this chunk: These I'm less sure about. So when the implementation is calculating which key has been asked for, if it knows that less than a full chunk is needed, it passes that along?

joshmoore avatar Nov 10 '22 15:11 joshmoore

All of the information in the context is supposed to be static, to allow decisions to be taken in the loader or codec. The meta_array is the only one so far that the user can have a say in (but it is fixed throughout the call to zarr).

compressor/filters list: so this would allow different compressors or it would allow you to remove them from processing to get back more similar to what is actually stored?

No, this is just the list from the array metadata, read-only. It allows the equivalent of all the if compressor == "blosc" stuff currently peppered around the code to allow block-partial reading.

selected slice and slice within this chunk: These I'm less sure about. So when the implementation is calculating which key has been asked for, if it knows that less than a full chunk is needed, it passes that along?

Yes, in zarr.Array._chunk_getitems we have the chunk_select (which part of the logical chunk we need) and out_select (where it goes in the final array), but they are not passed to the loader or codecs as things stand. See the gnarly code around blosc partials in the same function.

martindurant avatar Nov 10 '22 16:11 martindurant

Share Josh's concern about a protocol.

If we go with the context route, I think Mads' proposal might be a reasonable solution (need to think more about this). Though as there will inevitably be changes there, I would suggest we add a version key-value pair as well. This way we can confirm what is supported.

jakirkham avatar Nov 10 '22 18:11 jakirkham

It would be good if we could get some progress here. Do we want:

  • a Context argument as defined in https://github.com/zarr-developers/zarr-python/pull/1131#issuecomment-1303174124 or similar.
  • a meta_array argument that corresponds to the meta_array argument used throughout the code.
  • something else?

madsbk avatar Dec 07 '22 13:12 madsbk

Obviously I have my opinion, so I don't get to vote, but perhaps @joshmoore @jakirkham , can you adjudicate the way forward here?

martindurant avatar Dec 08 '22 15:12 martindurant

Chatted with @jakirkham, @jbms, and @MSanKeys963 about the context here this evening. We were unsure of the (multiple? overlapping?) use cases. Would it be possible to get some details on them here or discuss at the next community meeting?

  • What features are specific to the GPU use case?
  • What is fsspec specific?
  • Sharding specific?
  • Does it need to touch zarr core?
  • Do the codecs need to be "aware" and if so, why?
  • Would/could @jstriebel's transformers in #1096 provide a way to "capture" the multiple gets and serialize them into a single get?
  • If so, does the transformer extension need any additional metadata or features?

Basically, who needs to know what where when and why? 🙂

joshmoore avatar Jan 11 '23 20:01 joshmoore

I will try...

What features are specific to the GPU use case?

The meta_array itself: what kind of array instance the incoming data will be put into. This was originally a standalone kwarg, but could be in the context.

What is fsspec specific?

As far as existing FSStore workflows are concerned, nothing. However, passing a context would allow codec-level and storage-level per-chunk operations, something kerchunk is interested in, since it is often making a logical dataset out of inputs which are not homogeneous in nature.

Sharding specific?

Passing the part of a desired chunk needed to the loader means that only the required shards need be loaded. This would isolate the logic much more cleanly in the IO machinery than having the shard- and blosc-specific blocks of code in the higher levels, as we have now.

Does it need to touch zarr core?

It is needed in the implementation. It is not a spec issue; the contract between zarr and the storage classes (and indeed codecs) is, I think, only described in code.

Do the codecs need to be "aware" and if so, why?

In order for the codecs to be able to make use of the context (e.g., position-dependent parameters), it needs to be passed to them. I imagine this as an optional kwarg. All codecs should ideally have this in their encode/decode signatures, but currently existing ones would just ignore it. We can use inspect to determine whether the kwarg can be passed, for compatibility (a sketch of this follows below).
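A rough sketch of that compatibility shim: pass the context to a codec only if its decode() can accept it. The keyword name context is an assumption, not an agreed interface:

import inspect

def decode_with_optional_context(codec, buf, context):
    """Call codec.decode(buf, context=...) if supported, else codec.decode(buf)."""
    params = inspect.signature(codec.decode).parameters
    accepts_context = "context" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_context:
        return codec.decode(buf, context=context)
    return codec.decode(buf)  # existing codecs simply ignore the context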

Would/could @jstriebel's transformers in https://github.com/zarr-developers/zarr-python/pull/1096 provide a way to "capture" the multiple gets and serialize them into a single get? If so, does the transformer extension need any additional metadata or features?

I don't know. In some ways, the context is a simpler way to do partial gets/sets, but nothing here approaches a general transform. Zarr being the flexible and simple thing it is, it would be possible to achieve some of the things I am after (chunk-specific ops, irregular chunks, shards) in the storage layer alone and not touch the spec/impl for those. The exception is partial chunks, since the codec and storage have no way of knowing, when a key is requested, whether all of that key is needed or not.

martindurant avatar Jan 12 '23 20:01 martindurant

We need to get this moving. If the context approach is too generic/complex, I think we should go with the original approach and add the meta_array argument, which is used throughout the code already.

I am working on a new cuNumeric backend, which also requires this functionality.

madsbk avatar Mar 10 '23 14:03 madsbk

Resolved conflict.
I suggest we go with this approach for now and then unify get_partial_values() and getitems() later.

madsbk avatar Mar 13 '23 16:03 madsbk

@joshmoore & @jakirkham - any objections to merging this as-is and deferring the broader discussions for future work?

rabernat avatar Mar 13 '23 19:03 rabernat

Thanks for the ping, @rabernat. I have to say I'm liking the typed Context better. I will read through again ASAP.

joshmoore avatar Mar 15 '23 09:03 joshmoore

Notice the link to FITS reading: another example of per-chunk parameters that could be fixed by passing around a context, although it's not the only way to do it. I'd say that there are a number of efforts (some more and some less well developed) waiting to see what zarr will do.

martindurant avatar Mar 24 '23 17:03 martindurant

@joshmoore, how can I help progress here? Maybe an online meeting?

madsbk avatar Mar 30 '23 12:03 madsbk

Sorry, @madsbk. My last comment was shortly before leaving on several weeks of travel and I lost track of this. It wasn’t my intention to hold you up, and big :+1: for the ping. A few thoughts (from smallest to biggest), though I get the feeling that you are being ping-ponged back and forth between design issues:

  • If we're less than confident in this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else, and deprecation/renaming would be more straightforward.
  • Unlike some of the context that was going to get passed, I have the feeling that the meta_array is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects, all with the same meta_array instance, wrapped in a new Context object. As I started to think about "wildcard" keys in the context argument or an additional default_context argument, I realized that perhaps we've not gotten this abstraction right yet.
  • Finally, I've still not got a clear picture of what other fields on the Context object are going to help with the likes of partial reads. Obviously, that's less to do with your PR, but just to call back to the original discussions about whether or not we knew what the API should look like.

joshmoore avatar Apr 04 '23 16:04 joshmoore

If we're less than confident in this API, perhaps we should make this a keyword-only argument. That way the second position could potentially be used for something else, and deprecation/renaming would be more straightforward.

Agree, fixed.
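For illustration, "keyword-only" means the store signature now looks something like the following (a sketch, not the exact merged signature):

class SomeStore:
    def getitems(self, keys, *, contexts):
        # contexts must be passed by keyword, so the second positional slot
        # stays free for future use and renaming is easier to deprecate.
        ...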

Unlike some of the context that was going to get passed, I have the feeling that the meta_array is more likely to be used for all chunks equally. Does that sound right? If so, I'm a bit hesitant to force someone to create dictionaries of N objects, all with the same meta_array instance, wrapped in a new Context object. As I started to think about "wildcard" keys in the context argument or an additional default_context argument, I realized that perhaps we've not gotten this abstraction right yet.

I have added ConstantMap, which is a read-only map that maps all keys to the same constant value. This fixes the issue of having to create a Context for each key, without complicating the semantics of getitems().
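A minimal sketch of the ConstantMap idea, a read-only mapping that returns the same value for every key; the real implementation in this PR may differ in details:

from collections.abc import Mapping

class ConstantMap(Mapping):
    """Read-only mapping from a fixed set of keys to a single constant."""

    def __init__(self, keys, constant):
        self._keys = frozenset(keys)
        self._constant = constant

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return self._constant

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)

With something like this, zarr's core could pass ConstantMap(chunk_keys, Context(meta_array=...)) as the contexts argument without materializing a dictionary entry per key.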

Finally, I've still not got a clear picture of what other fields on the Context object are going to help with the likes of partial reads. Obviously, that's less to do with your PR, but just to call back to the original discussions about whether or not we knew what the API should look like.

I don't think any of us have a clear answer to this. Therefore, I suggest that we merge this PR and let people explore its possibilities. As long as Context is optional and we force Stores to implement a fallback if it doesn't exist, I think we are fine.

madsbk avatar Apr 11 '23 11:04 madsbk

Thanks all!

madsbk avatar Apr 14 '23 07:04 madsbk