zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

feature(group): add group setitem api

Open jhamman opened this issue 1 year ago • 11 comments

I think https://github.com/zarr-developers/zarr-python/issues/251 was fixed in 2.x at some point but it this PR also makes it work with v3.

closes https://github.com/zarr-developers/zarr-python/issues/251 closes https://github.com/zarr-developers/zarr-python/issues/508

TODO:

  • [ ] Add unit tests and/or doctests in docstrings
  • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] New/modified features documented in docs/tutorial.rst
  • [ ] Changes documented in docs/release.rst
  • [ ] GitHub Actions have all passed
  • [ ] Test coverage is 100% (Codecov passes)

jhamman avatar Oct 17 '24 05:10 jhamman

Are you aiming to get this merged before 3.0 @jhamman? And if not, what is the recommended way to do this with the current beta? My naive attempt in napari/napari#7215 (here to translate:

grp[key] = arr

is:

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)

(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

jni avatar Oct 17 '24 09:10 jni

putting aside the value of v2 compatibility, I'm not a fan of this API for a few reasons:

  • a zarr group is two mappings: its attributes and the members. Using __setitem__ for just one of those two mappings is arbitrary and unfair :judge:
  • using __setitem__ to create an array hides the full configuration options of array creation and can lead users to over-rely on zarr defaults
  • using __getitem__ to retrieve an array similarly offers no place to configure the runtime array-access properties like write_empty_chunks, caching, etc.

for the first point, we could scope all the hierarchy mutation stuff to a group.members attribute, e.g. group.members['array'] = array. but I would still want a way for users to configure the default array parameters, e.g. via our new config object.

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)

(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

@jni I agree that this is clunky, and it's silly to expose data and dtype. I think it would be better if we separated "create a brand new array" from "create an array like this one" into two different methods.

d-v-b avatar Oct 17 '24 09:10 d-v-b

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

As to defaults, I am a big fan of them, let's have more of those. (See also #2083 😉)

Either way, I am looking for guidance as to what I should do in napari for now, and also, that guidance should probably go into #2102. 🙏

jni avatar Oct 17 '24 09:10 jni

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

I'm not aware of the hierarchical API ecosystem; I'm coming from a more basic "how do we model the data" perspective, and from that POV, arbitrary symmetry breaking is almost always a mistake. For example, if you had to make a JSON serializable model of a zarr group, you have three options. (Recall that a zarr group is a dict of attributes and a dict of arrays / groups)

  • option 1:

    group: {"array_1": {... model of array_1}, "group_1": {...model of group_1}, "attrs": {... the attributes}
    

    This sucks because we can have name collisions between arrays / groups and our attributes. Namely, an array or group named attrs will break this model.

  • option 2:

    group: {"attr_key_1": {... attrs under attribute key 1}, "members": {...the arrays and groups}}`
    

    This sucks for the same reason as above. We are mixing our attribute namespace with our group model namespace.

  • option 3:

    group: {"attrs": {... the attributes}, "members": {...the arrays and groups}}`
    

    This separates the attributes and members namespaces. This is the only one of the three that's a usable JSON data model.

We could of course ignore this in our Group API and pretend that a group can be modeled as dict[str, Array | Group], but that's a decision motivated by either laziness or deference to bad decisions made in the past. The right motivation is fairness :) which should lead us to explicitly separate the members and the attributes IF we want a dict-like data model for a group.

d-v-b avatar Oct 17 '24 10:10 d-v-b

that's a decision motivated by either laziness or deference to bad decisions made in the past.

ok, fair. ;)

jni avatar Oct 17 '24 11:10 jni

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

jni avatar Oct 17 '24 11:10 jni

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

and I would argue it's a mistake in all three cases. FWIW I didn't think about this much until I did pydantic-zarr, i.e. I had to embed the zarr data model into JSON, at which point I realized these problems with the Zarr v2 API (and the APIs that it was inspired from). since then I can't unsee it :see_no_evil:

d-v-b avatar Oct 17 '24 11:10 d-v-b

My opinion here is that we should include this in 3.0 but consider the following options:

  • rename AsyncGroup.setitem to something more in line with where we want the GroupAPI to go
  • on the sync Group
  • turn Group.arrays into a property
  • use the arrays.setter as a fast path
  • deprecate Group.__setitem__ in favor of Group.arrays['a'] = arr

@d-v-b - how would this API feel to you?

jhamman avatar Oct 17 '24 12:10 jhamman

My opinion here is that we should include this in 3.0 but consider the following options:

* rename `AsyncGroup.setitem` to something more in line with where we want the GroupAPI to go

* on the sync `Group`

* turn `Group.arrays` into a property

* use the `arrays.setter` as a fast path

* deprecate `Group.__setitem__` in favor of `Group.arrays['a'] = arr`

@d-v-b - how would this API feel to you?

This is definitely an improvement, and I don't think I'm opposed to it. I'm most excited about the deprecation part :)

Should we put default array codec settings + chunking in the config, and use that for invocations like Group.arrays['foo'] = arr? And similarly for __getitem__, although I don't think we have as much configuration surface are there. That way at least this stuff is configurable.

d-v-b avatar Oct 17 '24 12:10 d-v-b

I just ran across this blast from the past: https://github.com/zarr-developers/zarr-python/issues/508

jhamman avatar Oct 17 '24 23:10 jhamman

This is ready for a review. I've got taken the least invasive route. I think we may want to deprecate this but I see no harm in keeping it now. We can debate its removal after the 3.0 release.

jhamman avatar Oct 23 '24 22:10 jhamman

@d-v-b - I'm planning to merge this in its current state. If you would like to come behind with a set of deprecations and/or features that add a separate namespace for Group.arrays, go for it!

jhamman avatar Oct 24 '24 22:10 jhamman