feature(group): add group setitem api
I think https://github.com/zarr-developers/zarr-python/issues/251 was fixed in 2.x at some point but it this PR also makes it work with v3.
closes https://github.com/zarr-developers/zarr-python/issues/251 closes https://github.com/zarr-developers/zarr-python/issues/508
TODO:
- [ ] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] GitHub Actions have all passed
- [ ] Test coverage is 100% (Codecov passes)
Are you aiming to get this merged before 3.0 @jhamman? And if not, what is the recommended way to do this with the current beta? My naive attempt in napari/napari#7215 (here to translate:
grp[key] = arr
is:
grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)
(which is a little clunky, maybe shape and dtype should be optional if data is provided?)
putting aside the value of v2 compatibility, I'm not a fan of this API for a few reasons:
- a zarr group is two mappings: its attributes and the members. Using
__setitem__for just one of those two mappings is arbitrary and unfair :judge: - using
__setitem__to create an array hides the full configuration options of array creation and can lead users to over-rely on zarr defaults - using
__getitem__to retrieve an array similarly offers no place to configure the runtime array-access properties likewrite_empty_chunks, caching, etc.
for the first point, we could scope all the hierarchy mutation stuff to a group.members attribute, e.g. group.members['array'] = array. but I would still want a way for users to configure the default array parameters, e.g. via our new config object.
grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)(which is a little clunky, maybe shape and dtype should be optional if data is provided?)
@jni I agree that this is clunky, and it's silly to expose data and dtype. I think it would be better if we separated "create a brand new array" from "create an array like this one" into two different methods.
Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.
As to defaults, I am a big fan of them, let's have more of those. (See also #2083 😉)
Either way, I am looking for guidance as to what I should do in napari for now, and also, that guidance should probably go into #2102. 🙏
Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.
I'm not aware of the hierarchical API ecosystem; I'm coming from a more basic "how do we model the data" perspective, and from that POV, arbitrary symmetry breaking is almost always a mistake. For example, if you had to make a JSON serializable model of a zarr group, you have three options. (Recall that a zarr group is a dict of attributes and a dict of arrays / groups)
-
option 1:
group: {"array_1": {... model of array_1}, "group_1": {...model of group_1}, "attrs": {... the attributes}This sucks because we can have name collisions between arrays / groups and our attributes. Namely, an array or group named
attrswill break this model. -
option 2:
group: {"attr_key_1": {... attrs under attribute key 1}, "members": {...the arrays and groups}}`This sucks for the same reason as above. We are mixing our attribute namespace with our group model namespace.
-
option 3:
group: {"attrs": {... the attributes}, "members": {...the arrays and groups}}`This separates the attributes and members namespaces. This is the only one of the three that's a usable JSON data model.
We could of course ignore this in our Group API and pretend that a group can be modeled as dict[str, Array | Group], but that's a decision motivated by either laziness or deference to bad decisions made in the past. The right motivation is fairness :) which should lead us to explicitly separate the members and the attributes IF we want a dict-like data model for a group.
that's a decision motivated by either laziness or deference to bad decisions made in the past.
ok, fair. ;)
(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)
(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)
and I would argue it's a mistake in all three cases. FWIW I didn't think about this much until I did pydantic-zarr, i.e. I had to embed the zarr data model into JSON, at which point I realized these problems with the Zarr v2 API (and the APIs that it was inspired from). since then I can't unsee it :see_no_evil:
My opinion here is that we should include this in 3.0 but consider the following options:
- rename
AsyncGroup.setitemto something more in line with where we want the GroupAPI to go - on the sync
Group - turn
Group.arraysinto a property - use the
arrays.setteras a fast path - deprecate
Group.__setitem__in favor ofGroup.arrays['a'] = arr
@d-v-b - how would this API feel to you?
My opinion here is that we should include this in 3.0 but consider the following options:
* rename `AsyncGroup.setitem` to something more in line with where we want the GroupAPI to go * on the sync `Group` * turn `Group.arrays` into a property * use the `arrays.setter` as a fast path * deprecate `Group.__setitem__` in favor of `Group.arrays['a'] = arr`@d-v-b - how would this API feel to you?
This is definitely an improvement, and I don't think I'm opposed to it. I'm most excited about the deprecation part :)
Should we put default array codec settings + chunking in the config, and use that for invocations like Group.arrays['foo'] = arr? And similarly for __getitem__, although I don't think we have as much configuration surface are there. That way at least this stuff is configurable.
I just ran across this blast from the past: https://github.com/zarr-developers/zarr-python/issues/508
This is ready for a review. I've got taken the least invasive route. I think we may want to deprecate this but I see no harm in keeping it now. We can debate its removal after the 3.0 release.
@d-v-b - I'm planning to merge this in its current state. If you would like to come behind with a set of deprecations and/or features that add a separate namespace for Group.arrays, go for it!