zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Make `create_array` signatures consistent

Open d-v-b opened this issue 10 months ago • 24 comments

This PR ensures that the various invocations of create_array are consistent. For reference, we have 4 ways to call create_array:

  • zarr.core.array.create_array (the actual function that does stuff)
  • zarr.api.synchronous.create_array (synchronous wrapper around the async function)
  • zarr.core.group.AsyncGroup.create_array (method on AsyncGroup class that invokes create_array)
  • zarr.core.group.Group.create_array (synchronous wrapper around the AsyncGroup method)

All of these functions should have consistent signatures, but in main they don't, and a big part of this is missing tests. So this PR adds some tests that check that certain pairs of functions have identical parameters (we can't force the return types to match, because the async functions will return coroutines). To make the tests pass, this PR also ensures that all of the invocations of create_array have parameters that are consistent.

I say "consistent" because, when invoking Group.create_array, that method does not take a store argument, because we have one already from the group instance. Also, Group.create_array was recently given an extra keyword argument (compressor) that we need to deprecate. So the test that compares zarr.core.array.create_array with zarr.core.group.AsyncGroup.create_array only checks that all of create_array parameters are present in the group method.

closes #2810

d-v-b avatar Feb 12 '25 16:02 d-v-b

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more defaulty.

d-v-b avatar Feb 12 '25 16:02 d-v-b

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused https://github.com/zarr-developers/zarr-python/issues/2792#issuecomment-2644362122 in zarr-python 2.

LDeakin avatar Feb 13 '25 21:02 LDeakin

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) in zarr-python 2.

That works for me. In main I think we have a mix of 0 and None floating around, I think I chose 0 for this PR because it seemed like the majority, but I'm happy to switch to None here

d-v-b avatar Feb 13 '25 21:02 d-v-b

with https://github.com/zarr-developers/zarr-python/pull/2819/commits/642272dcaf0ae313034046a494225ae4a87d6d31 the default fill value is None for the array creation funcs.

d-v-b avatar Feb 27 '25 10:02 d-v-b

Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

d-v-b avatar Feb 28 '25 09:02 d-v-b

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

dstansby avatar Feb 28 '25 09:02 dstansby

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

I'd hope that since the below PRs, the metadata of the Zarr V2 data with fill_values=None will have a null fill value but actual stored fill values will match the V3 default rather than being completely undefined.

  • https://github.com/zarr-developers/zarr-python/pull/2274
  • https://github.com/zarr-developers/zarr-python/pull/2799

LDeakin avatar Feb 28 '25 09:02 LDeakin

@LDeakin that was my impression as well. I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations.

d-v-b avatar Feb 28 '25 09:02 d-v-b

I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations

Indeed! To be honest, I think null fill values in Zarr V2 were a bit of a misstep anyway, mainly because undefined values in partially written chunks cannot be distinguished from real data. This could be a good opportunity for zarr-python to stop writing null fill values altogether in Zarr V2 metadata and just write the default fill value. What do you think?

LDeakin avatar Feb 28 '25 09:02 LDeakin

The nullable fill value is also clunky because zarr v2 supports the python Object dtype, and None is a valid instance of the Object dtype, which makes it impossible to distinguish "no fill value" from "the fill value is the literal None" for arrays with that dtype (or any other dtype that admits None as a value).

We would have to get a measure of the impact on users, but if it doesn't inconvenience a lot of people then I would definitely support dropping the creation of arrays with a null fill value, at least for numeric types

d-v-b avatar Feb 28 '25 09:02 d-v-b

that works for me! would you like to open an issue about the versioning policy, or should I?

d-v-b avatar Mar 04 '25 20:03 d-v-b

I can 👍

dstansby avatar Mar 04 '25 20:03 dstansby

it turns out that the default fill value of 0 is problematic for #2874, so I'd like to elevate the priority of this PR. Given our new versioning policy, do you think this is good to go after I sort out the merge conflicts @dstansby?

d-v-b avatar May 22 '25 08:05 d-v-b

Is changing the default value of the fill_value argument a breaking change? Either way that should be made clearer in the changelog entry - and if it is breaking, we need a release plan for 3.1 before merging this, so we don't accidentally end up in a place where stuff targeted for 3.1 ends up blocking bug fix releases for a long time while all the work for 3.1 is finished.

dstansby avatar May 22 '25 09:05 dstansby

ultimately I think whether these changes are breaking depends on whether you believe it's a bug that different definitions of the same function have different default values.

d-v-b avatar May 22 '25 09:05 d-v-b

Does changing the values affect any behaviour, or any metadata that gets written to zarr arrays or groups? If not, then I'd say this isn't breaking. If it does, then I'd say it is breaking.

dstansby avatar May 22 '25 09:05 dstansby

I think there are two questions:

  • Is the current situation (the same function with different default values) bugged? I believe the answer to this is "yes".
  • Is it possible to fix that bug without changing behavior of zarr-python? I believe the answer to this is "no".

Because this library produces data, we will inevitably have bugs that affect how data is produced. Therefore, fixing those bugs will change how data is produced. If the bug fix takes us from "writes invalid data" to "writes valid data", that's a clear win. But if the bug fix takes us from "writes valid data" to "writes slightly different valid data", then things are less clear cut. I think we are in this second scenario.

For simplicity lets say that these changes are breaking, and thus a good fit for a 3.1 release. it might be time to think about a 3.1 branch!

d-v-b avatar May 22 '25 10:05 d-v-b

I think my confusion/questions are stemming from the short explanation of what's changed in the changelog entry. I think the default argument would benefit from being being added to it's own breaking or bugfix entry (whichever is more appropriate), and expanding to explain how this changes what metadata is written to files, and because it's a different fill value, how that changes the data values returned from emtpy portions of an array with fill_value=None vs. fill_value=0.

dstansby avatar May 23 '25 08:05 dstansby

how that changes the data values returned from emtpy portions of an array with fill_value=None vs. fill_value=0.

there is no change here, zarr-python uses an array of zeros (or the zeros-equivalent, per data type) in either case.

d-v-b avatar May 23 '25 09:05 d-v-b

I still need to wrap my head around the fill value changes a bit - in the meantime, perhaps adding the new arguments could be done in a separate PR, since that's definitely backwards compatible and probably something we could put in a 3.0.x release?

dstansby avatar May 27 '25 11:05 dstansby

i would rather not split this PR into pieces.

d-v-b avatar May 27 '25 12:05 d-v-b

Ah, because that would ruin the automatic testing you added - gotcha. Will try and give this a proper re-review soon.

dstansby avatar May 28 '25 09:05 dstansby

Pinging @zarr-developers/python-core-devs - would be good to get a second review on this

dstansby avatar Jun 18 '25 08:06 dstansby

the test failure here is windows-specific and fixed in https://github.com/zarr-developers/zarr-python/pull/3151

d-v-b avatar Jun 19 '25 09:06 d-v-b

these function signature tests are gold. they caught a bunch of recently-added inconsistencies

d-v-b avatar Jul 09 '25 15:07 d-v-b

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 94.72%. Comparing base (378d5af) to head (48c8e3b). :warning: Report is 123 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2819      +/-   ##
==========================================
- Coverage   94.73%   94.72%   -0.02%     
==========================================
  Files          78       78              
  Lines        8646     8646              
==========================================
- Hits         8191     8190       -1     
- Misses        455      456       +1     
Files with missing lines Coverage Δ
src/zarr/api/asynchronous.py 86.49% <ø> (ø)
src/zarr/core/group.py 94.87% <ø> (ø)

... and 1 file with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jul 09 '25 15:07 codecov[bot]