Make `create_array` signatures consistent
This PR ensures that the various invocations of create_array are consistent. For reference, we have 4 ways to call create_array:
zarr.core.array.create_array(the actual function that does stuff)zarr.api.synchronous.create_array(synchronous wrapper around the async function)zarr.core.group.AsyncGroup.create_array(method onAsyncGroupclass that invokescreate_array)zarr.core.group.Group.create_array(synchronous wrapper around the AsyncGroup method)
All of these functions should have consistent signatures, but in main they don't, and a big part of this is missing tests. So this PR adds some tests that check that certain pairs of functions have identical parameters (we can't force the return types to match, because the async functions will return coroutines). To make the tests pass, this PR also ensures that all of the invocations of create_array have parameters that are consistent.
I say "consistent" because, when invoking Group.create_array, that method does not take a store argument, because we have one already from the group instance. Also, Group.create_array was recently given an extra keyword argument (compressor) that we need to deprecate. So the test that compares zarr.core.array.create_array with zarr.core.group.AsyncGroup.create_array only checks that all of create_array parameters are present in the group method.
closes #2810
worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more defaulty.
worth noting that this changes some defaults (we were using a default fill_value of
Nonein some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.
fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused https://github.com/zarr-developers/zarr-python/issues/2792#issuecomment-2644362122 in zarr-python 2.
worth noting that this changes some defaults (we were using a default fill_value of
Nonein some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.
fill_value = Noneseems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type.0needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) inzarr-python2.
That works for me. In main I think we have a mix of 0 and None floating around, I think I chose 0 for this PR because it seemed like the majority, but I'm happy to switch to None here
with https://github.com/zarr-developers/zarr-python/pull/2819/commits/642272dcaf0ae313034046a494225ae4a87d6d31 the default fill value is None for the array creation funcs.
Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)
how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0
I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)
how did you check this? because for zarr v2 data, it should make a difference --
Noneis a valid v2 fill value, and it's not the same as 0
I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)
I'd hope that since the below PRs, the metadata of the Zarr V2 data with fill_values=None will have a null fill value but actual stored fill values will match the V3 default rather than being completely undefined.
- https://github.com/zarr-developers/zarr-python/pull/2274
- https://github.com/zarr-developers/zarr-python/pull/2799
@LDeakin that was my impression as well. I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations.
I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations
Indeed! To be honest, I think null fill values in Zarr V2 were a bit of a misstep anyway, mainly because undefined values in partially written chunks cannot be distinguished from real data. This could be a good opportunity for zarr-python to stop writing null fill values altogether in Zarr V2 metadata and just write the default fill value. What do you think?
The nullable fill value is also clunky because zarr v2 supports the python Object dtype, and None is a valid instance of the Object dtype, which makes it impossible to distinguish "no fill value" from "the fill value is the literal None" for arrays with that dtype (or any other dtype that admits None as a value).
We would have to get a measure of the impact on users, but if it doesn't inconvenience a lot of people then I would definitely support dropping the creation of arrays with a null fill value, at least for numeric types
that works for me! would you like to open an issue about the versioning policy, or should I?
I can 👍
it turns out that the default fill value of 0 is problematic for #2874, so I'd like to elevate the priority of this PR. Given our new versioning policy, do you think this is good to go after I sort out the merge conflicts @dstansby?
Is changing the default value of the fill_value argument a breaking change? Either way that should be made clearer in the changelog entry - and if it is breaking, we need a release plan for 3.1 before merging this, so we don't accidentally end up in a place where stuff targeted for 3.1 ends up blocking bug fix releases for a long time while all the work for 3.1 is finished.
ultimately I think whether these changes are breaking depends on whether you believe it's a bug that different definitions of the same function have different default values.
Does changing the values affect any behaviour, or any metadata that gets written to zarr arrays or groups? If not, then I'd say this isn't breaking. If it does, then I'd say it is breaking.
I think there are two questions:
- Is the current situation (the same function with different default values) bugged? I believe the answer to this is "yes".
- Is it possible to fix that bug without changing behavior of zarr-python? I believe the answer to this is "no".
Because this library produces data, we will inevitably have bugs that affect how data is produced. Therefore, fixing those bugs will change how data is produced. If the bug fix takes us from "writes invalid data" to "writes valid data", that's a clear win. But if the bug fix takes us from "writes valid data" to "writes slightly different valid data", then things are less clear cut. I think we are in this second scenario.
For simplicity lets say that these changes are breaking, and thus a good fit for a 3.1 release. it might be time to think about a 3.1 branch!
I think my confusion/questions are stemming from the short explanation of what's changed in the changelog entry. I think the default argument would benefit from being being added to it's own breaking or bugfix entry (whichever is more appropriate), and expanding to explain how this changes what metadata is written to files, and because it's a different fill value, how that changes the data values returned from emtpy portions of an array with fill_value=None vs. fill_value=0.
how that changes the data values returned from emtpy portions of an array with
fill_value=Nonevs.fill_value=0.
there is no change here, zarr-python uses an array of zeros (or the zeros-equivalent, per data type) in either case.
I still need to wrap my head around the fill value changes a bit - in the meantime, perhaps adding the new arguments could be done in a separate PR, since that's definitely backwards compatible and probably something we could put in a 3.0.x release?
i would rather not split this PR into pieces.
Ah, because that would ruin the automatic testing you added - gotcha. Will try and give this a proper re-review soon.
Pinging @zarr-developers/python-core-devs - would be good to get a second review on this
the test failure here is windows-specific and fixed in https://github.com/zarr-developers/zarr-python/pull/3151
these function signature tests are gold. they caught a bunch of recently-added inconsistencies
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 94.72%. Comparing base (378d5af) to head (48c8e3b).
:warning: Report is 123 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2819 +/- ##
==========================================
- Coverage 94.73% 94.72% -0.02%
==========================================
Files 78 78
Lines 8646 8646
==========================================
- Hits 8191 8190 -1
- Misses 455 456 +1
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/zarr/api/asynchronous.py | 86.49% <ø> (ø) |
|
| src/zarr/core/group.py | 94.87% <ø> (ø) |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.