xarray
Automatic duck array testing - reductions
This is the first of a series of PRs to add a framework to make testing the integration of duck arrays as simple as possible. It uses `hypothesis` for increased coverage and maintainability.
- [x] Tests added
- [x] Passes `pre-commit run --all-files`
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`
@shoyer: I asked about this in pytest-dev/pytest#8450, but it seems we are on our own here. The recommendation was to use inheritance, but that currently breaks `parametrize` (see the `pad` tests in `test_variable`).
I wonder if using `hypothesis` would make this easier? It would probably require a bit more reading because I don't yet fully understand how to choose the properties, but dynamic parametrization would not be necessary: we would pass strategies instead of creation functions and potentially find more bugs, at the cost of longer test runs.
@Zac-HD, would you have any advice for something like this?
Edit: this is an initial draft using `hypothesis`
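To make the "pass strategies instead of creation functions" idea concrete, here is a minimal sketch of the contrast (the test names and the doubling property are illustrative, not from the PR):

```python
import pytest
from hypothesis import given, strategies as st


# parametrize: a fixed, hand-picked set of cases
@pytest.mark.parametrize("n", [0, 1, 5])
def test_double_parametrized(n):
    assert 2 * n == n + n


# hypothesis: the "creation function" becomes a strategy, and the
# framework searches the input space for counterexamples itself
@given(st.integers())
def test_double_property(n):
    assert 2 * n == n + n
```

With the strategy-based version, downstream libraries would only need to supply a strategy for their array type rather than wiring up dynamic parametrization.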
Looking at https://github.com/keewis/xarray/compare/duckarray-tests...duckarray-tests-hypothesis, for high-level feedback:
- Overall it looks pretty good; ping me again if/when it's a PR and I'll do line-level feedback on idiom issues.
- A more general test would generate the shapes, and the axes to reduce over; reducing a 1D array over the first dimension is going to miss things.
- You use `@st.composite` when the `.map()` method and a lambda would suffice (though the perf gain is small enough that this is mostly a readability issue).
- I don't see the point of `Label`, and we advise against mixing "a strategy or a value". We break this rule a few times for backwards-compatibility in our NumPy support, but wouldn't write such an API these days.

And I'm always delighted to see people using Hypothesis to test libraries that I use and love 🥰🤩
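The second and third points above can be sketched as follows (a hedged illustration, not the PR's code; the test name, bounds, and dimension names are made up):

```python
import numpy as np
from hypothesis import given, strategies as st
import hypothesis.extra.numpy as npst

# a trivial @st.composite can often become .map() with a lambda instead
dims = st.lists(st.sampled_from(["x", "y", "z"]), min_size=1, unique=True).map(tuple)


# generate the shape *and* the axis to reduce over, instead of always
# reducing a 1D array over the first dimension
@given(data=st.data())
def test_mean_reduces_any_axis(data):
    shape = data.draw(npst.array_shapes(min_dims=1, max_dims=4, max_side=5))
    axis = data.draw(st.integers(min_value=0, max_value=len(shape) - 1))
    arr = np.ones(shape)
    assert arr.mean(axis=axis).ndim == arr.ndim - 1
```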
For more context, the main idea is to provide a generic and cheap way to check the compatibility of any duck array (including nested duck arrays) with `xarray`'s interface. I settled on having a function construct a test class based on callbacks and marks, but I'm not sure this is the best way (it is the best I can come up with, however).
I'm considering `hypothesis` (hence the separate branch) because I was thinking that it might be easier to implement the whole "have a function accepting some callbacks and marks generate the tests" idea if, instead of possibly parametrized callbacks, I used `hypothesis` strategies. There are still some potential issues left, though; most importantly, I'm trying to find a generic way to compute the expected values or expected failures.
Thanks for hints 2 and 3; I'm still not quite used to the concepts of `hypothesis`. As for 4, `xarray` has parameters which are intended for use in label-based indexing (e.g. with `sel`), so I'm trying to find a way to mark those to be able to tell them apart from other types of parameters (like position-based indexers).
@shoyer, I finally found `pandas`' extension array test suite again, which divides the tests into groups and provides base classes for each of them. It then uses fixtures to parametrize those tests, but I think we can use `hypothesis` strategies instead.
This will most probably be a little easier to implement and maintain, and allow much more control, although `parametrize` marks are still stripped so the tests will be a bit more verbose.
ping @shoyer, I'm mostly done with cleaning up the code.
@Zac-HD, could I ask for another review? I think I applied most of your suggestions but I'm sure I missed something.
> is there a way to separate `sizes` into `dims` and `shape` without using `draw`? I tried `dims, shape = sizes.map(lambda s: tuple(zip(*s)))`, but since the mapped strategy is not iterable the final unpacking fails.

You've just got to use `draw` for this.
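For illustration, here is how that split can be done with `draw` inside `@st.composite` (the strategy name, dimension names, and bounds are hypothetical, not from the PR):

```python
from hypothesis import strategies as st


@st.composite
def dims_and_shape(draw):
    # sizes: a list of (dim_name, length) pairs; draw once, then unpack
    sizes = draw(
        st.lists(
            st.tuples(st.sampled_from(["x", "y", "z"]), st.integers(1, 5)),
            min_size=1,
            max_size=3,
            unique_by=lambda pair: pair[0],
        )
    )
    dims, shape = tuple(zip(*sizes))
    return dims, shape
```

Since `draw` materializes a concrete value, the tuple unpacking that fails on the mapped strategy works here.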
> I'd prefer the verbose repr of `st.builds()` when debugging strategies, but if I know the strategy is correct and understand what it generates I'd probably prefer the smaller repr of `@st.composite`.
That's a good way of thinking about it :slightly_smiling_face:
Unit Test Results
6 files  6 suites  58m 27s :stopwatch:
16 289 tests  14 528 :heavy_check_mark:  1 753 :zzz:  8 :x:
90 930 runs  82 604 :heavy_check_mark:  8 284 :zzz:  42 :x:
For more details on these failures, see this check.
Results for commit 1d98fec1.
:recycle: This comment has been updated with latest results.
The `pint` tests pass so all that's left is to figure out how to fix the `sparse` tests.
`sparse` seems to have different `dtype` casting behavior (it casts e.g. a mean of `float16` to `float64`, while `numpy` stays at `float16`), so I'll need to figure out how to work around that. There's also a `TypeError`, which I'll have to investigate.
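For reference, the numpy side of this casting behavior can be checked directly (only the numpy half is shown here, since `sparse` is not imported):

```python
import numpy as np

# numpy preserves the input dtype when reducing a float16 array;
# per the comment above, sparse would instead promote the mean to float64
x = np.ones(4, dtype=np.float16)
assert x.mean().dtype == np.float16
assert x.sum().dtype == np.float16
```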
> The pint tests pass so all that's left is to figure out how to fix the sparse tests.

Can we xfail these new tests and merge for now? As long as `tests/test_sparse.py` still passes, there are no issues, correct?
> Can we xfail these new tests and merge for now? As long as tests/test_sparse.py still passes, there are no issues, correct?

Yeah, we now have another array type to consider testing, so I'm also in favour of merging now with passing tests for pint, and un-xfailing tests for other array types (i.e. sparse) in a later PR.
> sparse seems to have different dtype casting behavior (it casts e.g. a mean of float16 to float64, while numpy stays at float16), so I'll need to figure out how to work around that.

Sparse doesn't support `float16`.
~~The remaining local failure is https://github.com/pydata/xarray/issues/6817 (perhaps we should run numpy tests too ;) )~~
Failure is https://github.com/pydata/sparse/issues/553
The remaining failures for `pint` are #6822.
I also wonder whether we should have a separate job for `hypothesis`: this makes CI run quite a bit longer (about 70s on my machine just for the `sparse` and `pint` reduce tests, which are a small part of the API).
> I also wonder whether we should have a separate job for hypothesis: this makes CI run quite a bit longer (about 70s on my machine just for sparse and pint reduce tests, which are a small part of the API)
Sounds good to me!
xref pydata/sparse#555
I'm watching the progress of this PR with bated breath! I literally want to be able to test 3 different array libraries right now: pint, cubed, and awkward. :exploding_head:
Thinking about the complexity of testing like this in general I have a bunch of follow-up questions for future PRs though, e.g:
- Shouldn't we start with some very simple tests that first check if the correct properties are defined on the wrapped array class, i.e. `shape`, `dtype` etc.?
- What's the easiest way to expose a framework of tests like this to the user, so they can import them and run them on any duck array type?
- Docs on how to use these public tests?
- Can we possibly allow the user to run tests on data they created? I'm thinking in particular of creating ragged awkward arrays and then testing reduce etc. on them.
Should I therefore make a separate issue to specifically track how we test (and expose test frameworks for) duck array wrapping? (So we can discuss these questions there?)
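The first question above (checking that the expected properties exist at all) could start as simply as this structural check (an illustrative sketch; the function name is made up, and xarray's real duck-array detection also looks at protocols like `__array_function__`):

```python
import numpy as np


def looks_like_duck_array(obj):
    # minimal structural check for the attributes xarray relies on
    return all(hasattr(obj, attr) for attr in ("shape", "dtype", "ndim"))
```

For example, a numpy array passes this check while a plain list does not.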
I started with `reduce` because that seemed to be the easiest to check, but I guess I was wrong? In any case, checking that the wrapped data is indeed of the expected type should be very easy to add (not in this PR, though; that has stalled long enough).
The idea is to do something like `pandas`'s `ExtensionArray` test suite, which exposes a set of base classes that can be inherited from. So to test duck array support in a downstream library, you'd inherit from the appropriate classes and override `create` (which returns a strategy for the tested array) and `check_*` (which check properties of the expected result; I think that's what you're asking for in 4).
It definitely doesn't feel very polished at the moment, but hopefully we can figure out a way to fix that (and then we can also hopefully figure out a good way to document this).
Edit: also, let's discuss the general plans in a new issue
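That pandas-style layout could look roughly like this (a hedged sketch: the class names, and treating `create`/`check_reduce` as static methods, are assumptions, not the PR's actual API):

```python
import numpy as np
from hypothesis import given, strategies as st


class ReduceTests:
    @staticmethod
    def create():
        # subclasses return a strategy producing the tested array type
        raise NotImplementedError

    @staticmethod
    def check_reduce(actual, expected):
        # subclasses can override to account for e.g. dtype-casting quirks
        assert actual == expected

    @given(st.data())
    def test_sum(self, data):
        arr = data.draw(self.create())
        self.check_reduce(np.asarray(arr).sum(), sum(arr))


# a downstream library would only override `create` (here: plain lists)
class TestListSum(ReduceTests):
    @staticmethod
    def create():
        return st.lists(st.integers(-5, 5), min_size=1)
```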
See https://github.com/pydata/xarray/issues/6894 for discussion of the general plans.
Q: Shouldn't the base classes live in `xarray.testing` rather than `xarray.tests`?
Then `test_units.py` and `test_sparse.py` would live in `xarray.tests` (until they eventually get moved to another repo), but import the base classes from `xarray.testing`?
> Q: Shouldn't the base classes live in xarray.testing rather than xarray.tests?
Another Q on a similar note: Are we planning to eventually publicly expose the (awesome btw) strategies that you've built here @keewis ? They could be very useful for testing other parts of xarray.
We could also make this PR much more incremental by splitting it into 2, or even 3 separate PRs:
- strategies, to live somewhere like `xarray.testing.strategies` and eventually be made public
- duck array base classes to inherit from, to live somewhere like `xarray.testing.duckarrays`
- specific tests for pint/sparse, to live in our own test suite for now but moved out eventually.
The advantage of that would be that (1) & (2) can move forwards without requiring all the tests in (3) to pass.
I also agree that both the strategies as well as the base classes should eventually live in xarray.testing, but I'd probably keep them in xarray.tests for now (just so we can experiment a bit more).
So I may have overexcitedly jumped the gun and made a PR to move strategies to `xarray.testing.strategies` last night :sweat_smile: (#6908). We might just want to wait to merge this before merging that, though.
Note, however, that it's probably best not to separate 2 and 3 because there might be issues with the base class design we don't notice without trying to actually use them.
Good point.
> We might just want to wait to merge this before merging that though anyway.
I actually think it should be the other way around: if we can get the strategies from #6908 to shrink well, we might be able to fix the occasional test timeouts here (which should be one of the final issues we have before we can merge this).
> if we can get the strategies from https://github.com/pydata/xarray/pull/6908 to shrink well
I think they already do shrink well. Each of them has corresponding unit tests, and none of those tests fail due to hypothesis timeouts.