xarray
Automatic duck array testing - reductions
This is the first of a series of PRs to add a framework to make testing the integration of duck arrays as simple as possible. It uses `hypothesis` for increased coverage and maintainability.
- [x] Tests added
- [x] Passes `pre-commit run --all-files`
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`
@shoyer: I asked about this in pytest-dev/pytest#8450, but it seems we are on our own here. The recommendation was to use inheritance, but that currently breaks `parametrize` (see the `pad` tests in `test_variable`).
I wonder if using `hypothesis` would make this easier? It would probably require a bit more reading because I don't yet fully understand how to choose the properties, but dynamic parametrization would not be necessary: we would pass strategies instead of creation functions and potentially find more bugs, at the cost of longer test runs.
@Zac-HD, would you have any advice for something like this?
Edit: this is an initial draft using `hypothesis`
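To make the "pass strategies instead of creation functions" idea concrete, here is a minimal sketch of the contrast (the test names and the doubling property are illustrative, not from the PR):

```python
import pytest
from hypothesis import given, strategies as st


# parametrize: a fixed, hand-picked set of cases
@pytest.mark.parametrize("n", [0, 1, 5])
def test_double_parametrized(n):
    assert 2 * n == n + n


# hypothesis: the "creation function" becomes a strategy, and the
# framework searches the input space for counterexamples itself
@given(st.integers())
def test_double_property(n):
    assert 2 * n == n + n
```

With the strategy-based version, downstream libraries would only need to supply a strategy for their array type rather than wiring up dynamic parametrization.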
Looking at https://github.com/keewis/xarray/compare/duckarray-tests...duckarray-tests-hypothesis, for high-level feedback:
- Overall it looks pretty good; ping me again if/when it's a PR and I'll do line-level feedback on idiom issues.
- A more general test would generate the shapes, and the axes to reduce over; reducing a 1D array over the first dimension is going to miss things.
- You use `@st.composite` when the `.map()` method and a lambda would suffice (though the perf gain is small enough that this is mostly a readability issue).
- I don't see the point of `Label`, and we advise against mixing "a strategy or a value". We break this rule a few times for backwards-compatibility in our NumPy support, but wouldn't write such an API these days.

And I'm always delighted to see people using Hypothesis to test libraries that I use and love 🥰🤩
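The second and third points above can be sketched as follows (a hedged illustration, not the PR's code; the test name, bounds, and dimension names are made up):

```python
import numpy as np
from hypothesis import given, strategies as st
import hypothesis.extra.numpy as npst

# a trivial @st.composite can often become .map() with a lambda instead
dims = st.lists(st.sampled_from(["x", "y", "z"]), min_size=1, unique=True).map(tuple)


# generate the shape *and* the axis to reduce over, instead of always
# reducing a 1D array over the first dimension
@given(data=st.data())
def test_mean_reduces_any_axis(data):
    shape = data.draw(npst.array_shapes(min_dims=1, max_dims=4, max_side=5))
    axis = data.draw(st.integers(min_value=0, max_value=len(shape) - 1))
    arr = np.ones(shape)
    assert arr.mean(axis=axis).ndim == arr.ndim - 1
```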
For more context, the main idea is to provide a generic and cheap way to check the compatibility of any duck array (including nested duck arrays) with `xarray`'s interface. I settled on having a function construct a test class based on callbacks and marks, but I'm not sure this is the best way (it is the best I can come up with, however).
I'm considering `hypothesis` (hence the separate branch) because I was thinking that it might be easier to implement the whole "have a function accepting some callbacks and marks generate the tests" idea if, instead of possibly parametrized callbacks, I used `hypothesis` strategies. There are still some potential issues left, though; most importantly, I'm trying to find a generic way to compute the expected values or expected failures.
Thanks for hints 2 and 3; I'm still not quite used to the concepts of `hypothesis`. As for 4, `xarray` has parameters which are intended for use in label-based indexing (e.g. with `sel`), so I'm trying to find a way to mark those to be able to tell them apart from other types of parameters (like position-based indexers).
@shoyer, I finally found `pandas`' extension array test suite again, which divides the tests into groups and provides base classes for each of them. It then uses fixtures to parametrize those tests, but I think we can use `hypothesis` strategies instead.
This will most probably be a little easier to implement and maintain, and allow much more control, although `parametrize` marks are still stripped so the tests will be a bit more verbose.
ping @shoyer, I'm mostly done with cleaning up the code.
@Zac-HD, could I ask for another review? I think I applied most of your suggestions but I'm sure I missed something.
> is there a way to separate `sizes` into `dims` and `shape` without using `draw`? I tried `dims, shape = sizes.map(lambda s: tuple(zip(*s)))`, but since the mapped strategy is not iterable the final unpacking fails.

You've just got to use `draw` for this.
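For illustration, here is how that split can be done with `draw` inside `@st.composite` (the strategy name, dimension names, and bounds are hypothetical, not from the PR):

```python
from hypothesis import strategies as st


@st.composite
def dims_and_shape(draw):
    # sizes: a list of (dim_name, length) pairs; draw once, then unpack
    sizes = draw(
        st.lists(
            st.tuples(st.sampled_from(["x", "y", "z"]), st.integers(1, 5)),
            min_size=1,
            max_size=3,
            unique_by=lambda pair: pair[0],
        )
    )
    dims, shape = tuple(zip(*sizes))
    return dims, shape
```

Since `draw` materializes a concrete value, the tuple unpacking that fails on the mapped strategy works here.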
> I'd prefer the verbose repr of `st.builds()` when debugging strategies, but if I know the strategy is correct and understand what it generates I'd probably prefer the smaller repr of `@st.composite`.
That's a good way of thinking about it :slightly_smiling_face:
Unit Test Results
6 files  6 suites  58m 27s :stopwatch:
16 289 tests  14 528 :heavy_check_mark:  1 753 :zzz:  8 :x:
90 930 runs  82 604 :heavy_check_mark:  8 284 :zzz:  42 :x:
For more details on these failures, see this check.
Results for commit 1d98fec1.
:recycle: This comment has been updated with latest results.
The `pint` tests pass so all that's left is to figure out how to fix the `sparse` tests.
`sparse` seems to have different `dtype` casting behavior (it casts e.g. a mean of `float16` to `float64`, while `numpy` stays at `float16`), so I'll need to figure out how to work around that. There's also a `TypeError`, which I'll have to investigate.
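For reference, the numpy side of this casting behavior can be checked directly (only the numpy half is shown here, since `sparse` is not imported):

```python
import numpy as np

# numpy preserves the input dtype when reducing a float16 array;
# per the comment above, sparse would instead promote the mean to float64
x = np.ones(4, dtype=np.float16)
assert x.mean().dtype == np.float16
assert x.sum().dtype == np.float16
```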
> The pint tests pass so all that's left is to figure out how to fix the sparse tests.

Can we xfail these new tests and merge for now? As long as `tests/test_sparse.py` still passes, there are no issues, correct?
> Can we xfail these new tests and merge for now? As long as tests/test_sparse.py still passes, there are no issues, correct?

Yeah, we now have another array type to consider testing, so I'm also in favour of merging now with passing tests for pint, and un-xfailing tests for other array types (i.e. sparse) in a later PR.
> sparse seems to have different dtype casting behavior (it casts e.g. a mean of float16 to float64, while numpy stays at float16), so I'll need to figure out how to work around that.

Sparse doesn't support `float16`.
~~The remaining local failure is https://github.com/pydata/xarray/issues/6817 (perhaps we should run numpy tests too ;) )~~
Failure is https://github.com/pydata/sparse/issues/553
The remaining failures for `pint` are #6822.
I also wonder whether we should have a separate job for `hypothesis`: this makes CI run quite a bit longer (about 70s on my machine just for the `sparse` and `pint` reduce tests, which are a small part of the API).
> I also wonder whether we should have a separate job for hypothesis: this makes CI run quite a bit longer (about 70s on my machine just for sparse and pint reduce tests, which are a small part of the API)
Sounds good to me!
xref pydata/sparse#555
I'm watching the progress of this PR with bated breath! I literally want to be able to test 3 different array libraries right now: pint, cubed, and awkward. :exploding_head:
Thinking about the complexity of testing like this in general I have a bunch of follow-up questions for future PRs though, e.g:
- Shouldn't we start with some very simple tests that first check if the correct properties are defined on the wrapped array class, i.e. `shape`, `dtype` etc.?
- What's the easiest way to expose a framework of tests like this to the user, so they can import them and run them on any duck array type?
- Docs on how to use these public tests?
- Can we possibly allow the user to run tests on data they created? I'm thinking in particular of creating ragged awkward arrays and then testing reduce etc. on them.
Should I therefore make a separate issue to specifically track how we test (and expose test frameworks for) duck array wrapping? (So we can discuss these questions there?)
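The first question above (checking that the expected properties exist at all) could start as simply as this structural check (an illustrative sketch; the function name is made up, and xarray's real duck-array detection also looks at protocols like `__array_function__`):

```python
import numpy as np


def looks_like_duck_array(obj):
    # minimal structural check for the attributes xarray relies on
    return all(hasattr(obj, attr) for attr in ("shape", "dtype", "ndim"))
```

For example, a numpy array passes this check while a plain list does not.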
I started with `reduce` because that seemed to be the easiest to check, but I guess I was wrong? In any case, checking that the wrapped data is indeed of the expected type should be very easy to add (not in this PR, though; that has stalled long enough).
The idea is to do something like `pandas`'s `ExtensionArray` test suite, which exposes a set of base classes that can be inherited from. So to test duck array support in a downstream library, you'd inherit from the appropriate classes and override `create` (which returns a strategy for the tested array) and `check_*` (which check properties of the expected result; I think that's what you're asking for in 4).
It definitely doesn't feel very polished at the moment, but hopefully we can figure out a way to fix that (and then we can also hopefully figure out a good way to document this).
Edit: also, let's discuss the general plans in a new issue
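That pandas-style layout could look roughly like this (a hedged sketch: the class names, and treating `create`/`check_reduce` as static methods, are assumptions, not the PR's actual API):

```python
import numpy as np
from hypothesis import given, strategies as st


class ReduceTests:
    @staticmethod
    def create():
        # subclasses return a strategy producing the tested array type
        raise NotImplementedError

    @staticmethod
    def check_reduce(actual, expected):
        # subclasses can override to account for e.g. dtype-casting quirks
        assert actual == expected

    @given(st.data())
    def test_sum(self, data):
        arr = data.draw(self.create())
        self.check_reduce(np.asarray(arr).sum(), sum(arr))


# a downstream library would only override `create` (here: plain lists)
class TestListSum(ReduceTests):
    @staticmethod
    def create():
        return st.lists(st.integers(-5, 5), min_size=1)
```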
See https://github.com/pydata/xarray/issues/6894 for discussion of the general plans.
Q: Shouldn't the base classes live in `xarray.testing` rather than `xarray.tests`?
Then `test_units.py` and `test_sparse.py` would live in `xarray.tests` (until they eventually get moved to another repo), but import the base classes from `xarray.testing`?
> Q: Shouldn't the base classes live in xarray.testing rather than xarray.tests?
Another Q on a similar note: Are we planning to eventually publicly expose the (awesome btw) strategies that you've built here @keewis ? They could be very useful for testing other parts of xarray.
We could also make this PR much more incremental by splitting it into 2, or even 3 separate PRs:
- strategies, to live somewhere like `xarray.testing.strategies` and eventually be made public
- duck array base classes to inherit from, to live somewhere like `xarray.testing.duckarrays`
- specific tests for pint/sparse, to live in our own test suite for now but moved out eventually.
The advantage of that would be that (1) & (2) can move forwards without requiring all the tests in (3) to pass.
I also agree that both the strategies as well as the base classes should eventually live in xarray.testing, but I'd probably keep them in xarray.tests for now (just so we can experiment a bit more).
So I may have overexcitedly jumped the gun and made a PR to move strategies to `xarray.testing.strategies` last night :sweat_smile: (#6908). We might just want to wait to merge this before merging that, though.
Note, however, that it's probably best not to separate 2 and 3 because there might be issues with the base class design we don't notice without trying to actually use them.
Good point.
> We might just want to wait to merge this before merging that though anyway.
I actually think it should be the other way around: if we can get the strategies from #6908 to shrink well, we might be able to fix the occasional test timeouts here (which should be one of the final issues we have before we can merge this).
> if we can get the strategies from https://github.com/pydata/xarray/pull/6908 to shrink well
I think they already do shrink well. Each of them has corresponding unit tests, and none of those tests fail due to hypothesis timeouts.