Pain points in using the test suite
There were reports of several pain points with using the test suite. This is a central tracker for potential improvements to the test suite's ergonomics. Things previously reported:
- difficulty in tracking expected failures across libraries. ISTM this is now as good as it gets; I don't know how to improve on the current `--skips-file` and `--xfails-file` usage.
- https://github.com/data-apis/array-api-tests/issues/197 reported that the test suite is prohibitively slow on jax. It is unclear whether this is still a problem after the worst offenders were rewritten a while ago; there are several potential ideas in the issue. If it is still slow, the first step is to find out which tests are the slowest and where the time is spent (for instance, for torch I observed that the slowest tests spend the majority of their time in the internals of `hypothesis`).
- https://github.com/data-apis/array-api-tests/issues/379 reported that it is difficult to interpret failures, and https://github.com/data-apis/array-api-tests/pull/380 works towards improving the reporting.
- gh-329 and gh-169 reported issues with optional extensions
If you're using the test suite, we'd like to hear from you. Please add the pain points and/or suggestions for improvement to this issue.
Let me ping @crusaderky for dask, @adityagoel4512 and @cbourjau for ndonnx: what's bad/difficult/inconvenient in the test suite?
- Tests are not very meaningful when a library is going to be used in production in a way that fundamentally alters the functionality it delivers. The key example here is `jax.jit`. Important to note: that's not something that can be turned on with an env variable. I don't have a solution to this.
- Hypothesis hurts when there is a failure on an exotic edge case, but the only way to push through is to xfail the whole function including all of its basic use cases. Example: https://github.com/dask/dask/issues/11800
IMHO array-api-tests is doing a great job already! In fact, I am currently building a similar test suite that is strongly inspired by array-api-tests for the ONNX standard itself here. That said, I think it is great to hear that the project seeks to improve further.
https://github.com/data-apis/array-api-tests/issues/197 reported that the test suite is prohibitively slow on jax. It is unclear whether this is still a problem after the worst offenders were rewritten a while ago; there are several potential ideas in the issue. If it is still slow, the first step is to find out which tests are the slowest and where the time is spent (for instance, for torch I observed that the slowest tests spend the majority of their time in the internals of `hypothesis`).
While performance used to be an issue when running the test suite against ndonnx, this is no longer the case. One pain point that I do share is that interpreting failures is sometimes difficult. As discussed in #379, it would be great if the test suite could provide copyable snippets.
https://github.com/data-apis/array-api-tests/issues/329 and https://github.com/data-apis/array-api-tests/issues/169 reported issues with optional extensions
ndonnx also suffers from this paper cut, where missing extensions cannot be skipped by providing a CLI argument. While mildly annoying, we put those test cases into the skips.txt file and somewhat forgot about it.
Hypothesis hurts when there is a failure on an exotic edge case, but the only way to push through is to xfail the whole function including all of its basic use cases.
This is a situation we have also encountered in ndonnx. However, the red CI has thus far always successfully forced us to find a solution to the underlying issue in a timely manner. It is purely anecdotal, but had it been easier to turn off the tests for such corner cases, ndonnx would be a less reliable and less standard-compliant library today. I realize there may be corner cases that are impossible to implement for some backends. However, rather than a general mechanism for turning off individual corner cases downstream, an explicit upstream configuration for that particular class of issues may be a better way to maintain a certain standard across the ecosystem. For instance, one may consider adding an upstream flag to turn off certain zero-size inputs rather than allowing downstream test suites to define their own test-by-test filtering.
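To make this concrete, here is a rough sketch of what such an upstream knob could look like (the environment variable name, the `shapes` helper, and the use of `array_api_strict` are illustrative assumptions, not existing test-suite machinery):

```python
import os

from hypothesis.extra.array_api import make_strategies_namespace
import array_api_strict as xp

xps = make_strategies_namespace(xp)

# One shared, upstream-maintained knob instead of per-test downstream filters.
_MIN_SIDE = 1 if os.environ.get("ARRAY_API_TESTS_SKIP_ZERO_SIZE") == "1" else 0

def shapes(**kwargs):
    """Shape strategy that honours the zero-size opt-out."""
    kwargs.setdefault("min_side", _MIN_SIDE)
    return xps.array_shapes(**kwargs)
```

Tests would then draw their shapes from this one helper, so the opt-out lives in a single upstream location rather than in per-project skip lists.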
Lastly, one pain point I have experienced that is not yet named here: I find that understanding the workings of the test suite is challenging, which makes small contributions difficult. A lot of code is executed at file scope. I would find it easier to navigate the code base if it were (more) strictly typed.
An issue/idea regarding reproducing failed test cases: I have encountered cases where I would like to add an @example decorator to a test case, but don't want to patch the test case itself. A way to maintain a list of such examples in downstream projects may be handy. (Caveat: this likely hinges on the question of whether tests that use a data strategy can actually use examples, which I'm not sure about yet.)
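To make the idea concrete, a minimal illustration of what I mean by pinning an example (the test, the strategy, and the pinned value are made up; this is not array-api-tests code):

```python
from hypothesis import example, given, strategies as st

@given(st.integers())
@example(0)  # the edge case we want replayed on every run, without touching the test body
def test_identity_of_addition(x):
    assert x + 0 == x

# The open question from above: tests that draw interactively via st.data()
# inside the body have no concrete argument to hand to @example, so this
# pattern may not transfer to those tests.
```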
whether tests that use a data strategy can actually use examples
I suspect the answer is negative; I did not test it myself though, so I am open to pleasant surprises :-).
For reproducible snippets, I think a low-tech solution could be along the lines of this WIP patch: https://github.com/data-apis/array-api-tests/compare/master...ev-br:array-api-tests:repro_snippets?expand=1
It is a bit manual, and not "interesting" of course; the upside is that it's robust and probably applicable to the whole range of tests in the test suite, whether they use `data.draw` or fancy strategies.
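For illustration, here is the general idea reduced to a toy stand-alone test (this is not the actual patch; the test, the `xp` namespace choice, and the snippet format are just placeholders):

```python
from hypothesis import given, strategies as st
import array_api_strict as xp

@given(st.lists(st.integers(min_value=-100, max_value=100), min_size=1))
def test_sum(values):
    # Build a copy-pasteable snippet from the concrete drawn inputs...
    snippet = f"x = xp.asarray({values!r}); out = xp.sum(x)"
    x = xp.asarray(values)
    out = xp.sum(x)
    # ...and surface it in the assertion message, so a failing run prints
    # something one can paste straight into a REPL.
    assert int(out) == sum(values), f"reproduce with:\n    {snippet}"
```

The real thing would format the snippet from whatever the strategies actually drew, but the mechanics are the same: carry a copy-pasteable reproduction in the failure message.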
- Hypothesis hurts when there is a failure on an exotic edge case, but the only way to push through is to xfail the whole function including all of its basic use cases. Example: `Array.__setitem__` fails on size 0 dask/dask#11800
Agreed that it's annoying (oh yes it is), that this is by hypothesis design, and that the only structural fix is to fix the bugs in the libraries themselves (https://github.com/data-apis/array-api-tests/issues/381#issuecomment-2929556388).
A pragmatic thing to do for the most egregious cases could be to split such a test into two:
- limit the generation to avoid the failing case (e.g. an empty array on dask), and
- add an xfailed test which specifically probes the edge case. This edge case test does not even need to use hypothesis.
This strategy won't fly if there are multiple edge cases of this kind, and we probably don't want to deviate too much from the one-function-one-test maxim, but we can totally do it on a case-by-case basis where the benefits clearly outweigh the costs.
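For concreteness, a rough sketch of what such a split could look like (the function under test, the backend, and the xfail reason are placeholders, not actual suite code):

```python
import pytest
from hypothesis import given
from hypothesis.extra.array_api import make_strategies_namespace
import array_api_strict as xp

xps = make_strategies_namespace(xp)

# Main test: generation restricted to non-empty arrays, so the known
# zero-size failure does not poison all the basic use cases.
@given(xps.arrays(dtype=xps.scalar_dtypes(),
                  shape=xps.array_shapes(min_dims=1, min_side=1)))
def test_flip(x):
    assert xp.flip(x).shape == x.shape

# Edge-case probe: deterministic and xfailed; no hypothesis needed.
@pytest.mark.xfail(reason="zero-size arrays fail on <backend>; see upstream issue")
def test_flip_size_0():
    x = xp.zeros((0,))
    assert xp.flip(x).shape == x.shape
```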
- Tests are not very meaningful when a library is going to be used in production in a way that fundamentally alters the functionality it delivers. The key example here is jax.jit. Important to note, that's not something that can be turned on with an env variable. I don't have a solution to this.
Oh this is a very good point! Let's think this through. Pretty much all tests have the following structure:
```python
@given(... generate inputs ...)        # (1)
def test_foo(...):
    # additional generation of inputs  # (2)
    out = foo(inputs)                  # (3)
    # assertions about `out`           # (4)
```
Out of these four, we only want (3) to depend on jit or not; the rest are invariant. If so, we could wrap (3) into a function,

```python
out = magick(foo)(inputs)  # (3.1)
```

where `magick` is either a no-op or `jax.jit`, depending on an env variable or other config. That would involve updating all tests, yes, but that's small potatoes if it unblocks the underlying problem. WDYT?
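For concreteness, a minimal sketch of `magick` (the env variable name is just an assumption):

```python
import os

def magick(func):
    """Return `func` unchanged, or a jit-compiled version when requested."""
    if os.environ.get("ARRAY_API_TESTS_JIT") == "1":
        import jax
        return jax.jit(func)
    return func

# inside a test body:
#     out = magick(foo)(inputs)  # (3.1)
```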
- limit the generation to avoid the failing case (e.g. an empty array on dask), and
- add an xfailed test which specifically probes the edge case.
I'm a bit surprised that hypothesis doesn't offer a flag to generate only edge cases vs. everything but edge cases. It would be very easy to put that flag in `pytest.mark.parametrize`. Did anybody properly pore through their documentation / issue tracker? (I didn't.)
Out of these four, we only want (3) to depend on jit or not; the rest are invariant. If so, we could wrap (3) into a function, `out = magick(foo)(inputs)  # (3.1)`, where `magick` is either a no-op or `jax.jit`, depending on an env variable or other config. That would involve updating all tests, yes, but that's small potatoes if it unblocks the underlying problem. WDYT?
I think you just described `xpx.testing.lazy_xp_function` 😄
I'm a bit surprised that hypothesis doesn't offer a flag to generate only edge cases vs. everything but edge cases. It would be very easy to put that flag in `pytest.mark.parametrize`. Did anybody properly pore through their documentation / issue tracker? (I didn't.)
While I do not claim a deep understanding of hypothesis, ISTM this is a hypothesis feature, not a bug. Consider the very first paragraph of the "Welcome to hypothesis" page:
... you write tests ... and let Hypothesis randomly choose which of those inputs to check - including edge cases you might not have thought about
From the "Domain and distribution" page (emphasis mine):
The domain is the set of inputs that should be possible to generate. ... The distribution is the probability with which different elements in the domain should be generated. ... Hypothesis takes a philosophical stance that property-based testing libraries, not the user, should be responsible for selecting the distribution. As an intentional design choice, Hypothesis therefore lets you control the domain of inputs to your test, but not the distribution.
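In other words, what the test suite (or a downstream user) can do is shrink the domain through strategy arguments; there is no supported knob for steering generation towards or away from edge cases. A small illustration (the strategies and the backend here are just examples):

```python
from hypothesis import given
from hypothesis.extra.array_api import make_strategies_namespace
import array_api_strict as xp

xps = make_strategies_namespace(xp)

# Domain control: zero-size arrays are excluded from what can be generated at all.
@given(xps.arrays(dtype=xp.float64,
                  shape=xps.array_shapes(min_dims=1, min_side=1)))
def test_never_sees_empty_arrays(x):
    assert x.size > 0

# There is no counterpart knob along the lines of "generate mostly edge cases"
# or "never generate edge cases": how often any element of the domain comes up
# is left to Hypothesis.
```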