
Providing reproduction code for failed test cases

Open mtsokol opened this issue 7 months ago • 9 comments

Hi all!

Here's one idea on how to potentially improve interacting with failing tests in the array-api-tests suite, which @ev-br and I discussed this week.


I've been involved in introducing the array-api-tests suite in a few repositories and, from my personal experience, the activity that took the most time was figuring out the root causes of failed tests.

Each failure shows the stack trace (or multiple stack traces, if there were e.g. 3 distinct failures for a given test), but at times they are only loosely related to the actual root cause. For instance, in NumPy we have an xfail:

# fails on np.repeat(np.array([]), np.array([])) edge test case
array_api_tests/test_manipulation_functions.py::test_repeat

which originally was reported by the test suite with an error:

Cannot cast array data from dtype('float64') to dtype('int64')

I had to manually recreate the function call with the exact inputs to understand which edge case we hit. For each test, the error message was either accurate (e.g. a missing keyword argument) or irrelevant; the irrelevant ones mostly involved array scalars, Python builtins, or 0-D arrays as inputs or outputs.

From my point of view, one possible improvement of this process could be something like a CLI option --with-repro-snippets, where each failing test is accompanied by a copy-paste line that exactly reproduces it.

So for test_repeat I would get:

Here's a line that reproduces it:
xp.repeat(xp.asarray([]), xp.asarray([]))

The array-api-tests suite would compose it as f"xp.{func_name}({inputs}, {kwargs})" when a function call fails.
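A minimal sketch of how such a snippet could be composed (the helper names here are hypothetical, not part of the suite):

def format_value(value):
    # Render arrays as copy-pasteable xp.asarray(...) calls; fall back to
    # repr for scalars and builtins. Assumes the array exposes tolist().
    if hasattr(value, "tolist"):
        return f"xp.asarray({value.tolist()!r})"
    return repr(value)

def repro_snippet(func_name, args, kwargs):
    parts = [format_value(a) for a in args]
    parts += [f"{k}={format_value(v)}" for k, v in kwargs.items()]
    return f"xp.{func_name}({', '.join(parts)})"

# repro_snippet("repeat", (xp.asarray([]), xp.asarray([])), {})
# -> "xp.repeat(xp.asarray([]), xp.asarray([]))"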

This way, when I run the test suite for the first time and get e.g. 50 failures, I can iterate a bit faster by skipping the "reproduce the failing code snippet" step. WDYT? Please share your thoughts!


More of a nitpick, but some Array API functions are used in multiple tests, like reshape; I think it's used primarily for setting up test inputs. When that one function is missing from the namespace, a large part of the test suite fails with cryptic error messages. There could be a hasattr(xp, "reshape") decorator for tests, so that if a function needed for setting up inputs is missing, the error says so explicitly.

mtsokol avatar May 21 '25 10:05 mtsokol

Thanks @mtsokol !

This is a great idea. A couple of quick comments:

  • hypothesis does indeed sometimes make figuring out the inputs quite hard!
  • the current ph.assert_* helpers go to incredible lengths to give useful diagnostics. Maybe we could improve specific helpers? You mention test_repeat, do you happen to remember other cases with less-than-useful diagnostics?
  • As an example where the full output is not that helpful: https://github.com/data-apis/array-api-tests/actions/runs/15072644093/job/42372581562
  • You mention xp.reshape, but https://github.com/search?q=repo%3Adata-apis%2Farray-api-tests%20reshape&type=code shows what, three calls?

ev-br avatar May 21 '25 20:05 ev-br

the current ph.assert_* helpers go to incredible lengths to give useful diagnostics. Maybe we could improve specific helpers? You mention test_repeat, do you happen to remember other cases with less-than-useful diagnostics?

I can take a look at the tests that I've fixed in the past and provide a few more next week!

You mention xp.reshape, but https://github.com/search?q=repo%3Adata-apis%2Farray-api-tests%20reshape&type=code shows what, three calls?

In array-api-tests that's correct. But the situation that I described originated in the Hypothesis package (sorry for the omission). In do_draw, which is used for drawing samples, Hypothesis calls reshape:

https://github.com/HypothesisWorks/hypothesis/blob/366e5e58fa5fb5cab088cef121e0775254b28a2c/hypothesis-python/src/hypothesis/extra/array_api.py#L452

When xp.reshape was temporarily broken, it caused 148 errors for us in CI; you can search for the word reshape there.
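To illustrate why that one function matters so much (a simplified paraphrase, not the actual Hypothesis code): every drawn example is built flat and then reshaped, so a broken xp.reshape fails input generation for essentially every test:

def do_draw_simplified(xp, flat_values, shape):
    # Hypothesis builds the example as a flat array first...
    result = xp.asarray(flat_values)
    # ...then reshapes it to the drawn shape, so every generated
    # input goes through xp.reshape.
    return xp.reshape(result, shape)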

mtsokol avatar May 23 '25 09:05 mtsokol

I can take a look at the tests that I've fixed in the past and provide a few more next week!

Thanks! That'd be very helpful.

In array-api-tests that's correct. But the situation that I described originated in the Hypothesis package (sorry for omission). In def do_draw, that is used for drawing samples, Hypothesis calls reshape:

That's a good point. Not much we can do about it at the array-api-tests level though. Looking at that hypothesis/extra/array_api module, it uses self.xp.reshape, self.xp.asarray, self.xp.zeros, self.xp.isnan. If any of these is broken, then indeed, so is the whole of array-api-tests. That's unfortunate, but I don't know if there's an easy way out. The "obvious" solution of generating test arrays with numpy and converting them via xp.asarray will not work for non-CPU devices. When dlpack 1.0 is mature enough and available everywhere, we could possibly update hypothesis/extra/array_api to generate numpy arrays and convert them via xp.from_dlpack, but then a new array library would need to support dlpack from the start.
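For concreteness, a hedged sketch of that dlpack-based idea, assuming the target namespace supports from_dlpack with the device keyword from the 2023.12 revision of the standard:

import numpy as np

def make_test_array(xp, values, device=None):
    # Generate deterministically on CPU with numpy...
    a = np.asarray(values)
    # ...and hand off through the standard dlpack entry point, letting
    # the library place the data on its own device.
    return xp.from_dlpack(a, device=device)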

ev-br avatar May 23 '25 10:05 ev-br

Would it make sense to require reshape, asarray, zeros, and isnan in order to run the test suite? These could be considered essential.

mtsokol avatar May 23 '25 10:05 mtsokol

You mean add a note to this effect to the "Interpreting errors" section of the README? https://github.com/data-apis/array-api-tests/blob/master/README.md

As a usability check: would this have been useful to you when you had issues with xp.reshape? If yes, sure, please send a PR! If not, then what would be?

ev-br avatar May 23 '25 17:05 ev-br

You mean add a note to this effect to the "Interpreting errors" section of the README? https://github.com/data-apis/array-api-tests/blob/master/README.md

Yes, I think a note listing which functions are essential to run most of the test suite would be useful.

As a usability check: would this be useful to you when you had a issues with xp.reshape? If yes, sure, please send a PR! If not, then what would be?

Here I think my answer is also "yes", but I'm not sure how checking this for each test would impact execution time, and it's a bit of a noisy change for the repository:


import pytest

ESSENTIAL_FUNCS_MISSING = any(
    not hasattr(xp, name) for name in ["reshape", "asarray", "zeros", "isnan"]
)

ensure_essentials = pytest.mark.skipif(
    ESSENTIAL_FUNCS_MISSING,
    reason="Essential functions are missing from the namespace: ...",
)

# in all relevant files:

@ensure_essentials
def test_function(): ...

mtsokol avatar May 26 '25 11:05 mtsokol

As long as it's a couple of hasattr checks, it's not a problem (any hypothesis work dwarfs that, by far).

Basically, what you're talking about is https://github.com/data-apis/array-api-tests/issues/51, and if we do it, then each test should get its decorator with the actual list of its dependencies. (Never mind the "low priority" label on that issue; I think this is a useful thing to do.)
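For illustration, a hypothetical sketch of such a per-test decorator (depends_on is not an existing array-api-tests helper; xp stands for the namespace under test):

import pytest

def depends_on(*names):
    # Skip the test if any function it relies on is missing from xp.
    missing = [name for name in names if not hasattr(xp, name)]
    return pytest.mark.skipif(
        bool(missing),
        reason=f"test dependencies missing from the namespace: {missing}",
    )

@depends_on("reshape", "asarray")
def test_example(): ...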

OTOH, if the main pain point is with the hidden dependencies of hypothesis.array_api, then a single test probably suffices?

ev-br avatar May 26 '25 13:05 ev-br

When developing ndonnx, we sometimes encountered issues with these "essential" functions, too. However, as the package matured, we passed that bar and have not had issues in this regard in a long time. I think our development process would have been a bit easier if the test suite had forced us to implement these essential functions first and explicitly.

cbourjau avatar Jun 02 '25 08:06 cbourjau

Here's a low-tech POC to emit exact reproducing snippets for failed tests: https://github.com/data-apis/array-api-tests/pull/384 (the bulk of the diff is a trivial indentation change, so it is best viewed with the "hide whitespace" option)

ev-br avatar Jun 06 '25 16:06 ev-br