CI: parametrize over Array API modules
towards https://github.com/data-apis/array-api-tests/issues/235
This seems to be conceptually wrong. Now, what do we actually want to be testing here:
- a pedantic implementation of the spec? then we only need to run array-api-strict, as before
- whether specific array libraries implement the standard? They don't, as the failures show, and isn't this what array-api-compat is for?
- whether the array-api-compat wrappers are compatible with the spec? then we need to add -compat as a dependency (see the sketch below).
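For concreteness, a minimal sketch of what the third option could look like, built around the existing ARRAY_API_TESTS_MODULE mechanism for selecting the namespace under test; the default and the example module names are placeholders, not a proposal for the actual CI configuration:

```python
# Sketch only: resolve the namespace under test from an environment variable,
# so CI could parametrize over bare libraries and array-api-compat wrappers
# alike.  The default and the example module names are placeholders.
import os
from importlib import import_module

module_name = os.environ.get("ARRAY_API_TESTS_MODULE", "array_api_strict")
# e.g. "numpy", "array_api_compat.numpy", "array_api_compat.torch"
xp = import_module(module_name)
```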
I'm sure this was discussed before, so I'm going to stop for now :-).
EDIT: or maybe something like this can be useful for reporting, if we ever want to autogenerate compatibility reports for bare array api namespaces.
The thing we really want to test here is that the tests themselves don't break whenever we update them or write new ones. This is especially important if we want to do some more serious refactoring here. The test suite is a little odd in that we will often "break" upstream by adding a test that fails, because the upstream library is not conformant and the behavior was simply never tested before. But we don't want to "break" upstream by adding a test that is actually wrong, or that tests something in a way that isn't actually required by the standard (an example would be a floating-point closeness test that is too strict).
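As a toy illustration of the closeness point, not anything taken from the suite, with made-up tolerances:

```python
# Toy example: an over-strict closeness check vs. a more forgiving one.
# The tolerances here are invented for illustration only.
import math

def too_strict(actual: float, expected: float) -> bool:
    # Demands near-exact agreement, which the standard does not require.
    return math.isclose(actual, expected, rel_tol=1e-15, abs_tol=0.0)

def more_forgiving(actual: float, expected: float) -> bool:
    # Leaves some slack that different backends have a realistic chance of meeting.
    return math.isclose(actual, expected, rel_tol=1e-9, abs_tol=1e-12)
```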
So ultimately, what should go on CI here really comes down to developer convenience. The test suite will eventually be run upstream anyway, by the upstream projects themselves (and compat is probably the primary upstream, since it tests 4 different libraries). On the one hand, testing more libraries here will catch more things. On the other hand, to do that effectively we would have to maintain a set of skips/xfails here, because many failures are expected; this is the case even for compat. We could reuse the xfail files from the compat repo, although that would effectively amount to duplicating the compat repo's CI here. The benefit would be catching failures faster.
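As a rough sketch of what maintaining xfails here could look like, in the spirit of the per-library xfail files the compat repo keeps (the file path and one-test-id-per-line layout below are assumptions for illustration, not the compat repo's actual format):

```python
# Sketch: a conftest hook that marks tests listed in an xfails file at
# collection time.  The path and file format are assumptions.
from pathlib import Path

import pytest

XFAILS_FILE = Path("ci/xfails.txt")  # hypothetical location

def pytest_collection_modifyitems(config, items):
    if not XFAILS_FILE.exists():
        return
    xfail_ids = {
        line.strip()
        for line in XFAILS_FILE.read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    for item in items:
        if item.nodeid in xfail_ids:
            item.add_marker(pytest.mark.xfail(reason="known upstream failure"))
```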
But also consider that this repo doesn't have releases. If something breaks, you can very easily revert it.
The strict library is nice because it has almost no xfails. Plain NumPy 2.1 is like that as well, although that could change whenever new functions get added to the standard.
For now, I've made heavy use of local testing against various libraries when doing development here. If you feel that having that on CI would be better, then you should set it up; it really comes down to what's most convenient. That would mean either maintaining a set of xfails, or ignoring failures entirely and using CI simply as a vehicle to run the tests. Either way, you should definitely check any updated test against the major libraries.
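For concreteness, a minimal sketch of that kind of local run, assuming the ARRAY_API_TESTS_MODULE variable and run from the repo root; the module list is only an example of what might be installed locally:

```python
# Throwaway driver: run the suite once per namespace and report exit codes,
# rather than maintaining xfails.  Module names are examples.
import os
import subprocess

MODULES = ["array_api_strict", "numpy", "array_api_compat.torch"]

for mod in MODULES:
    env = {**os.environ, "ARRAY_API_TESTS_MODULE": mod}
    result = subprocess.run(["pytest", "-q"], env=env)
    print(f"{mod}: exit code {result.returncode}")
```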
Regarding reporting, there was a lot of work on this previously (see reporting.py), and there was much discussion about whether the tests should be run here or in the upstream projects. The idea at the time was to run the tests in the upstream projects and have them upload reports.
That work will probably start up again at some point. For what it's worth, I now feel that running the reporting tests in the upstream repos was getting too complicated, so I would instead suggest running them here or in some other data-apis controlled repo.