xarray More granularity in the CI, separating code and docs changes?

What is your issue?

Hi,

TLDR: Is there a way to only run relevant CI checks (eg documentation) when a new commit is pushed on a PR's branch?

The following issue is written from a naive user point of view. Indeed I do not know how the CI works on this project. I constated that when updating an existing Pull Request, the whole test battery is re-executed. However, it is a common scenario that someone wants to update only the documentation, for instance. In that case, it might make sense to only retrigger the documentation checks. A little bit like pre-commit that only runs on the updated files. Achieving such a level of granularity is not desirable as even a small code change could make geographically remote tests in the code fail, however, a high-level separation between code and docs for instance, might relieve a little bit the pipelines. This is assuming the code does not depend at all on the code. Maybe other separations exists, but the first I can think of is code vs docs.

Another separation would be to have an "order" / "dependency system" in the pipeline. Eg, A -> B -> C ; if A fails, there is no point into taking resources to compute B as we know for sure the rest will fail. Such a hierarchy might be difficult for the test matrix that is unordered (eg Python Version x OS, on this project it seems to be more or less (3.9, 3.10, 3.11, 3.12) x (Ubuntu, macOS, Windows)

There is also a notion of frequency and execution time: pipelines' stages that are the most empirically likely to fail and the shortest to runshould be ran first, to avoid having them fail due to flakiness and out of bad luck when all the other checks passed before. Such a stage exists: CI / ubuntu-latest py3.10 flaky (it is in the name). Taking that into account, the CI Additional / Mypy stage qualifies for both criteria should be ran before everything else for instance. Indeed, it is static code checking and very likely to fail, something a developer might also run locally before committing / pushing, and only takes one minute to run (compared to several minutes for each of stages of the Python Version x OS matrix). The goal here is to save resources (at the cost of losing the "completeness" of the CI run)

Feb 04 '24 20:02 etienneschalk

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

Feb 04 '24 20:02 welcome[bot]

Thanks @etienneschalk These are good ideas.

Some things to consider:

The docs includes docstrings that are in the code, so we do want to build this on every change. It'd be nice if RTD was smarter about caching previous builds though.
In my experience, it is common to have the shortest runs .e.g mypy or the minimal-deps env, fail in the early stages of iteration, and then fix them up when the PR gets close to complete. So it wouldn't be nice to cancel the rest if these fail.
We used to have mypy as a pre-commit stage but it places a high bar on new contributors.

Feb 04 '24 21:02 dcherian

Hi @dcherian,

Thanks for your answer!

OK, a code change should trigger a doc build because of the docstrings, but what about the opposite? Here is how I see it (doc is always built):

	code change	doc change
code checks	:white_check_mark:	:white_check_mark:
doc build	:white_check_mark:	:x:

OK, I see the point. I understand: shortest runs can be more cosmetic, and even if they fail, one might want to see if the battery of tests pass before fixing them at the end of the PR.
Indeed satisfying MyPy is not an easy task... I saw there is also Pyright stage that seems to be always skipped. Has it been considered switching from MyPy to Pyright? I recently tested Pyright (also including it in pre-commit), and I found the error messages to be rather insightful, reminding me of the TypeScript development experience.

Flaky Tests

I define a flaky test as a test that satisfy those conditions:

The test seems unrelated to the content of the pull request. This condition solely is not enough as some small code changes can have impacts far away in the codebase (hence the need of the following second condition) ;
it seems to be randomly failing in one stage of the Python Version x OS matrix, but not the others ;
it may have succeeded in a previous similar run for the same PR.

Example

Example of flaky tests from my recent experience, on the same PR, failing on two different runs:

FAILED xarray/tests/test_backends.py::TestDask::test_dask_roundtrip only for stage windows-latest py3.12
Multiple tests failing due to Timeout >180.0s only for stage macos-latest py3.11 (note: TestDask.test_dask_roundtrip is mentioned)

Examples from other PRs:

Only the ubuntu-latest py3.12 stage fails because of a Timeout >180.0s and TestDask.test_dask_roundtrip mentioned

-> Conclusion: test_dask_roundtrip seems to be pretty flaky at the current moment

Feb 04 '24 21:02 etienneschalk

I saw there is also Pyright stage that seems to be always skipped. Has it been considered switching from MyPy to Pyright?

Before that we'll need to do something with the 2000+ errors reported from pyright after parsing the whole codebase :).

(with some configuration tweaks and excluding some folders that can be safely ignored it is reduced down to ~1400 errors, which is still a lot).

Feb 05 '24 12:02 benbovy

@etienneschalk Yes you're right! We could run the code checks only if files in xarray/* or ci/* are changed (I think).

A PR would be welcome, here are some docs I found. https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore

Feb 06 '24 15:02 dcherian

Example of flaky tests from my recent experience, on the same PR, failing on two different runs:

* `FAILED xarray/tests/test_backends.py::TestDask::test_dask_roundtrip` only for stage [windows-latest py3.12](https://github.com/pydata/xarray/actions/runs/7776497338/job/21203724620?pr=8698#logs)

* Multiple tests failing due to `Timeout >180.0s` only for stage [macos-latest py3.11](https://github.com/pydata/xarray/actions/runs/7768517575/job/21186540956#logs) (note: `TestDask.test_dask_roundtrip` is mentioned)

I have been marking flaky tests as xfailed, and think we should continue to do that.

If someone has a plan to fix them, that's even better, but in lieu of that I would support more xfailing...

Feb 06 '24 16:02 max-sixty

@benbovy this seems like a colossal task indeed! This may be reasonably impossible actually. Even for small codebases, this can rapidly get out of hand to type a posteriori, and even sometimes may lead to "forced typed code" that is less elegant or even capable of introducing bugs...

@max-sixty

I have been marking flaky tests as xfailed, and think we should continue to do that.

From my recent experience, I would be in for marking test_dask_roundtrip as flaky

I see there is a only instance of @pytest.mark.flaky from seven years ago

Otherwise, to use xfail, this sentence seems to summarize the best:

@pytest.mark.xfail(reason="Flaky test. Very open to contributions on fixing this")

Feb 06 '24 20:02 etienneschalk

xarray xarray copied to clipboard

More granularity in the CI, separating code and docs changes?

What is your issue?

Flaky Tests

Example

xarray
xarray copied to clipboard