[Bug] Windows CI Tests returns success status even for failed tests
Describe the bug
The test suite is run after the builds in CI for the three OSes. However, failed tests do not fail the step at the end of the test run. Failures seem to have gone unnoticed forever, even on the main branch.
To Reproduce Steps to reproduce the behavior:
- Open any OSGeo4W workflow job log.
- Go to the Run tests step.
- See errors, like https://github.com/OSGeo/grass/actions/runs/7401286176/job/20136723302#step:9:31
Expected behavior Failed tests that aren't expected to fail on Windows should result in a CI failure, or at least an annotation.
Screenshots Right at the beginning, and throughout the whole execution, we can see various errors that don't all seem to have the same cause (except "It's all because they use Windows").
System description (please complete the following information):
- Operating System: Windows 2019 in GitHub Actions CI, current main branch
- GRASS GIS version: 8.4 (main branch as of 2024-01-04, and for a long time before that)
Additional context
For almost a week now I've been exploring ways to make the Windows CI a bit faster. While the build step takes around 4 minutes on Ubuntu and about 7.5 minutes on macOS, the build step alone takes more than 20 minutes on Windows. I've finally worked through it, understood a bit more, and easily cut that time in half. But now that the build step passes, I'm at the tests step, and I'm quite surprised to see that a whole lot of tests fail for all sorts of reasons, and nothing surfaces it apart from reading the CI logs attentively. If this is the expected behaviour, I'll just remove the test steps and save 50+ minutes for each run. If we can't act on or enforce having no failed tests (even progressively), it's useless.
It doesn't really help me optimize if I can't know when I break something.
The number of "accepted" failures for win is 60 per cent, set with:
https://github.com/OSGeo/grass/blob/e4431cc79acf7f5ca42389c21e100f3c9bc51c56/.github/workflows/test_thorough.bat#L5
This is obviously not good; the reason not to remove the failing tests is so we don't forget them. Ideally, they should be addressed.
Is there a way to annotate expected failures on each of the individual failed tests in the 58/60 files, and then enforce no new failures?
Or at least bring the percentage up, but that is tricky since adding a new passing test doesn't raise the enforced percentage, while what we want is a baseline that can only improve.
You can, at least on your testing fork, add a config file to exclude specific failing tests for Windows, like for the Mac:
https://github.com/OSGeo/grass/blob/e4431cc79acf7f5ca42389c21e100f3c9bc51c56/.github/workflows/macos.yml#L74-L77
Thanks, @echoix for the effort.
There is also: https://github.com/OSGeo/grass/pull/2616
Unfortunately, I did not find the time to finish that off. Maybe it's better to address failures one by one...
Unfortunately, it's not only macOS. We have a number of excluded tests even for Linux (in .gunittest.cfg). Ideally, all tests would be fixed. Both the percentage and the exclusion are workarounds to get tests running in CI. The percentage was implemented first because that's just a stat which the original testing server (fatra) reported. With modern CI, pass/fail is more appropriate, hence the exclusion.
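For reference, the exclusion mechanism is just a list of test files in a config file. A sketch of what a Windows-specific one could look like, assuming the same section and key names as the existing .gunittest.cfg (worth double-checking against that file), with made-up paths:

```ini
[gunittest]
# Test files skipped entirely on this platform; the paths below are examples only.
exclude =
    raster/r.example/testsuite/test_r_example.py
    vector/v.example/testsuite/test_v_example.py
```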
Before supper I was getting close to something. I understood that I needed an env var for GISBASE pointing to the install location, and found the correct parameter to pass to pytest to get the correct Python path, and now the only errors left (for collecting the unittest tests with pytest) were multiple library loading errors. I even got integration into the VS Code UI working (only for pytest for now), but running through the interface was really slow, and it seemed to only update at the end of all tests, not as they are completed. I wanted to try the different, more modern library loader that one collaborator in the ctypesgen repo wants to bring forward, as they use it in their pypdfium2 fork. They suggested it to me in a PR that I made yesterday or so to bring in the changes that I reverted here. I didn't have time to try it out yet. Also, since it was on Gitpod and I took a break and closed the environment, the issue might have been the install step that was partly wiped, as I didn't recompile everything, only installed the leftovers of the build.
All that to say that, if it works out, and as I see from the partially successful collection of tests, the unittest-derived classes, our gunittest, might work correctly as is; if so, we could use the pytest markers to achieve the selection of tests. Or at least use the unittest-based decorators, which would immediately be supported in pytest without us having to code extra functions onto something that is already too custom and specialized.
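For example, the standard unittest decorators already cover platform-based selection, and pytest honors them on unittest.TestCase subclasses without any extra glue (class and test names below are made up):

```python
import sys
import unittest


class TestExample(unittest.TestCase):
    # pytest collects TestCase subclasses and respects these standard markers as-is.
    @unittest.skipIf(sys.platform.startswith("win"), "not supported on Windows yet")
    def test_runs_everywhere_but_windows(self):
        self.assertTrue(True)

    @unittest.expectedFailure
    def test_known_to_fail(self):
        self.assertEqual(1, 2)
```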
To follow...
And by the way, is there a way to not use all cores for running one test? In Gitpod, even though they advertise 4 cores with X memory and all, I always seem to have access to 16 (server) cores, and way more memory than my laptop. So when I run a test, it takes a while (maybe too long for my liking) to start up the test, then runs a module (like some kind of raster one) with 8 cores or more, but only for a couple of htop updates, then finishes. The problem is that if I try to parallelize execution of tests, to eat up the dead time starting a test case, I'll end up with 16*16 or more subprocesses at the same time. I saw something in gisenv named NPROC or something close to it, but how do I set it without changing the test suite? Do I just have to add it as an env var in my command call?
I've got something that seems to work to mark tests conditionally expected to fail. It is a decorator, and works with existing gunittest.
What would be a better name than @expectedFailureIfWindowsUntilFixed, where @expectedFailure is an existing decorator name in unittest that is called?
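Not necessarily the implementation in question, but a minimal sketch of how such a decorator can be built on top of the existing unittest.expectedFailure:

```python
import sys
import unittest


def expectedFailureIfWindowsUntilFixed(test_item):
    """Apply unittest.expectedFailure only on Windows; run the test normally elsewhere."""
    if sys.platform.startswith("win"):
        return unittest.expectedFailure(test_item)
    return test_item
```

A test method would then just carry @expectedFailureIfWindowsUntilFixed, and if it ever starts passing on Windows, the run reports an unexpected success.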
I've got something that seems to work to mark tests conditionally expected to fail. It is a decorator, and works with existing gunittest.
What would be a better name than @expectedFailureIfWindowsUntilFixed, where @expectedFailure is an existing decorator name in unittest that is called?
The advantage of having the failing tests listed in an exclude file is the simple overview it gives of what is not working. Moving that into code in the separate test files makes it less likely to be fixed, no?
... The advantage of having the failing tests listed in an exclude file is the simple overview it gives of what is not working. Moving that into code in the separate test files makes it less likely to be fixed, no?
Correct, I think the ideal is one config file in the top directory without any excluded tests, and all tests actually running. The intermediate situation is multiple files with lists of exclusions for tests which were never fixed to work in CI (note: some tests were created back in Subversion times, before any CI automation was in place, so there was no control over them passing or failing at the time of the initial commit or later).
Having "expected failure" in the the file usually indicates that there is no intention to fix this. Common, perfectly okay case is a missing optional dependency. Less ideal, but still valid is a tests failing because of a known bug (that bug should be reported and tracked in some other way in addition to the "expected failure").
You are bringing a different view, which I'm taking into account to find something that also satisfies these requirements. But what I was taking into account, and didn't explain yet, was more about making sure the current state doesn't regress or semi-fix itself without us knowing. In QA, a bug that got fixed without us knowing how is worse than a bug that is still a bug. The way I was hoping to accomplish this is to take a census of all the tests that don't currently work on Windows, note their failures, and expect them to stay the same. Then we could raise the bar to require that all the rest must pass and that the noted failures don't change. If a test ever gets accidentally fixed, it will get flagged in CI (the result would be an unexpected success). If we can explain it, we remove that flag, and it shall never fail again, like any other test. That's the least invasive way for now.
To meet your points, I was trying hard this afternoon to adapt the reporting (at least the console output side of things) to make sure the counts of expected failures (and unexpected successes) show up at the end. But since gunittest has diverged a little bit from pure unittest, which now understands these concepts a bit better (on Python 3.12 it can finally report durations like pytest) and should be able to report them, I didn't get through.
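For reference, plain unittest already tracks these outcomes on the result object, so the report mostly needs to surface the counts; a minimal sketch (the testsuite directory name is just an example):

```python
import unittest

# Discover and run one testsuite directory, then read the outcome lists that
# the standard result object already keeps.
suite = unittest.defaultTestLoader.discover("testsuite", pattern="test_*.py")
result = unittest.TextTestRunner(verbosity=1).run(suite)
print(f"expected failures:    {len(result.expectedFailures)}")
print(f"unexpected successes: {len(result.unexpectedSuccesses)}")
```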
My attempts yesterday and a bit before were to get it working with pytest, and to validate that my Windows build experiments weren't regressing by the results I could get from pytest on the side. I managed to fix (not in a clean form yet) the errors that prevented successful discovery of the tests, either by moving some dynamic setup calls that were in the class fields rather than in setUpClass, or by various patching of env vars, like LD_LIBRARY_PATH (that's a discussion for another day: why are we the only Linux software that needs to do this in order to start), PYTHONPATH, GISBASE, and lastly GISRC.

I have some problems with that last one: once I set it to a file, I get the next missing variable, LOCATION_NAME. I tried various ways of setting it; it seems GRASS only listens to what is in the contents of that GISRC file. It can't create the folders by itself; they need to exist beforehand. And all of these pre-steps just to be able to do no GIS-related work yet. In order to have even an empty Python test that ends up importing from our code, I end up needing Module to work so that r.info can be called to give its --interface-description. And since that's C code, the checks for gisdbase, location, and mapset are made in G_gisinit (the preprocessor macro that redirects to the real function with two underscores), and G_gisinit makes these checks really early; they happen before we even look at what is being asked for, like whether it's just the interface description (that comes later, underneath the options assignment in G_parser, and I don't see how it would be reasonable [non-breaking] to change that without shaking every bit of code like a snow globe). I've left out many in-between steps that I've hopped through while following parts of the chain of code.

I'm quite surprised that so much is strictly required to get into a neutral state. My hope was to define a way to just act as if the software was started, pointing to a testing environment, without affecting the existing user files (the things in grassdata), so conflicting tests can't ruin each other. But I got really stuck and didn't really know how to keep moving forward with so many things to address before reaching my goal. I've also tried running Python wrapped through grass --tmp-location XY --exec ..., even if I don't like it, but that's not enough to get it started either.
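To make those pre-steps concrete, here is a rough sketch of the environment a bare pytest run seems to need before any GRASS code can start; all paths are illustrative, and the location pointed to must already exist and be valid:

```python
import os
from pathlib import Path

# Assumed install and data locations; adjust to the actual build/install paths.
gisbase = Path("/usr/local/grass84")
gisdbase = Path.home() / "grassdata"   # must already contain the location below
location, mapset = "nc_spm_full_v2alpha2", "PERMANENT"

# GRASS reads GISDBASE/LOCATION_NAME/MAPSET from the GISRC file, not from
# separate environment variables, so write a minimal one and point GISRC at it.
gisrc = Path("/tmp/gisrc")
gisrc.write_text(f"GISDBASE: {gisdbase}\nLOCATION_NAME: {location}\nMAPSET: {mapset}\n")

os.environ["GISBASE"] = str(gisbase)
os.environ["GISRC"] = str(gisrc)
os.environ["PATH"] = os.pathsep.join(
    [str(gisbase / "bin"), str(gisbase / "scripts"), os.environ.get("PATH", "")]
)
os.environ["LD_LIBRARY_PATH"] = str(gisbase / "lib")        # C libraries at runtime
os.environ["PYTHONPATH"] = str(gisbase / "etc" / "python")  # makes `import grass` work
# Note: LD_LIBRARY_PATH and PYTHONPATH only take effect for processes started
# after this point, so this would run before launching pytest itself.
```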
My other idea, unexplored yet since it would probably add a little overhead, would be to make a fixture that does that work of fiddling with whatever each part of GRASS needs set up in order to boot up, and have the tests that need it use it when setting up the tests (it could be test-session wide or as needed, not necessarily at each test). But what was keeping me from doing this is that I still need to understand everything that's needed to get it working anyway, so I was better off trying to get it working manually from the exterior and, if needed, packing it into a fixture afterwards. I'm pretty certain that gunittest started out like this, but now it too needs a pre-working setup in order to make the rest work.
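As a rough illustration of that fixture idea, assuming grass.script is importable, that gs.create_location can create a plain XY location, and that grass.script.setup.init() accepts a database path, location, and mapset and returns a context manager (as in current GRASS 8); the fixture name is made up:

```python
import pytest
import grass.script as gs
import grass.script.setup as gsetup


@pytest.fixture
def grass_session(tmp_path):
    """Create a throwaway XY location and run the test inside a GRASS session."""
    location = "test_location"
    gs.create_location(tmp_path, location)  # plain XY location, no CRS
    with gsetup.init(tmp_path, location, "PERMANENT"):
        yield  # the test body runs here with the session environment set up
```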
Holidays are soon over, it'll be back to normal speed in a short time.
...In QA, a bug that got fixed without us knowing how is worse than a bug that is still a bug. The way I was hoping to accomplish this is to take a census of all the tests that don't currently work on Windows, note their failures, and expect them to stay the same...
This makes sense if you assume these are bugs. It is at least equally likely, if not more likely, that these are just badly written tests which need to be fixed. The exclusion list is a todo list of tests to fix which also makes the CI work. I think hiding the never-vetted Subversion-time tests behind expected failure will cause them to never be fixed (in addition to increasing runtime). The note about that is deep in the code and written in a way that makes the code look right. To get the CI running, the exclude feature is sufficient. A proper solution requires the tests to be fixed (or removed if they turn out to be broken beyond repair). Having something in between which requires development and subsequent maintenance of new code is not worth it.
Generally, I would like to avoid adding new code to gunittest unless necessary. Seeking convergence with current unittest and/or pytest is preferred in my eyes.
...I managed to fix (not in a clean form yet) the errors that prevented successful discovery of the tests,
I still don't know what you are referring to here. The tests are discovered and run in the CI. The setup needed for that is in the YAML. What's wrong?
...either by moving some dynamic setup calls that were in the class fields rather than in setUpClass, or by various patching of env vars, like LD_LIBRARY_PATH (that's a discussion for another day: why are we the only Linux software that needs to do this in order to start), PYTHONPATH, GISBASE, and lastly GISRC...
Doing this at the test framework level rather than in test fixtures/setups was the decision I made for gunittest, and I don't want to follow that for pytest. One issue is a lot of code useful only for gunittest-based tests which needs to be maintained. Maybe we can have a call to discuss the session issues.
My other idea, unexplored yet since it would probably add a little overhead, would be to make a fixture that does that work of fiddling with whatever each part of GRASS needs set up in order to boot up...
I really recommend pursuing this direction. That's how the current pytest tests are written if they need a GRASS session. If they don't need it, there is none. Fixtures are local to a directory, which is an area that can certainly be improved, because some of them could be reusable. If the session setup needs to be better in any way, let's improve that. This will then be applicable outside of the tests.
Or another take on the GRASS session issue: managing GRASS sessions is a real issue and there are things to fix and improve, but it is a general issue to solve for all use cases, not an issue to be addressed at the testing framework level (except by using general code in combination with fixtures/setups, which is the standard way to use a testing framework).
I agree with your ideas. I'm interested in a call for brainstorming things out, but not right now, maybe in a month or so, to finish thinking about approaches in the background.
I understand that, from the "fixing things" perspective, an exclusion list is indeed the best choice. But what we have available now is file-wide exclusion (skipped, and not counted in the stats), while I was looking for a way to keep the files running but specifically exclude only the failing tests within them. I don't have a good solution yet to reconcile all positions.
To address the preference for no new code in gunittest, or for convergence with unittest, I've got two things. First, the decorators are pure unittest, which will stay supported. Second, I started to compare what the gunittest code contains with current unittest, and with what was current when it was written. I started off with one class for now, and was able to simplify almost all of it down to a single additional function by using class inheritance from the standard library's unittest, while keeping the same behaviour. Would that be a wanted first step, simplifying the maintained code a bit?
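As a generic sketch of that simplification pattern (class and method names here are hypothetical, not gunittest's actual ones): inherit from the standard-library class and add only the GRASS-specific piece instead of re-implementing the behaviour.

```python
import unittest


class ExampleLoader(unittest.TestLoader):
    """Reuses standard discovery unchanged and adds one GRASS-flavoured helper."""

    def discover_testsuites(self, top_dir):
        # Only the pattern/layout knowledge is custom; everything else is inherited.
        return self.discover(top_dir, pattern="test_*.py")
```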