[Bug] Tests time out at 6h+, possibly r.futures.calib hangs
Some tests are hanging and taking 6+ hours before getting killed when running on CI.
Name of the addon
The last line of the log shown is r.futures.calib, but the problem may be elsewhere.
Describe the bug
A couple of mornings this week, plus this weekend, I manually stopped some jobs in grass-addons as they were on their way to taking 6h.
To Reproduce Steps to reproduce the behavior:
See the logs for build and test, most often (if not only) of the hanging jobs are for main+3.11.
Expected behavior
Tests are fast, or at least, not hanging.
System description:
On CI runners.
Additional context
I don't think my work last weekend is the cause: some of the PRs weren't timing out, but some did. Maybe I just allowed more tests to pass (with more dependencies or better translatable strings) and we ended up with hanging tests. There were some timeouts in the weeks before my last work here, but not as consistently as this last week. (Even before the caching: one day I tried deleting all the caches here, and the next daily run timed out too.)
It may be a while loop (which I didn't find in the first two r.futures tests and their code, though there are a lot of nested for loops).
It could be a case where some input is expected and the process is waiting for it, but that's more common with cmd.exe on Windows.
Could stdin/stdout/stderr be closed in a way that the gunittest test runner doesn't handle well?
We could try 8.4.0 with Python 3.11, or 8.5.0dev with the Python version used in the 8.4.0 job (if it is 3.9+), to see whether it is a Python problem or a regression in development GRASS in the last month(s) or so.
@petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?
On my fork, I saw some timeouts 3 months ago and 2 weeks ago, but with 8.4.0. So maybe it is not a regression but something specific to the addons.
I also see that r.random.walk can run really long. https://github.com/echoix/grass-addons/actions/runs/13048290473/job/36402778863
Are we hitting a memory/swap limit? The job could get killed in that case, no?
> @petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?
Well, it is a non-trivial model, so it's possible something is wrong, but where do you even see it, and how often does it happen?
Pick any one of the 6h ones from https://github.com/OSGeo/grass-addons/actions/workflows/ci.yml?query=is%3Afailure, or some cancelled ones from this week: https://github.com/OSGeo/grass-addons/actions/workflows/ci.yml?query=is%3Acancelled
A harder-to-debug situation would be if this and/or r.random.walk are stochastic/gradient-descent algorithms that don't converge, i.e., convergence depends on the initial state or seed.
We could always limit the test step to 45 minutes and save ourselves 5 hours of CI time in the European mornings.
A limit sounds good, but a per-test limit seems more informative for failures.
> Limit sounds good, but the per-test limit seems more informative for failures.
We might as well use both, since if a test is effectively hung, the per-test timeout might not fire.
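As a sketch of both limits combined (the step name, script, and the 300-second per-test value are assumptions for illustration, not taken from the actual ci.yml; the per-test limit relies on the pytest-timeout plugin and would only cover the tests run through pytest):

```yaml
# Sketch only -- step name and run script are placeholders
- name: Run addon tests
  timeout-minutes: 45          # step-level limit: fail after 45 min instead of the 6 h job limit
  run: |
    # per-test limit via the pytest-timeout plugin (pip install pytest-timeout)
    pytest --timeout=300 .
```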
I've got to a point where I'm able to run r.futures.calib and r.futures.parallelpga successfully with pytest, but not with gunittest: r.futures.calib still hangs under gunittest.
I overcame one of our previous limitations with pytest: I managed to write an autouse fixture that activates when the test directory is named "testsuite" and contains a data directory. If so, it copies the data directory into a pytest tmp directory and changes the current working directory with monkeypatch (so the change gets reverted afterwards), and it really works.
One problem is that, since these are not native pytest tests, the setUpClass() methods run before the autouse fixture can change the directory, so I switched to setUp() instead (per test). It is still much faster than waiting for timeouts.
If we find a way to have the data directory copied before setUpClass(), we would likely have everything solved for running everything with pytest (even in the main repo).
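A minimal sketch of the autouse fixture described above (the fixture name and layout assumptions are illustrative, not the actual implementation): it only activates for tests living in a `testsuite` directory that has a `data` subdirectory, copies `data` into a per-test tmp directory, and uses `monkeypatch.chdir` so the working directory is restored after each test.

```python
# conftest.py -- sketch, not the actual fixture
import shutil
from pathlib import Path

import pytest


@pytest.fixture(autouse=True)
def run_in_tmp_copy_of_data(request, tmp_path, monkeypatch):
    """Copy a testsuite's data directory into a tmp dir and chdir there.

    Activates only when the test file lives in a directory named
    "testsuite" containing a "data" subdirectory; monkeypatch reverts
    the cwd change after each test.
    """
    test_dir = Path(request.fspath).parent
    data_dir = test_dir / "data"
    if test_dir.name == "testsuite" and data_dir.is_dir():
        shutil.copytree(data_dir, tmp_path / "data")
        monkeypatch.chdir(tmp_path)
    yield
```

Because pytest applies fixtures per test (not per class), this matches the setUp()-based approach; covering setUpClass() would need something class-scoped instead.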