Update CI stability
With a couple of recent merges in, I triggered another "CI stress test" yesterday that runs our suite several times in a row (this time 10).
See https://github.com/fjetter/distributed/tree/stress_ci, which is based on https://github.com/dask/distributed/commit/dc019ed2398411549cdf738be4327a601ec7dfca with https://github.com/fjetter/distributed/commit/68689f0c9c41f0b1faf91888eea8d0be7745338a on top.
The results of this test run can be seen at https://github.com/fjetter/distributed/runs/7029246894?check_suite_focus=true
Summary
We had 80 jobs in total, spread across the different operating systems and Python versions, of which 32 failed.
OS | success | failure | total | success rate |
---|---|---|---|---|
Ubuntu | 28 | 2 | 30 | 93% |
Windows | 13 | 17 | 30 | 43% |
OSX | 7 | 13 | 20 | 35% |
If we look at an entire test run, i.e. a full test matrix for a given run number, not a single one would have been successful.
Looking at the kinds of test failures, we see that three jobs on Windows failed due to the GH Actions test timeout of 120s, and two further Windows runs were cancelled by GitHub without further information. The timed-out runs do not have anything obvious in common; in fact, one of the three appears to have finished running the pytest suite but still timed out.
A modified test report is available here: https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093
Does this test include #6591?
1. I see that you ran on fjetter/distributed. I have a feeling that whenever I run on crusaderky/distributed I get a lot more random failures than on dask/distributed - likely because if you're in the free tier you have a higher chance of getting older/cheaper hardware?
2. I strongly suspect a lot of failures may be related to https://github.com/dask/distributed/pull/6271. How many of the failed tests were using nannies? How many were using more than 2 nannies? CI hosts have 2 CPUs each.
2b. I think we could figure out a straightforward way to mass-skip all tests decorated with `@gen_cluster(Worker=Nanny)` and see how many failures remain; a rough sketch follows below.
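For illustration, a minimal sketch of what such a mass-skip could look like, assuming a textual match on the decorator in the test source is good enough. The `pytest_collection_modifyitems` hook is standard pytest; everything else is hypothetical and not part of the distributed code base:

```python
# conftest.py -- rough sketch only: skip every collected test whose source
# contains the @gen_cluster(Worker=Nanny) decorator, to see how many
# failures remain once nanny-based tests are excluded.
import inspect

import pytest


def pytest_collection_modifyitems(config, items):
    skip_nanny = pytest.mark.skip(reason="temporarily skipping nanny-based tests")
    for item in items:
        func = getattr(item, "function", None)
        if func is None:
            continue
        try:
            src = inspect.getsource(func)
        except (OSError, TypeError):
            continue
        # crude textual match; assumes the decorator is spelled literally
        if "Worker=Nanny" in src:
            item.add_marker(skip_nanny)
```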
> Does this test include https://github.com/dask/distributed/pull/6591?

Yes.
> likely because if you're in the free tier you have a higher chance of getting older/cheaper hardware?

The dask project is also on the free tier, isn't it?
Most of the Windows tests are failing because of a disk permission problem during cleanup. @graingert suggested that using pytest fixtures instead of tempfile would help with this; a sketch of the swap follows below.
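For illustration, a minimal sketch of the suggested change, assuming a test that writes into a temporary directory (the test names and file contents here are made up):

```python
# Sketch only: the tempfile-based pattern deletes the directory during the
# test body / teardown and can hit PermissionError on Windows if a handle is
# still open, while pytest's tmp_path fixture defers cleanup to pytest.
import tempfile


def test_spill_dir_with_tempfile():
    # pattern that is prone to Windows cleanup errors
    with tempfile.TemporaryDirectory() as tmpdir:
        with open(f"{tmpdir}/data.bin", "wb") as f:
            f.write(b"payload")


def test_spill_dir_with_fixture(tmp_path):
    # pytest-managed equivalent: tmp_path is a unique pathlib.Path per test
    (tmp_path / "data.bin").write_bytes(b"payload")
```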
New update based on a8eb3b23b8fe91f52758db155e7151e3d516cbdc
using branch https://github.com/fjetter/distributed/tree/stress_ci
Again running 10 iterations; see https://github.com/fjetter/distributed/commit/3fe45b401b9d31c50cb646f06421e067a6657fab
Test reports and the summary were generated with code available here: https://github.com/fjetter/distributed/commit/6be77905ce4e880e084b8095657a026fda99d409
The test report is again available at https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093/a8eb3b23b8fe91f52758db155e7151e3d516cbdc.html
(OS, Python) | cancelled | failure | success | total | success_rate |
---|---|---|---|---|---|
('macos-latest', '3.10') | 0 | 7 | 3 | 10 | 0.3 |
('macos-latest', '3.8') | 0 | 10 | 0 | 10 | 0 |
('ubuntu-latest', '3.10') | 0 | 0 | 10 | 10 | 1 |
('ubuntu-latest', '3.8') | 0 | 2 | 8 | 10 | 0.8 |
('ubuntu-latest', '3.9') | 0 | 0 | 10 | 10 | 1 |
('windows-latest', '3.10') | 0 | 5 | 5 | 10 | 0.5 |
('windows-latest', '3.8') | 1 | 1 | 8 | 10 | 0.8 |
('windows-latest', '3.9') | 3 | 2 | 5 | 10 | 0.5 |
On average, every full test matrix included 2.7 failures; not a single full matrix was successful.
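For reference, a minimal sketch of how a summary table like the one above can be derived from per-job results with pandas. The records below are made up; the linked commit contains the actual report code:

```python
# Sketch only: aggregate GitHub Actions job conclusions into a
# per-(OS, Python) success-rate table similar to the one above.
import pandas as pd

# illustrative records; the real data comes from the GitHub Actions run
jobs = pd.DataFrame(
    [
        {"os": "windows-latest", "python": "3.9", "conclusion": "failure"},
        {"os": "windows-latest", "python": "3.9", "conclusion": "success"},
        {"os": "ubuntu-latest", "python": "3.10", "conclusion": "success"},
    ]
)

summary = (
    jobs.groupby(["os", "python"])["conclusion"]
    .value_counts()
    .unstack(fill_value=0)
    .reindex(columns=["cancelled", "failure", "success"], fill_value=0)
)
summary["total"] = summary.sum(axis=1)
summary["success_rate"] = summary["success"] / summary["total"]
print(summary)
```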
We can observe, again, a couple of systematic problems:

- CLI tests are timing out. They appear to try to connect to a scheduler that is not coming up. Judging from the logs, the scheduler is stuck somewhere in init.
- A couple of attempts to schedule new coroutines after our new AsyncTaskGroup has already closed, e.g. `distributed.deploy.tests.test_local.test_close_twice`.
- A couple of unclosed comms or unclosed clusters (likely a `with ...` missing in the individual tests); see the sketch after this list.
- A couple of leaked threads.
- Nanny startup failures due to Windows disk problems (e.g. `test_chaos_rechunk`, https://github.com/dask/distributed/issues/6641).
- Very likely still stuff I missed.
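As a sketch of the "missing with" pattern mentioned above (not taken from any specific failing test): wrapping cluster and client in context managers guarantees that comms, clusters, and their threads are closed even if the test body raises.

```python
# Sketch only: using async context managers so the cluster and client are
# always closed, avoiding unclosed-comm / unclosed-cluster warnings.
import asyncio

from distributed import Client, LocalCluster


async def main():
    async with LocalCluster(
        n_workers=1, processes=False, asynchronous=True, dashboard_address=":0"
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            result = await client.submit(lambda x: x + 1, 1)
            assert result == 2
    # everything is torn down here; nothing left to leak


if __name__ == "__main__":
    asyncio.run(main())
```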
New update based on https://github.com/dask/distributed/commit/e1b9e20fde946194858165a8b91cb94703c715aa
using branch https://github.com/fjetter/distributed/tree/stress_ci
High-level summary
(OS, Python) | failure | success | cancelled | total | success_rate |
---|---|---|---|---|---|
('macos-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
('macos-latest', '3.8') | 10 | 0 | 0 | 10 | 0 |
('ubuntu-latest', '3.10') | 3 | 7 | 0 | 10 | 0.7 |
('ubuntu-latest', '3.8') | 4 | 6 | 0 | 10 | 0.6 |
('ubuntu-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
('windows-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
('windows-latest', '3.8') | 0 | 10 | 0 | 10 | 1 |
('windows-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
We're already performing much better, with the exception of OSX py3.8, where not a single run was successful. Below are a few detailed reports about the kinds of errors encountered.
Detailed error report
This is a groupby on the truncated error message, used as a proxy for a fuzzy match; a sketch of the approach follows the table.
message_trunc | test | PR with possible fix |
---|---|---|
AssertionError: assert 3 == 1 | ['test_avoid_churn'] | |
AssertionError: assert False | ['test_restart_waits_for_new_workers'] | |
assert not b"Future excep | ['test_quiet_close_process[False]'] | https://github.com/dask/distributed/pull/6857 |
asyncio.exceptions.TimeoutErro | ['test_reconnect' 'test_shutdown_localcluster' 'test_wait_for_scheduler'] | |
failed on teardown with " | ['test_broken_worker' 'test_local_cluster_redundant_kwarg[True]'] | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
pytest.PytestUnraisableExcepti | ['test_local_client_warning' 'test_release_retry' 'test_client_cluster_synchronous' 'test_run_spec' 'test_dont_select_closed_worker'] | https://github.com/dask/distributed/pull/6865 or https://github.com/dask/distributed/pull/6863 |
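A minimal sketch of that fuzzy grouping, assuming the failures have already been collected into a DataFrame (the column names and records are illustrative):

```python
# Sketch only: group failures by the first ~30 characters of the error
# message, so messages that differ only in addresses, object ids, or
# trailing tracebacks collapse into the same bucket.
import pandas as pd

failures = pd.DataFrame(
    [
        {"test": "test_reconnect", "message": "asyncio.exceptions.TimeoutError"},
        {"test": "test_wait_for_scheduler", "message": "asyncio.exceptions.TimeoutError"},
        {"test": "test_avoid_churn", "message": "AssertionError: assert 3 == 1\n + where ..."},
    ]
)

failures["message_trunc"] = failures["message"].str.slice(0, 30)
report = failures.groupby("message_trunc")["test"].unique()
print(report)
```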
Full error messages
message_trunc | message | test |
---|---|---|
AssertionError: assert 3 == 1 | ['AssertionError: assert 3 == 1\n + where 3 = len(deque([(1660136972.0109015, {'n': 1, 'status': 'up'}), (1660136972.6130333, {'status': 'down', 'workers': [0]}), (1660136972.6989691, {'n': 1, 'status': 'up'})]))\n + where deque([(1660136972.0109015, {'n': 1, 'status': 'up'}), (1660136972.6130333, {'status': 'down', 'workers': [0]}), (1660136972.6989691, {'n': 1, 'status': 'up'})]) = <distributed.deploy.adaptive.Adaptive object at 0x000002420E9220A0>.log'] | ['test_avoid_churn'] |
AssertionError: assert False | ['AssertionError: assert False\n + where False = <bound method Set.isdisjoint of SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}))>(dict_keys(['tcp://127.0.0.1:33567', 'tcp://127.0.0.1:37071', 'tcp://127.0.0.1:38955', 'tcp://127.0.0.1:39951', 'tcp://127.0.0.1:46081']))\n + where <bound method Set.isdisjoint of SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}))> = SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})).isdisjoint\n + where SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})) = <bound method SortedDict.keys of SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})>()\n + where <bound method SortedDict.keys of SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 
'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})> = SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}).keys\n + where SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}) = <Scheduler 'tcp://127.0.0.1:37253', workers: 5, cores: 5, tasks: 0>.workers\n + and dict_keys(['tcp://127.0.0.1:33567', 'tcp://127.0.0.1:37071', 'tcp://127.0.0.1:38955', 'tcp://127.0.0.1:39951', 'tcp://127.0.0.1:46081']) = <built-in method keys of dict object at 0x7f139cf1f640>()\n + where <built-in method keys of dict object at 0x7f139cf1f640> = {'tcp://127.0.0.1:33567': <WorkerState 'tcp://127.0.0.1:33567', name: 2, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:37071': <WorkerState 'tcp://127.0.0.1:37071', name: 3, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 1, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:39951': <WorkerState 'tcp://127.0.0.1:39951', name: 0, status: closed, memory: 0, processing: 0>, ...}.keys'] | ['test_restart_waits_for_new_workers'] |
assert not b"Future excep | ['assert not b"Future exception was never retrieved\nfuture: <Future finished exception=TimeoutError('Timeout')>\ntornado.util.TimeoutError: Timeout\n"'] | ['test_quiet_close_process[False]'] |
asyncio.exceptions.TimeoutErro | ['asyncio.exceptions.TimeoutError'] | ['test_reconnect' 'test_shutdown_localcluster' 'test_wait_for_scheduler'] |
failed on teardown with " | ['failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7fe0dce7d0e0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f47a9344240>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f9e60358b40>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f139614b940>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7f3f7afa7c40>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited"'] | ['test_broken_worker' 'test_local_cluster_redundant_kwarg[True]'] |
pytest.PytestUnraisableExcepti | ['pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f776c0c2c00>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7ff32edb3b50>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f3c97bdccc0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7f3c988d12c0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f37f936aac0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited'] | ['test_local_client_warning' 'test_release_retry' 'test_client_cluster_synchronous' 'test_run_spec' 'test_dont_select_closed_worker'] |
The timeout errors in row three appear to throw a couple of `distributed.core.AsyncTaskGroupClosedError: Cannot schedule a new coroutine function as the group is already closed` errors.