Frequently failing CI test `distributed/tests/test_tls_functional.py::test_nanny`
It looks like distributed/tests/test_tls_functional.py::test_nanny - asyncio.exceptions.TimeoutError is coming up a lot in CI.
https://github.com/dask/distributed/actions/runs/14521875247/job/40744515411 https://github.com/dask/distributed/actions/runs/14531807123/job/40772785073
I briefly peeked into the issue.
- The timeout error is a red herring. The nanny isn't coming up and the `gen_cluster` fixture is retrying for a minute. After the minute, a timeout is raised, i.e. it tried starting the cluster 60 times without success by that point.
- It looks like the scheduler is receiving a connection from the workers and is even registering them. At this point the scheduler has already performed the handshake, i.e. bidirectional communication is possible.
- As soon as the scheduler starts listening to the comm again (i.e. `await comm.read()`) for any incoming messages, the connection is immediately closed. I'm running a CI job right now to get some info about the exception.
- This then tears down the cluster, and the cluster startup is retried until we time out.
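To illustrate why the timeout is a red herring, here is a minimal sketch of a retry loop wrapped in a timeout. It is not the actual `gen_cluster` code: `CommClosedError`, `start_cluster`, `start_with_retries`, and `run` are hypothetical stand-ins, and the delays are shortened so it finishes quickly. The point is only that the test reports `TimeoutError`, while the real failure happens on every individual attempt.

```python
import asyncio


class CommClosedError(Exception):
    """Hypothetical stand-in for the real connection failure."""


async def start_cluster():
    # Hypothetical startup that always fails, as if the connection
    # were closed right after the handshake.
    raise CommClosedError("connection closed while reading from comm")


async def start_with_retries(attempts, retry_delay):
    # Retry until startup succeeds, recording each underlying error.
    while True:
        try:
            return await start_cluster()
        except CommClosedError as exc:
            attempts.append(exc)
            await asyncio.sleep(retry_delay)


async def run(timeout=0.2, retry_delay=0.05):
    attempts = []
    try:
        # The outer timeout is what the test finally reports.
        await asyncio.wait_for(start_with_retries(attempts, retry_delay), timeout)
    except asyncio.TimeoutError as exc:
        return exc, attempts
    return None, attempts


def demo():
    # Returns the reported error plus the list of hidden per-attempt errors.
    return asyncio.run(run())
```

Calling `demo()` returns a `TimeoutError` together with the list of per-attempt errors, which is where the actual cause would have to be read from (in CI, presumably the logs of the failed startup attempts).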
From what I can tell, these two tests are the only tests that test Nanny + TLS