Frequently failing CI test `distributed/tests/test_tls_functional.py::test_nanny`

Open jacobtomlinson opened this issue 7 months ago • 1 comments

It looks like `distributed/tests/test_tls_functional.py::test_nanny - asyncio.exceptions.TimeoutError` is coming up a lot in CI.

https://github.com/dask/distributed/actions/runs/14521875247/job/40744515411
https://github.com/dask/distributed/actions/runs/14531807123/job/40772785073

jacobtomlinson avatar Apr 22 '25 10:04 jacobtomlinson

I briefly peeked into the issue.

  1. The timeout error is a red herring. The nanny isn't coming up, and the gen_cluster fixture keeps retrying for a minute. After that minute a timeout is raised, i.e. by this point it has tried to start the cluster 60 times without success.
  2. It looks like the scheduler is receiving connections from the workers and is even registering them. At this point the scheduler has already performed the handshake, i.e. bidirectional communication is possible.
  3. As soon as the scheduler starts listening on the comm again (i.e. await comm.read()) for incoming messages, the connection is immediately closed. I'm running a CI job right now to get some info about the exception.
  4. This then tears down the cluster, and the cluster startup is retried until we time out.
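
To illustrate why the timeout hides the real failure, here is a minimal sketch of the retry-until-timeout pattern described above. It is not the actual gen_cluster code: `ClusterStartupError`, `start_cluster`, `start_with_retries`, and the shortened timings are all hypothetical stand-ins chosen for illustration.

```python
import asyncio


class ClusterStartupError(Exception):
    """Hypothetical stand-in for whatever error closes the comm."""


async def start_cluster(attempts):
    # Hypothetical startup that always fails, mimicking a connection
    # that is closed right after the handshake.
    attempts.append(1)
    raise ClusterStartupError("comm closed right after handshake")


async def start_with_retries(attempts, retry_delay):
    # Retry forever; the underlying error is swallowed on every attempt.
    while True:
        try:
            return await start_cluster(attempts)
        except ClusterStartupError:
            await asyncio.sleep(retry_delay)


async def run(timeout=0.5, retry_delay=0.05):
    attempts = []
    try:
        # The outer timeout is the only thing that ever ends the loop.
        await asyncio.wait_for(start_with_retries(attempts, retry_delay), timeout)
    except asyncio.TimeoutError as exc:
        # Only the timeout surfaces; the startup error is never reported.
        return len(attempts), type(exc).__name__


attempts, error_name = asyncio.run(run())
print(f"{attempts} failed attempts, surfaced as {error_name}")
```

In this sketch the reported exception carries no trace of the startup error, which is why logging the swallowed exception inside the retry loop is the more useful place to look.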

From what I can tell, these two tests are the only ones that exercise Nanny + TLS.

fjetter avatar Apr 30 '25 09:04 fjetter