distributed
distributed copied to clipboard
Race condition in `SpecCluster.close` when closing while upscaling
There is a race condition in SpecCluster.close
that can lock potentially indefinitely if a cluster is closing while instances are spawned at the same time. It is not clear, yet, if this deadlock resolves itself given enough time.
This has been diagnosed as a root cause for some of the flaky tests, see https://github.com/dask/distributed/issues/4859#issuecomment-854705100
As a user I would expect an ongoing scale up attempt to be canceled during cluster closing and the closing to take care of cleaning up all already created instances.