
Race condition in `SpecCluster.close` when closing while upscaling

Open fjetter opened this issue 2 years ago • 0 comments

There is a race condition in `SpecCluster.close` that can block, potentially indefinitely, if a cluster is closed while instances are still being spawned. It is not yet clear whether this deadlock resolves itself given enough time.

This has been diagnosed as a root cause of some of the flaky tests; see https://github.com/dask/distributed/issues/4859#issuecomment-854705100

As a user, I would expect an ongoing scale-up attempt to be canceled when the cluster is closed, and I would expect closing to clean up every instance that has already been created.
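To make the expectation concrete, here is a minimal sketch of the behavior in plain `asyncio`. It does not use or reflect the actual `SpecCluster` implementation: `ToyCluster`, `ToyWorker`, and all of their methods and attributes are hypothetical names invented for illustration. The idea is only that `close` first cancels any in-flight scale-up, waits for that cancellation to finish, and then closes every instance that was registered up to that point.

```python
import asyncio


class ToyWorker:
    """Hypothetical stand-in for a spawned instance (not a distributed class)."""

    def __init__(self, name):
        self.name = name
        self.status = "created"

    async def start(self):
        await asyncio.sleep(0.05)  # simulate a slow startup
        self.status = "running"

    async def close(self):
        self.status = "closed"


class ToyCluster:
    """Hypothetical cluster whose close() cancels an ongoing scale-up."""

    def __init__(self):
        self.workers = {}
        self._scale_task = None
        self._closing = False

    def scale(self, n):
        # Refuse new scale-up requests once closing has begun.
        if self._closing:
            raise RuntimeError("cluster is closing")
        self._scale_task = asyncio.create_task(self._scale_up(n))
        return self._scale_task

    async def _scale_up(self, n):
        for i in range(n):
            worker = ToyWorker(f"worker-{i}")
            # Register before starting so close() can always find the instance,
            # even if startup is interrupted by cancellation.
            self.workers[worker.name] = worker
            await worker.start()

    async def close(self):
        self._closing = True
        task = self._scale_task
        if task is not None and not task.done():
            task.cancel()
            # Wait until the cancellation has actually been processed.
            await asyncio.gather(task, return_exceptions=True)
        # Clean up everything that was created, started or not.
        await asyncio.gather(*(w.close() for w in self.workers.values()))


async def demo():
    cluster = ToyCluster()
    cluster.scale(5)
    await asyncio.sleep(0.08)  # let the scale-up get partway through
    # The timeout guards against the kind of hang described in this issue.
    await asyncio.wait_for(cluster.close(), timeout=1)
    return cluster
```

Running `asyncio.run(demo())` should return promptly with only some of the five requested workers created and all of them in the `"closed"` state. Registering each worker before awaiting its startup is the design choice that lets `close` find partially started instances.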

fjetter, Feb 28 '22 15:02