distributed
distributed copied to clipboard
Root-task withholding without co-assignment
We had an early attempt to experiment with root-task withholding to address the problem of root-task-overproduction. Below a couple of links with additional information (non-exhaustive)
- https://github.com/dask/distributed/issues/6360
- https://github.com/dask/distributed/issues/5223
- https://github.com/dask/distributed/issues/5555
We started an experimentation trying to withhold worker assignment for root tasks, i.e. delay worker assignment scheduler side, see https://github.com/dask/distributed/issues/6560
Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at https://github.com/dask/distributed/pull/6614 (and should be ready to try for curious users)
Given that the current co-assignment logic has some significant shortcomings (e.g. https://github.com/dask/distributed/issues/6597) and the withholding of root-tasks appears to be sufficient to control our memory footprint (some experimentation on configuration is still required) we should get the root-task withhold logic in a production ready, i.e. merge-able state and get rid of the current co-assignment logic.
This should be verified by thorough performance benchmark results, for this, see https://github.com/coiled/coiled-runtime/issues/191 for work on automated benchmarks.
Once this is solid, we may consider adding a more robust co-assignment logic in a follow up step, if necessary.
AC
- The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
- The feature toggle is disabled by default
- There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job.
- A follow up ticket with an overview of all skipped tests is created
#6614 currently implements this behind a feature flag. When the feature flag is turned off (current default), scheduling logic stays as-is, not only keeping co-assignment, but even fixing https://github.com/dask/distributed/issues/6597.
For this ticket, is root task withholding by default the goal, or do we just want to get it in behind a feature flag?
I imagine performance benchmarks will be an important part of answering this question, as well community input. But there's a also the question getting the entire test suite to pass under a new scheduling approach, and whether that's in scope or should be a follow-up task.
Reopening, since these still need to happen:
- [ ] There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job. https://github.com/dask/distributed/pull/6989
- [x] A follow up ticket with an overview of all skipped tests is created: https://github.com/dask/distributed/issues/6998