distributed icon indicating copy to clipboard operation
distributed copied to clipboard

Root-task withholding without co-assignment

Open fjetter opened this issue 2 years ago • 1 comments

We had an early attempt to experiment with root-task withholding to address the problem of root-task-overproduction. Below a couple of links with additional information (non-exhaustive)

  • https://github.com/dask/distributed/issues/6360
  • https://github.com/dask/distributed/issues/5223
  • https://github.com/dask/distributed/issues/5555

We started an experimentation trying to withhold worker assignment for root tasks, i.e. delay worker assignment scheduler side, see https://github.com/dask/distributed/issues/6560

Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at https://github.com/dask/distributed/pull/6614 (and should be ready to try for curious users)

Given that the current co-assignment logic has some significant shortcomings (e.g. https://github.com/dask/distributed/issues/6597) and the withholding of root-tasks appears to be sufficient to control our memory footprint (some experimentation on configuration is still required) we should get the root-task withhold logic in a production ready, i.e. merge-able state and get rid of the current co-assignment logic.

This should be verified by thorough performance benchmark results, for this, see https://github.com/coiled/coiled-runtime/issues/191 for work on automated benchmarks.

Once this is solid, we may consider adding a more robust co-assignment logic in a follow up step, if necessary.

AC

  • The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
  • The feature toggle is disabled by default
  • There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job.
  • A follow up ticket with an overview of all skipped tests is created

fjetter avatar Jun 24 '22 16:06 fjetter

#6614 currently implements this behind a feature flag. When the feature flag is turned off (current default), scheduling logic stays as-is, not only keeping co-assignment, but even fixing https://github.com/dask/distributed/issues/6597.

For this ticket, is root task withholding by default the goal, or do we just want to get it in behind a feature flag?

I imagine performance benchmarks will be an important part of answering this question, as well community input. But there's a also the question getting the entire test suite to pass under a new scheduling approach, and whether that's in scope or should be a follow-up task.

gjoseph92 avatar Jun 24 '22 16:06 gjoseph92

Reopening, since these still need to happen:

  • [ ] There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job. https://github.com/dask/distributed/pull/6989
  • [x] A follow up ticket with an overview of all skipped tests is created: https://github.com/dask/distributed/issues/6998

gjoseph92 avatar Aug 31 '22 15:08 gjoseph92