distributed
distributed copied to clipboard
Remove idle and saturated sets from scheduler
This is an attempt at removing the idle
/saturated
/idle_task_count
containers on the scheduler.
- They are expensive to maintain. They inherently rely on occupancy which is a quantity that cannot be accurately maintained in constant time. We therefore fell back a while ago to decompose occupancy into various quantities that can be maintained online but the occupancy itself has to be computed on demand (scaling linearily with the number of task prefixes). This can be expensive considering how often
check_idle_saturated
has to be evaluated. - Their definition is arguably fragile and hard to grasp. Particularly since they rely on the also quite fragile quantity
occupancy
which historically has put network load on equal footing to compute load which caused plenty of confusion in the past. It also makes reliable testing notoriously difficult - Ultimately, these sets are just maintained for performance reasons. Apart from performance optimization, their only functional purpose is to control the color of the worker bars. Saturated workers (a state that is incredibly difficult to attain) are colored yellow. That's it.
While ripping this out I simplified the stealing code and eradicated a couple of sources of non-determinism
The most important change that is also causing some of the tests to fail is that in the current iteration I chose to define possible thieves on the basis of the number of threads that are available. Previously, all idle classified workers, i.e. all workers with less than half of average occupancy, were considered possible thieves. Therefore, this stealing code is much, much more conservative. That is much more reliable and predictable but is also much less aggressive and cannot enforce absolute homogeneity.
The lack of determinism often originate from the usage of sets or the lack of tie-breakers when sorting. I believe that even without removing the idle/saturated sets it makes sense to remove those sources of non-determinism. Particularly, since this can subtly also affect global task ordering.
I still have to run some actual tests on this. So far this is all rather theoretical