
Determine work assignment and data movement based on runtime-collected metrics

jpsamaroo opened this issue 5 years ago

While round-robin work assignment is fine when first launching work without any prior knowledge, it is less efficient when individual work items vary widely in duration and the data being moved varies in size. We already have the infrastructure built into Dagger to monitor work and data-movement latencies; we just need to teach the scheduler how to use this information to its benefit.
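As a purely illustrative sketch (not Dagger's actual internals), the kind of bookkeeping this implies is a per-task-kind history of observed runtimes and input transfer sizes, from which the scheduler can derive cost estimates. All names below (`TaskMetrics`, `record!`, `est_runtime`) are hypothetical:

```julia
# Hypothetical metrics store: task kind => observed runtimes and bytes moved.
struct TaskMetrics
    runtimes::Dict{Symbol,Vector{Float64}}   # task kind => observed runtimes (seconds)
    xfer_bytes::Dict{Symbol,Vector{Float64}} # task kind => observed input bytes moved
end
TaskMetrics() = TaskMetrics(Dict{Symbol,Vector{Float64}}(), Dict{Symbol,Vector{Float64}}())

function record!(m::TaskMetrics, kind::Symbol; runtime::Float64, bytes::Float64)
    push!(get!(m.runtimes, kind, Float64[]), runtime)
    push!(get!(m.xfer_bytes, kind, Float64[]), bytes)
    return m
end

# Mean-of-history estimators, with fallbacks for task kinds never seen before.
mean_or(v, default) = isempty(v) ? default : sum(v) / length(v)
est_runtime(m::TaskMetrics, kind) = mean_or(get(m.runtimes, kind, Float64[]), 1.0)
est_bytes(m::TaskMetrics, kind)   = mean_or(get(m.xfer_bytes, kind, Float64[]), 0.0)
```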

I believe that we could use a simple runtime-derived cost model, plus a numerical optimizer, to allow the scheduler to make better decisions. We can also add information about processor hierarchies to further refine the model, capturing latencies due to memory transfers between levels of that hierarchy (e.g. NUMA domains, CPU-GPU transfers, disk-backed access latency, etc.).
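A minimal sketch of such a cost model, assuming a greedy per-task decision rather than a full numerical optimization, and using hypothetical names (`Proc`, `speed`, `bandwidth`, `best_proc`) that are not Dagger's API: the estimated wall time for running a task on a candidate processor is its compute time plus the time to move each input from where it currently lives.

```julia
# Hypothetical processor description: relative compute speed plus per-link bandwidth.
struct Proc
    id::Int
    speed::Float64                 # relative compute throughput (work units / second)
    bandwidth::Dict{Int,Float64}   # source proc id => bytes/second into this proc
end

# Estimated wall time: compute term + sum of input transfer terms.
function cost(p::Proc, est_work::Float64, input_bytes::Dict{Int,Float64})
    compute = est_work / p.speed
    transfer = isempty(input_bytes) ? 0.0 :
        sum(bytes / get(p.bandwidth, src, 1e9) for (src, bytes) in input_bytes)
    return compute + transfer
end

# Greedy decision: assign the task to the cheapest processor right now.
best_proc(procs, est_work, input_bytes) =
    argmin(p -> cost(p, est_work, input_bytes), procs)

# Example: p2 computes twice as fast, but its link to the data's current home
# (proc 0) is 100x slower, so the transfer term dominates and p1 wins.
p1 = Proc(1, 1.0, Dict(0 => 1e9))
p2 = Proc(2, 2.0, Dict(0 => 1e7))
best_proc([p1, p2], 1e9, Dict(0 => 5e8))   # returns p1
```

A hierarchy-aware refinement would simply make `bandwidth` (and a latency term) depend on which levels of the memory hierarchy the transfer crosses.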

jpsamaroo · May 06 '20

Recent work by @stevengj might be useful here: https://arxiv.org/abs/2003.04287

ChrisRackauckas · May 14 '20

This was implemented a while back, so closing.

jpsamaroo · Feb 25 '23