[rfc] new scheduler
The current scheduler has the following problems:
1. it needs configuration (`workers`, `io_workers`, `run_on_io`): it would be nicer if the user did not need to configure anything and yet got maximum performance all the time
2. it has a global ready list: when scaling beyond a few cores this will become a bottleneck
3. the timer thread is also global: another point of contention
4. the event loop is on a separate thread

Both 3) and 4) cause unnecessary OS context switches whenever there is a timer expiry or I/O poll.
I propose the following design for a new scheduler, which I plan to implement. This is a request for comments. My understanding of may is not that deep, so it is possible that some things won't work :-)
- may has N schedulers (S), where N is the number of CPUs on the machine
- each S runs on its own kernel thread
- each S has a single-threaded eventloop (coros waiting for I/O) and a timerlist (coros waiting on a timer)
- each S has its own readylist, which contains coroutines ready to run (crossbeam-deque: work stealing)
- each S has its own yieldlist (coros that yielded and can resume immediately)
- may has a single list of parked threads (`parked`)
- may has a counter of stealing threads (`num_stealing`)
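To make the pieces concrete, here is a rough std-only sketch of the per-scheduler and global state described above. All names and types are illustrative, not may's actual ones; in particular a `Mutex<VecDeque>` stands in for crossbeam-deque's `Worker`/`Stealer` pair so the sketch compiles without external crates:

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, VecDeque};
use std::sync::atomic::AtomicUsize;
use std::sync::Mutex;
use std::thread::Thread;
use std::time::Instant;

// Placeholder for may's coroutine handle.
struct Coroutine;

// Per-scheduler state (one instance per CPU, pinned to one kernel thread).
struct Scheduler {
    id: usize,
    // Stand-in for crossbeam_deque::Worker; other schedulers would
    // hold the matching Stealer ends.
    readylist: Mutex<VecDeque<Coroutine>>,
    // Touched only by this scheduler's thread, so no locking needed.
    yieldlist: VecDeque<Coroutine>,
    // Min-heap of deadlines: Reverse makes the earliest expiry pop first.
    timerlist: BinaryHeap<Reverse<Instant>>,
    // eventloop: single-threaded poller (epoll/kqueue), omitted here.
}

// Global state shared by all N schedulers.
struct Shared {
    parked: Mutex<Vec<Thread>>, // threads waiting in thread::park()
    num_stealing: AtomicUsize,  // schedulers currently trying to steal
}

fn new_scheduler(id: usize) -> Scheduler {
    Scheduler {
        id,
        readylist: Mutex::new(VecDeque::new()),
        yieldlist: VecDeque::new(),
        timerlist: BinaryHeap::new(),
    }
}

fn main() {
    let sched = new_scheduler(0);
    assert!(sched.yieldlist.is_empty());
    println!("scheduler {} initialized", sched.id);
}
```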
The scheduling loop will look like this:
```rust
loop {
    // when the yieldlist is not empty, we have coroutines that yielded but
    // are ready to run, so the eventloop check must return as fast as possible
    let timeout = if !sched.yieldlist.is_empty() {
        0
    } else {
        sched.next_deadline() - now()
    };
    while let Some(co) = sched.yieldlist.pop_back() {
        sched.readylist.push(co);
    }
    // select() moves ready coroutines to the local readylist;
    // each ready coroutine is pushed to the front
    sched.eventloop.select(timeout);
    // if we have more than 1 coroutine ready to run, there are no threads
    // stealing, and we have parked threads, unpark one to increase parallelism
    if sched.readylist.len() > 1
        && num_stealing.load(Ordering::Acquire) == 0
        && !parked.is_empty()
    {
        if let Some(t) = parked.pop() {
            t.unpark();
        }
    }
    // run all coros until the readylist is empty
    while let Some(co) = sched.readylist.pop() {
        run_coroutine(co);
    }
    // we have CPU-bound coroutines that yielded; restart the loop to make
    // more coroutines ready, expire timers, etc.
    if !sched.yieldlist.is_empty() {
        continue;
    }
    // we have no ready coroutines to run. time to steal!
    assert!(sched.readylist.is_empty());
    num_stealing.fetch_add(1, Ordering::Release);
    // see implementation below
    if sched.steal() {
        num_stealing.fetch_sub(1, Ordering::Release);
        continue;
    }
    // we didn't manage to steal anything, which means there is nothing to do
    num_stealing.fetch_sub(1, Ordering::Release);
    parked.push(thread::current());
    thread::park();
}
```
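One detail worth pinning down in the `timeout` computation: `sched.next_deadline() - now()` can go negative when a timer has already expired, and there may be no timers at all. A std-only sketch of a hypothetical `select_timeout` helper (not may's API) covering the three cases, with the timerlist as a min-heap of deadlines:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

// Compute how long select() may block:
// - a non-empty yieldlist forces a zero timeout (poll and return),
// - otherwise block until the earliest timer deadline (saturating at zero
//   for already-expired timers),
// - with no timers at all, None means "block until I/O arrives".
fn select_timeout(
    yieldlist_empty: bool,
    timerlist: &BinaryHeap<Reverse<Instant>>,
) -> Option<Duration> {
    if !yieldlist_empty {
        return Some(Duration::ZERO);
    }
    timerlist
        .peek()
        .map(|Reverse(deadline)| deadline.saturating_duration_since(Instant::now()))
}

fn main() {
    let mut timers = BinaryHeap::new();
    // no timers, nothing yielded: block indefinitely on I/O
    assert_eq!(select_timeout(true, &timers), None);
    // an already-expired timer saturates to (near) zero
    timers.push(Reverse(Instant::now()));
    assert!(select_timeout(true, &timers).unwrap() <= Duration::from_millis(1));
    // yielded coroutines pending: just poll
    assert_eq!(select_timeout(false, &timers), Some(Duration::ZERO));
    println!("ok");
}
```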
To make stealing fast and avoid spurious `park()`/`unpark()`, we spin for a while trying to steal and then give up.
```rust
fn steal(&mut self) -> bool {
    let deadline = now() + 100ms; // needs tuning
    loop {
        let id = rand() % N; // random victim
        if id == self.id { // stealing from ourselves is silly :-p
            continue;
        }
        let stolen = loop {
            match schedules[id].readylist.steal() {
                Steal::Empty => break None,
                Steal::Data(co) => break Some(co),
                Steal::Retry => {}
            }
        };
        if let Some(co) = stolen {
            self.readylist.push(co);
            return true;
        }
        if now() > deadline {
            return false;
        }
    }
}
```
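The final `parked.push(thread::current()); thread::park()` step in the scheduling loop relies on `unpark()` being "sticky": a wakeup delivered between registering in `parked` and the actual `park()` call makes the subsequent `park()` return immediately, so the handoff cannot lose a wakeup. A small std-only demonstration (illustrative, not may's code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// An idle thread registers itself in a shared parked list and parks;
// a busy thread pops it and unparks it. Because unpark() is remembered
// if it arrives before park(), the ordering race is harmless.
fn handoff() -> &'static str {
    let parked: Arc<Mutex<Vec<thread::Thread>>> = Arc::new(Mutex::new(Vec::new()));

    let parked2 = Arc::clone(&parked);
    let idle = thread::spawn(move || {
        // nothing to steal: register ourselves, then park
        parked2.lock().unwrap().push(thread::current());
        thread::park(); // returns once a busy scheduler unparks us
        "woken"
    });

    // busy thread: wait until the idle thread has registered, then wake it
    loop {
        if let Some(t) = parked.lock().unwrap().pop() {
            t.unpark();
            break;
        }
        thread::sleep(Duration::from_millis(1));
    }

    idle.join().unwrap()
}

fn main() {
    assert_eq!(handoff(), "woken");
    println!("ok");
}
```

(Note that `park()` may also return spuriously, so the real scheduler should re-check its ready/steal conditions in a loop after waking.)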
thanks for the proposal! Here are some comments on your points.
> it needs configuration (workers, io_workers, run_on_io): it would be nicer if user does not need to configure anything and yet get maximum performance all the time

we still need configuration for the coroutine stack size, so we can't avoid configuration entirely. And it's not a bad thing that users can adjust and tune for their own applications.
> it has a global ready list: scaling to anything other than a few cores and this will become a bottleneck

yes, there is a contention point here and it does not scale well. I used to have a queue per worker thread, but the result was not as good as I expected. Each thread needs to steal from all the other worker threads' queues, which is a little time consuming, but maybe my implementation was not good enough. We can definitely improve here 😄
> timer thread is also global: another point of contention

the global timer thread just throws timed-out coroutines onto the ready list. It's not running any coroutines. The delay is not significant here since the coroutine has already timed out.
> event loop is on a separate thread. Both 3) and 4) cause unnecessary OS context switches whenever there is a timer expiry or I/O poll

the io worker thread has its own timeout checking, so it doesn't contend with any other threads. And it runs coroutines directly on its own instead of pushing them to the ready list, so there is no contention here.
It's good to have a more advanced scheduler. I'd like to see that happen; let's test and compare whether it's better.
FYI WIP here: https://github.com/alkis/may/tree/work-stealing-scheduler