[RFC] Improve the design of context switch
Currently, fiber switching in Crystal is done by invoking Fiber.swapcontext. However, it is hard to stay thread-safe during a switch and to return a dead fiber's stack to the stack pool. I have a better idea for handling these problems.
Split Fiber.swapcontext into suspend_func and resume_func
With Fiber.swapcontext, the fiber stack and code segment look like this:
Fiber Stack
hi | : |
| : |
| : |
| : |
+---------------------------+
| return address |
+---------------------------+
| |
| stored registers |
| |
+---------------------------+ <---- context.stack_top
| : |
| : |
lo | : |
Fiber.swapcontext:
[save register]
[swap stack pointer register]
<---- context switch point
[restore registers]
[return]
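For reference, a swap routine of this shape can be written with Crystal's inline assembly. The following is a simplified x86-64 (System V) sketch for illustration only, not the exact code in src/fiber.cr; it saves just the callee-saved registers and swaps the stack pointers:

```crystal
# Simplified sketch of a swapcontext-style routine for x86-64/System V.
# NOT the exact upstream implementation: closure registers, other
# architectures, and alignment details are omitted.
@[NoInline]
@[Naked]
def swapcontext(current : Void**, to : Void**) : Nil
  asm("
    pushq %rbx        # save the callee-saved registers
    pushq %rbp        # onto the current fiber's stack
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, ($0)   # record our stack top in `current`
    movq ($1), %rsp   # adopt the target fiber's stack
    popq %r15         # restore the target's registers
    popq %r14
    popq %r13
    popq %r12
    popq %rbp
    popq %rbx"
    :: "r"(current), "r"(to))
end
```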
With suspend_func and resume_func, the stack and code segment become the following:
Fiber Stack
hi | : |
| : |
| : |
| : |
+---------------------------+
| return address |
+---------------------------+
| |
| stored registers |
| |
+---------------------------+
| resume_func address |
+---------------------------+ <---- context.stack_top
| : |
| : |
lo | : |
suspend_func:
[save registers]
[save resume_func address]
[swap stack pointer register]
<---- context switch point
[return]
resume_func:
[restore registers]
[return]
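To make the proposed layout concrete, here is a pure-Crystal model of seeding a stack so that the `ret` executed after a swap lands in resume_func. The buffer, the register count, and `fake_resume_func` are illustrative stand-ins rather than real Fiber internals; Crystal's Fiber#initialize already seeds a brand-new stack in a similar way, so that the very first resume returns into fiber_main:

```crystal
# Illustrative model only: a plain heap buffer stands in for a fiber
# stack, and a Proc pointer stands in for the real resume_func symbol.
STACK_SIZE = 8 * 1024

fake_resume_func = ->{ puts "restore registers, then return" }

memory = Pointer(UInt8).malloc(STACK_SIZE)
sp = (memory + STACK_SIZE).as(Void**) # stacks grow downward

# suspend_func first pushes the callee-saved registers...
sp -= 6 # rbx, rbp, r12, r13, r14, r15 on x86-64
# ...and then pushes the address of resume_func, so that the `ret`
# executed right after the stack swap jumps straight into resume_func:
sp -= 1
sp.value = fake_resume_func.pointer

stack_top = sp.as(Void*) # what would be saved as context.stack_top
```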
Benefits
Although each context switch costs two extra instructions, this layout lets us guarantee that certain operations run automatically right after the switch. For example, a mutex can be unlocked immediately after switching away from the fiber that holds it.
code segment:
mutex.lock
target = get_targeted_fiber
target.add_unlocker(mutex) ---------+
target.switch |
|
|
|
target fiber |
|
hi | : | | | : |
| : | | | : |
| : | | +---------------------------+
| : | | | return address |
+---------------------------+ | +---------------------------+
| return address | | | stored registers |
+---------------------------+ | +---------------------------+
| | V | resume_func address |
| stored registers |------------>+---------------------------+
| | | unlock function |
+---------------------------+ +---------------------------+
| resume_func address | | mutex address |
+---------------------------+ +---------------------------+
| : | | pop rdi then ret(x86-64) |
| : | +---------------------------+
lo | : | | : |
| : | | : |
After context switching, the mutex will be unlocked automatically.
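For comparison, the same handoff can be prototyped in plain Crystal as a one-shot resume hook; `add_unlocker`, `@on_resume`, and `run_resume_hook` below are hypothetical names for this sketch, not part of Fiber's actual API:

```crystal
# Hypothetical sketch: reopening Fiber with a one-shot resume hook.
class Fiber
  @on_resume : Proc(Nil)?

  # Ask the target fiber to unlock `mutex` right after it is resumed.
  def add_unlocker(mutex : Mutex) : Nil
    @on_resume = ->{ mutex.unlock }
  end

  # The scheduler would call this on the current fiber immediately
  # after every context switch returns.
  protected def run_resume_hook : Nil
    if hook = @on_resume
      @on_resume = nil
      hook.call
    end
  end
end
```

The asm-based variant above achieves the same effect without the scheduler having to remember to run the hook after every switch.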
Conclusion
This is a tiny change, but a useful one.
> it is hard to stay thread-safe during a switch and to return a dead fiber's stack to the stack pool.

Thanks for the detailed solution, but could you detail the problem too?

AFAIK, executing a callback when resuming a fiber can be done by the scheduler after a full context switch, all in Crystal rather than in ASM (which must be repeated for each target architecture).
> it is hard to stay thread-safe during a switch and to return a dead fiber's stack to the stack pool.
>
> Thanks for the detailed solution, but could you detail the problem too?
If the scheduler is work-stealing or work-sharing, the current Channel and Mutex will not be thread-safe. Because their internal locks are released before the fiber actually blocks, another thread can resume the fiber while the first thread is still running it, so two threads may end up running the same fiber at once. Hence, passing the unlock operation to the next fiber is a solution. (Go resolves this problem in a different way.)

When a fiber reaches its end, we can also pass the cleanup operation to the next fiber, so the dead fiber's stack can be returned to the stack pool and reused sooner.
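A sketch of the race, with invented placeholder names (`waiters` and `Scheduler.reschedule` stand in for a channel's wait queue and the scheduler entry point):

```crystal
# Hypothetical sketch of the unsafe window in the current pattern.
lock = Mutex.new
waiters = [] of Fiber

lock.lock
waiters << Fiber.current # fiber is now visible to other threads...
lock.unlock              # ...and nothing protects it anymore, so a
                         # second thread can resume it right here,
Scheduler.reschedule     # before this thread has switched away:
                         # two threads then run the same fiber at once.
```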
> AFAIK, executing a callback when resuming a fiber can be done by the scheduler after a full context switch, all in Crystal rather than in ASM (which must be repeated for each target architecture).
Yes, that can do the same thing. I didn't choose callbacks because code like the one below confuses me: there is no visible lock operation in the same file, yet it unlocks the lock.
https://github.com/crystal-lang/crystal/blob/9623919d7ddc61908ab9c10a38b9d9510b43caef/src/fiber.cr#L85-L88