Change push+ret in set_context to indirect jmp
This snippet of code in your set_context subroutine:
pushq %r8
xorl %eax, %eax
ret
should be changed to:
xorl %eax, %eax
jmp *%r8
The same change applies to swap_context.
Modern Intel and AMD CPU microarchitectures have a return stack buffer (RSB) that records the return address of each call, so the CPU can predict the target of the matching ret and keep speculating past it. Here the ret has no matching call, because its target was placed on the stack with pushq, so the RSB prediction is wrong. A mispredicted ret causes a guaranteed pipeline stall, which will seriously hurt your performance. By contrast, jmp *%r8 is predicted by the indirect branch predictor, which is likely to have a non-zero hit rate.
I can confirm that in my tests on an i5 650 (just swapping between two functions on one pinned thread and counting), jmp makes the entire function 50% faster.
https://blog.stuffedcow.net/2018/04/ras-microbenchmarks/
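For anyone who wants to try the comparison in isolation, here is a minimal sketch in C with GCC/Clang inline assembly, assuming x86-64 Linux and the System V ABI. It is not the benchmark behind the numbers above: the helper names (transfer_push_ret, transfer_indirect_jmp, ns_per_call) are hypothetical, and it only times the two control-transfer sequences to a local label rather than a real context switch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#if !defined(__x86_64__)
#error "This sketch uses x86-64 inline assembly."
#endif

/* Hypothetical helper: reach a local label with push+ret. */
static int transfer_push_ret(void) {
    int result;
    __asm__ volatile(
        "leaq -128(%%rsp), %%rsp\n\t" /* step over the red zone before pushing */
        "leaq 1f(%%rip), %%r8\n\t"    /* r8 = address of the target label */
        "pushq %%r8\n\t"
        "xorl %%eax, %%eax\n\t"
        "ret\n"                       /* pops the target; there is no matching call */
        "1:\n\t"
        "leaq 128(%%rsp), %%rsp\n\t"  /* restore the stack pointer */
        : "=a"(result)
        :
        : "r8", "cc", "memory");
    return result;
}

/* Hypothetical helper: reach the same kind of label with an indirect jmp. */
static int transfer_indirect_jmp(void) {
    int result;
    __asm__ volatile(
        "leaq 1f(%%rip), %%r8\n\t"    /* r8 = address of the target label */
        "xorl %%eax, %%eax\n\t"
        "jmp *%%r8\n"                 /* no stack traffic, no ret */
        "1:\n\t"
        : "=a"(result)
        :
        : "r8", "cc", "memory");
    return result;
}

/* Average wall-clock nanoseconds per call of fn over iters iterations. */
static double ns_per_call(int (*fn)(void), long iters) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sum += fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(sum == 0); /* both variants leave 0 in eax */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}
```

Calling ns_per_call(transfer_push_ret, n) and ns_per_call(transfer_indirect_jmp, n) from main with a large n, ideally on a pinned thread, gives two per-call figures to compare. The absolute numbers and the size of the gap will depend on the CPU and compiler flags.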