qthreads
Performance comparison of Aarch64 context switch
Do a timing test to see the performance difference between using Aarch64 native context switching assembly vs the ucontext.h library.
Process I follow:
- Make change to the config/qthread_check_swapcontext.m4 file
- Clean, configure, make, and install library
- Check library using objdump -d <library file> | grep qt_get for appropriate functions based on using or not using library
- Compile stress test with make task_spawn inside the tests/stress folder. The executable is placed inside the .libs folder.
- Do 3 timing tests
- Paste picture
Branch: PRTestBranchDec2021
Inside the file config/qthread_check_swapcontext.m4, the parameter qt_host_based_enable_fastcontext selects the implementation: the assembly if the value is set to 'yes', and the ucontext.h library if the value is set to 'no'.
Config:
To test for native ARMv8, search the library for qt_get. The search term getcontext applies when using the ucontext.h library:
Yes - this should apply equally to x86 and Power ISA.
@olivier-snl @janciesko any particular timing test to use?
Yes - for context switching, I'd use something like stress/taskspawn.c. Here, it might make sense to add an inner loop over a variable number of yields to control how many yields or context switches we do per task. Multiple context switches per task would hide task creation overheads.
For locking, I would probably start with qthreads/test/stress/lock_acq_rel.c (PRTestBranchDec). It might make sense to parametrize a) the number of locks and b) the distance between contended locks. That way we benchmark the locking implementation itself rather than cache coherence.
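The inner-loop idea can be sketched outside qthreads with plain ucontext.h. The code below is not stress/taskspawn.c and does not use the qthreads API. The names time_yields, task_body, yields_done, and TASK_STACK_SIZE are hypothetical, and the 64 KiB stack size is an arbitrary assumption. One task is created once and then yields back to its caller a configurable number of times, so the per-task setup cost is amortized over many context switches.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <time.h>
#include <ucontext.h>

#define TASK_STACK_SIZE (64 * 1024) /* assumed stack size for the sketch */

static ucontext_t main_ctx, task_ctx;
unsigned long yields_done; /* counts yields performed by the task */

/* Task body: yield back to the caller n_yields times, then return. */
static void task_body(int n_yields) {
    for (int i = 0; i < n_yields; i++) {
        yields_done++;
        swapcontext(&task_ctx, &main_ctx); /* yield */
    }
}

/* Run one task that yields n_yields times.
 * Returns elapsed wall-clock seconds, or -1.0 on failure. */
double time_yields(int n_yields) {
    void *stack = malloc(TASK_STACK_SIZE);
    if (stack == NULL) return -1.0;

    yields_done = 0;
    if (getcontext(&task_ctx) != 0) {
        free(stack);
        return -1.0;
    }
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = TASK_STACK_SIZE;
    task_ctx.uc_link = &main_ctx; /* resume main_ctx when task_body returns */
    makecontext(&task_ctx, (void (*)(void))task_body, 1, n_yields);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* n_yields resumes are answered by a yield; the last one lets the task finish. */
    for (int i = 0; i <= n_yields; i++) {
        swapcontext(&main_ctx, &task_ctx);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(stack);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Calling time_yields with, say, 1, 10, and 1000 yields and dividing the elapsed time by the number of switches would show how quickly the creation overhead stops dominating. A qthreads version would replace the swapcontext calls with the library's own fork and yield calls.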
Chris, if you have a moment, could you take a look at https://github.com/pmodels/argobots/tree/main/src/arch/fcontext. I'd be interested to know how this asm implementation compares to ours, just ideologically.
@olivier-snl passed that repo along to me. From what I can tell, they don't save as many general-purpose registers, but they do save some floating-point registers. They also use the stack pointer for storing state, while our code uses (I believe) malloc-created pointers. So their code appears to swap stack frames and preserve certain registers across the swap; it does not appear to be a full context switch. We are doing a full context switch, but don't preserve any FP registers.
Aarch64 ABI: https://developer.arm.com/documentation/ihi0055/latest
ARMv8 context switching: https://developer.arm.com/documentation/den0024/a/The-Memory-Management-Unit/Context-switching
ucontext.h library implementation: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/aarch64/
@janciesko @olivier-snl Looking at the implementation, I notice the ucontext library does not appear to save registers X0 - X17. Not sure why.
another repo using native context switch: https://github.com/kaniini/libucontext/tree/master/arch/aarch64
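For reference on the register question: under the AArch64 procedure call standard linked above, a function call only has to preserve x19-x29, SP, and the low 64 bits of v8-v15. The remaining general-purpose registers are caller-saved or reserved, so a cooperative switch that is entered through an ordinary function call can leave them out. The struct below is a hypothetical layout written for this comparison; it is not the qthreads, glibc, or Argobots context structure, and the type and field names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal save area for a cooperative AArch64 switch.
 * Only registers that AAPCS64 requires a callee to preserve are listed. */
typedef struct {
    uint64_t x19_x28[10]; /* callee-saved general-purpose registers x19..x28 */
    uint64_t fp;          /* x29, frame pointer */
    uint64_t lr;          /* x30, link register: address the context resumes at */
    uint64_t sp;          /* stack pointer of the suspended context */
    uint64_t d8_d15[8];   /* low 64 bits of v8..v15, callee-saved FP/SIMD */
} aarch64_min_context_t;

/* Size of the save area in bytes: 21 slots of 8 bytes each. */
size_t aarch64_min_context_bytes(void) {
    return sizeof(aarch64_min_context_t);
}
```

A switch routine that saves more than this is doing extra work per switch; one that saves less, for example by skipping d8-d15, is only safe if no task keeps live values in those registers across a yield.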
After changing the line in the qthread_check_swapcontext.m4 file, I run:
Using ucontext library:
Using native ARMv8 code:
@janciesko @olivier-snl It seems the ARMv8 code can be faster, but has a higher deviation. This was on the login node, so I may need to run it on a compute node.
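To make "faster but higher deviation" comparable across runs, the timings from each configuration can be reduced to a mean and a sample variance. This is a minimal sketch; the function names sample_mean and sample_variance are hypothetical, and no measured values from this issue are assumed. The square root of the variance gives the standard deviation.

```c
#include <stddef.h>

/* Arithmetic mean of n timings; returns 0.0 for an empty input. */
double sample_mean(const double *t, size_t n) {
    if (n == 0) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += t[i];
    return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1); returns 0.0 when n < 2. */
double sample_variance(const double *t, size_t n) {
    if (n < 2) return 0.0;
    double mean = sample_mean(t, n);
    double sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = t[i] - mean;
        sq += d * d;
    }
    return sq / (double)(n - 1);
}
```

With only three runs per configuration the variance estimate is itself noisy, so more repetitions on a quiet compute node would give a firmer comparison.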
I merged 'main' into the 'fast-context' branch. I've run a clean build and finished with a make check. All tests pass except the qutil test. However, even that test passes sometimes, so there may be a minor bug, but I'm not sure. I've run it about 20 times, and whenever a test failed, it was always qutil.
I'll try to reproduce.
I added an image above to show my environment.