UserFaultFD for much better page fault SMP scaling and delta snapshot changed pages tracking
This might be more relevant to me, but I know you also do soft-dirty tracking, which uffd replaces in a much nicer way, so I'm putting a bunch of resources here.
- Official docs:
- https://www.kernel.org/doc/html/latest/admin-guide/mm/userfaultfd.html
- https://man7.org/linux/man-pages/man2/userfaultfd.2.html
- https://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html
- Redhat presentation:
- https://www.linux-kvm.org/images/1/10/01Wed-1415-LinuxCON-aarcangeli-userfaultfd.pdf
- Annotated example program:
- https://blog.lizzie.io/using-userfaultfd.html
For my use case, apart from delta tracking, I think having all WAVM memory areas handled by a uffd that gets read from multiple threads should massively reduce kernel lock contention in heavily multithreaded, short-lived function instances. The uffd ioctls are much more efficient than mprotect calls because they don't take the process memory map lock for writing in the page fault handler. Making proper use of this will likely require changes to WAVM to hook into its memory management (where it does mmap/mprotect/munmap) and replace it with a faasm-controlled userspace system.
Interesting, thanks for sharing. Clearly this would yield some improvements, but is it possible to quantify how big those improvements would be? I see the word "massively" again, any idea how massively it would be massive 😄?
Obviously not saying it's not worth doing, just have to be tactical about when and what to change with limited resources.
I will be running some microbenchmarks soon to figure out how well it actually performs in the faasm use case, compared to what the authors claim.
Awesome, will be very interested to see the results.
Microbenchmark results:
For reference:
- BM_NoProtection - baseline with no memory protection at all; the full 4GiB of the 8GiB virtual allocation is PROT_READ|PROT_WRITE the whole time
- BM_FullNewMapping - mmap is called on the entire 4GiB range every iteration
- BM_ReuseMapping - what faasm does now (mostly; exactly what my fork does): map once, grow with mprotect, and shrink by mmap-ing small MAP_FIXED zero pages with PROT_NONE
- BM_ReuseAndDontneed - experiment using madvise(MADV_DONTNEED) instead of the mmap/mprotect overheads; can mostly be ignored
- BM_UFFD_Eager - UFFD + MADV_DONTNEED, restore snapshot with memcpy
- BM_UFFD_Lazy - UFFD + MADV_DONTNEED, restore snapshot on page faults
- BM_UFFD_SIGBUS - like above, but page faults raise SIGBUS, which is handled by the same thread, avoiding the two extra context switches between threads (like the fast case in https://xzpeter.org/userfaultfd-wp-latency-measurements/)
At 64 threads executing concurrently, the CPU utilization is:
- NoProtection: 97%
- FullNewMapping: 48%
- ReuseMapping: 58%
- UFFD Lazy: 90%
- UFFD Eager: 97%
- UFFD SIGBUS: 96%
The values are so low for FullNewMapping and ReuseMapping because of contention on the process memory map write semaphore, taken at https://code.woboq.org/linux/linux/mm/mprotect.c.html#484 and in similar places for mmap. madvise(MADV_DONTNEED), which evicts pages from the page table, and the UFFD population ioctls both take only a read lock, which can be shared between threads. Out-of-bounds accesses are still easily caught by the UFFD method by manually sending a SIGSEGV to the offending thread from the SIGBUS handler.
The overhead of UFFD is large, except when using SIGBUS instead of a separate handler thread: then it is only 5.8% for the single-threaded case in the microbenchmark, which does little besides page faulting. Lazy loading of snapshots via UFFDIO_COPY ioctls has a 37% overhead compared to the current solution's memcpy, so it is most likely not worth it unless snapshots become very sparse.
The full benchmark source is at https://github.com/auto-ndp/ndp-auto-offload/tree/main/faultspeed
Permalink at time of comment writing: https://github.com/auto-ndp/ndp-auto-offload/tree/3a0c66d47f7bded28b19584ecfc054b1a64f29ce/faultspeed
Full results:
AFAICT this was closed with faasm/faabric#232 (re-open if I am wrong)