UserFaultFD for much better page fault SMP scaling and delta snapshot changed pages tracking
This might be more relevant to me, but I know you also do soft-dirty tracking, which uffd replaces in a much nicer way, so I'm putting a bunch of resources here.
- Official docs:
- https://www.kernel.org/doc/html/latest/admin-guide/mm/userfaultfd.html
- https://man7.org/linux/man-pages/man2/userfaultfd.2.html
- https://man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html
- Redhat presentation:
- https://www.linux-kvm.org/images/1/10/01Wed-1415-LinuxCON-aarcangeli-userfaultfd.pdf
- Annotated example program:
- https://blog.lizzie.io/using-userfaultfd.html
For my use case, apart from delta tracking, I think having all WAVM memory areas handled by a uffd that gets read from multiple threads should massively reduce kernel lock contention in heavily multithreaded, short-lived function instances. The uffd ioctls are much more efficient than mprotect calls because they don't take the process memory map lock for writing in the page fault handler. Making proper use of this will likely require changes to WAVM to hook into its memory management (where it does mmap/mprotect/munmap) and replace it with a faasm-controlled userspace system.
Interesting, thanks for sharing. Clearly this would yield some improvements, but is it possible to quantify how big those improvements would be? I see the word "massively" again, any idea how massively it would be massive 😄?
Obviously not saying it's not worth doing, just have to be tactical about when and what to change with limited resources.
I will be running some microbenchmarks soon to figure out how well it actually performs in the faasm use case, compared to what the authors claim.
Awesome, will be very interested to see the results.
Microbenchmark results:
For reference:
- BM_NoProtection - baseline with no memory protection at all; the full 4GiB of the 8GiB virtual allocation is PROT_READ|PROT_WRITE the whole time
- BM_FullNewMapping - mmap is called on the entire 4GiB range every iteration
- BM_ReuseMapping - what faasm does now (mostly; exactly what my fork does): map once, grow with mprotect, and shrink by mmap-ing small MAP_FIXED zero pages with PROT_NONE
- BM_ReuseAndDontneed - experiment using madvise(MADV_DONTNEED) instead of the mmap/mprotect overheads; can mostly be ignored
- BM_UFFD_Eager - UFFD + MADV_DONTNEED, restore snapshot with memcpy
- BM_UFFD_Lazy - UFFD + MADV_DONTNEED, restore snapshot on page faults
- BM_UFFD_SIGBUS - like above, but page faults raise SIGBUS, which is handled by the same thread, avoiding the two extra context switches between threads (like the fast case in https://xzpeter.org/userfaultfd-wp-latency-measurements/)
At 64 threads executing concurrently, the CPU utilization is:
- NoProtection: 97%
- FullNewMapping: 48%
- ReuseMapping: 58%
- UFFD Lazy: 90%
- UFFD Eager: 97%
- UFFD SIGBUS: 96%
The values are so low for FullNewMapping and ReuseMapping because of contention on the process memory map write semaphore, taken at https://code.woboq.org/linux/linux/mm/mprotect.c.html#484 and in similar places for mmap. madvise(MADV_DONTNEED), which evicts pages from the page table, and the UFFD population ioctls both take only a read lock, which can be shared between threads. Out-of-bounds accesses are still easily caught by the UFFD method by manually sending a SIGSEGV to the offending thread from the SIGBUS handler.
The overhead of UFFD is large, except when using SIGBUS instead of a separate handler thread: then it is only 5.8% for the single-threaded case in the microbenchmark, which does little besides page faulting. Lazy loading of snapshots via UFFDIO_COPY ioctls has a 37% overhead compared to the current solution's memcpy, so it is most likely not worth it unless snapshots become very sparse.
The full benchmark source is at https://github.com/auto-ndp/ndp-auto-offload/tree/main/faultspeed
Permalink at time of comment writing: https://github.com/auto-ndp/ndp-auto-offload/tree/3a0c66d47f7bded28b19584ecfc054b1a64f29ce/faultspeed
Full results:
AFAICT this was closed with faasm/faabric#232 (re-open if I am wrong)