[PAL] Use non-encrypted pipes for `libos_pollable_event`
Description of the feature
Gramine has a struct libos_pollable_event:
- https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/include/libos_pollable_event.h
- https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c
This struct is currently used only for threads and IPC worker:
- https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/bookkeep/libos_thread.c#L62
- https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_async.c#L39
This object is used purely for wait/wakeup signaling, so that threads can sleep waiting on such events (e.g. a thread may wait in epoll until this event is triggered). This object does not transfer any data.
This object is currently implemented as PAL pipes: https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c#L16
This implementation is expensive under the following conditions:
- The app creates and destroys many threads during its execution (this was noticed on some AI/ML Python workloads).
- The app runs under SGX (in this case, PAL pipes are TLS-encrypted and require an expensive TLS handshake at each thread creation).
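To illustrate the wait/wakeup pattern described above, here is a minimal sketch of a pollable event backed by a plain POSIX pipe. This is not Gramine's code: the `pipe_event` struct and the `pipe_event_*` functions are hypothetical names, and the sketch uses host `pipe()`/`poll()` directly instead of PAL pipes, so it does not model the TLS layer that Linux-SGX PAL adds. It only shows that the byte written to the pipe carries no information; it exists solely to make the read end readable.

```c
/* Hypothetical sketch, NOT Gramine's implementation: a wait/wakeup event on a POSIX pipe. */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

struct pipe_event {
    int read_fd;  /* waiters poll this end */
    int write_fd; /* signalers write a single byte to this end */
};

static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -errno;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -errno : 0;
}

int pipe_event_create(struct pipe_event* ev) {
    int fds[2];
    if (pipe(fds) < 0)
        return -errno;
    /* Non-blocking ends: clearing an unset event or setting a "full" one never blocks. */
    int ret = set_nonblocking(fds[0]);
    if (ret == 0)
        ret = set_nonblocking(fds[1]);
    if (ret < 0) {
        close(fds[0]);
        close(fds[1]);
        return ret;
    }
    ev->read_fd = fds[0];
    ev->write_fd = fds[1];
    return 0;
}

/* Trigger the event. The byte value is irrelevant; no data is transferred. */
int pipe_event_set(struct pipe_event* ev) {
    char byte = 0;
    ssize_t n;
    do {
        n = write(ev->write_fd, &byte, 1);
    } while (n < 0 && errno == EINTR);
    if (n < 0 && errno != EAGAIN)
        return -errno;
    return 0; /* EAGAIN means the pipe is full, i.e. the event is already set */
}

/* Reset the event by draining all pending bytes. */
void pipe_event_clear(struct pipe_event* ev) {
    char buf[64];
    while (read(ev->read_fd, buf, sizeof(buf)) > 0) {
        /* keep draining */
    }
}

/* Wait up to timeout_ms (-1 = forever) and report whether the event is set. */
bool pipe_event_wait(struct pipe_event* ev, int timeout_ms) {
    struct pollfd pfd = { .fd = ev->read_fd, .events = POLLIN };
    int ret;
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);
    return ret == 1 && (pfd.revents & POLLIN);
}

void pipe_event_destroy(struct pipe_event* ev) {
    close(ev->read_fd);
    close(ev->write_fd);
}
```

Because the read end is an ordinary file descriptor, it can be passed to `poll()`/`epoll` together with other descriptors, which is the property the pipe-based design relies on.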
Proposal
- Create a micro-benchmark that creates several threads many-many times.
- Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.
- If the above is true, come up with a performance fix, something like this:
  - Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted.
  - Do not use TLS encryption for such typed pipes under Linux-SGX PAL.
- Analyze security of this fix (I think there are no security implications, as no data is transferred on such pipes).
- Test the micro-benchmark and make sure the fix brings significant performance benefit.
- Test Python AI/ML workloads (or any other workloads known to create bunches of threads over and over again -- maybe Go apps?).
Maybe an important historic note: I originally wanted to reimplement libos_pollable_event as a completely in-enclave shared-memory LibOS-only object. In other words, libos_pollable_event would not use PAL pipes, or anything from PAL at all (well, maybe PAL futexes for sleep/wakeup).
But under further inspection, this turned out to be wrong/impossible. That's because these events are used in epoll (polling mechanism) alongside other events, so we would have to implement a corner case of epoll checking LibOS-only state of such pollable events. This would be too cumbersome and probably with bad performance.
So it's easier and cleaner to outsource the whole "when to sleep on the event and when to fire the event" logic to the host side, through PAL pipes.
Create a micro-benchmark that creates several threads many-many times.
Approach:
- creating as many batches of threads as possible (current batch size: 100 threads) within a fixed benchmark duration (currently set to 5s);
- recording the CPU cycles spent in `libos_syscall_clone()` -> `get_new_thread()` -> `create_pollable_event()` -> TLS-handshake-related ops (e.g., `_PalStreamSecureInit()`, `_PalThreadCreate(thread_handshake_func)`, `pipe_session_key()`), respectively;
- calculating the proportions of spent cycles:
  - `proportion1 = create_pollable_event() / get_new_thread()`
  - `proportion2 = create_pollable_event() / libos_syscall_clone()`
  - `proportion3 = TLS handshake related ops / libos_syscall_clone()`
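The user-space side of such a benchmark could look roughly like the sketch below. It is an assumption about the shape of the test program, not the actual benchmark used for these measurements: `run_thread_batches`, `noop_thread` and `MAX_BATCH` are hypothetical names, and the cycle counting itself happens inside Gramine, which this sketch does not show.

```c
/* Hypothetical sketch of the benchmark loop: create and join batches of threads until time is up. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define MAX_BATCH 1024 /* illustrative upper bound on the batch size */

/* Threads do no work: only creation/teardown cost is of interest. */
static void* noop_thread(void* arg) {
    return arg;
}

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns the number of completed batches, or -1 on invalid input or thread-creation failure. */
long run_thread_batches(double duration_s, int batch_size) {
    if (batch_size <= 0 || batch_size > MAX_BATCH)
        return -1;

    pthread_t threads[MAX_BATCH];
    long batches = 0;
    double deadline = now_seconds() + duration_s;

    while (now_seconds() < deadline) {
        int created = 0;
        for (; created < batch_size; created++) {
            if (pthread_create(&threads[created], NULL, noop_thread, NULL) != 0)
                break;
        }
        /* Join whatever was created, even if the batch is incomplete. */
        for (int i = 0; i < created; i++)
            pthread_join(threads[i], NULL);
        if (created != batch_size)
            return -1;
        batches++;
    }
    return batches;
}
```

Calling `run_thread_batches(5.0, 100)` corresponds to the parameters listed above (100 threads per batch, 5s duration); the returned batch count is what the "average based on N test batches" figures would refer to under this assumption.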
Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.
Test setup: commit: ede508c, release build
Some preliminary results on Gramine-SGX (average based on 40 test batches, unit: cycles):
| Handshake ops | create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 | proportion3 |
|---|---|---|---|---|---|---|
| 2052777 | 2262290 | 2279296 | 2871136 | 99.25% | 78.79% | 71.50% |
For comparison, results on Gramine-direct (average based on 156 test batches, unit: cycles):
| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 120799 | 265390 | 753470 | 45.52% | 16.03% |
As a reasonably conservative estimate, the proposed optimization should reduce the thread-creation overhead by >50% (compared to the current stats on Gramine-SGX).
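As a sanity check on this estimate, the proportions can be recomputed from the raw cycle counts in the Gramine-SGX table. The helpers below are illustrative only (`percent_of` and `cycles_without` are hypothetical names); the input numbers are the ones from the table above.

```c
/* Illustrative helpers for re-deriving the proportions from raw cycle counts. */
#include <stdint.h>

/* Share of `part` in `whole`, expressed as a percentage. */
double percent_of(uint64_t part, uint64_t whole) {
    return whole == 0 ? 0.0 : 100.0 * (double)part / (double)whole;
}

/* Cycles left if `removed` cycles disappeared entirely from `total`. */
uint64_t cycles_without(uint64_t total, uint64_t removed) {
    return removed > total ? 0 : total - removed;
}
```

With the Gramine-SGX numbers, `percent_of(2052777, 2871136)` is about 71.5% (proportion3), and `cycles_without(2871136, 2052777)` leaves 818,359 cycles per `libos_syscall_clone()`. So even if only part of the handshake cost goes away, a >50% reduction leaves some margin.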
If the above is true, come up with a performance fix, something like this: Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted. Do not use TLS encryption for such typed pipes under Linux-SGX PAL.
Draft implementation: https://github.com/gramineproject/gramine/pull/1371
Test the micro-benchmark and make sure the fix brings significant performance benefit.
Test setup: commit: ad393992a30b1383da1b017e4a6ccfa7d9feb27d, release build
Optimized results w/ PR https://github.com/gramineproject/gramine/pull/1371 (average based on 291 test batches, unit: cycles):
| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 121,254 | 132,965 | 394,642 | 91.19% | 30.73% |
For comparison (statistics obtained from the same setup):
- Results on Gramine-SGX w/o optimization (average based on 63 test batches, unit: cycles):
| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 1,590,937 | 1,601,655 | 1,851,699 | 99.33% | 85.92% |
- Results on Gramine-direct (average based on 506 test batches, unit: cycles):
| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 42,446 | 96,964 | 230,311 | 43.78% | 18.43% |
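From the above stats, the fix brings a significant benefit on Gramine-SGX: `create_pollable_event` drops from 1,590,937 to 121,254 cycles (about 13x), and `libos_syscall_clone` drops from 1,851,699 to 394,642 cycles (about 4.7x, i.e. a ~79% reduction).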