[PAL] Use non-encrypted pipes for `libos_pollable_event`

Open dimakuv opened this issue 2 years ago • 3 comments

Description of the feature

Gramine has a struct libos_pollable_event:

  • https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/include/libos_pollable_event.h
  • https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c

This struct is currently used only for threads and the IPC worker:

  • https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/bookkeep/libos_thread.c#L62
  • https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_async.c#L39

This object is used purely for wait/wakeup signaling: threads can sleep waiting on such events (e.g. a thread blocked in epoll may wait until this event is triggered). It does not transfer any data.

This object is currently implemented as PAL pipes: https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c#L16
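
To make the mechanism concrete, here is a minimal sketch of a pipe-backed wakeup event. It assumes plain POSIX pipe() and poll() rather than Gramine's PAL API, and struct pipe_event and the pipe_event_* helpers are hypothetical names for illustration only; they are not taken from libos_pollable_event.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a pollable event: a pipe used only as a wakeup channel. */
struct pipe_event {
    int read_fd;  /* waiters poll() this end */
    int write_fd; /* wakers write one byte to this end */
};

/* Returns 0 on success, -1 on failure (errno is set by pipe()). */
int pipe_event_create(struct pipe_event* ev) {
    assert(ev != NULL);
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    ev->read_fd  = fds[0];
    ev->write_fd = fds[1];
    return 0;
}

/* Fire the event. The byte's value is irrelevant; only its presence matters. */
int pipe_event_set(struct pipe_event* ev) {
    char c = 0;
    ssize_t n;
    do {
        n = write(ev->write_fd, &c, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

/* Wait up to timeout_ms (negative means forever); true if the event has fired. */
bool pipe_event_wait(struct pipe_event* ev, int timeout_ms) {
    struct pollfd pfd = {.fd = ev->read_fd, .events = POLLIN};
    int ret;
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);
    return ret == 1 && (pfd.revents & POLLIN);
}

/* Consume one pending wakeup byte. Blocks if no wakeup is pending. */
int pipe_event_clear(struct pipe_event* ev) {
    char c;
    ssize_t n;
    do {
        n = read(ev->read_fd, &c, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? 0 : -1;
}

void pipe_event_destroy(struct pipe_event* ev) {
    close(ev->read_fd);
    close(ev->write_fd);
}
```

Because the event is just a file descriptor, it can sit in the same poll/epoll set as sockets and other handles. The byte written carries no information, which is why encrypting the channel buys little here.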

This implementation is expensive under the following conditions:

  1. The app creates and destroys many threads during its execution (this was noticed on some AI/ML Python workloads).
  2. The app runs under SGX (in this case, PAL pipes are TLS-encrypted and require an expensive TLS handshake at each thread creation).

Proposal

  • Create a micro-benchmark that creates several threads many-many times.
  • Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.
  • If the above is true, come up with a performance fix, something like this:
    • Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted.
    • Do not use TLS encryption for such typed pipes under Linux-SGX PAL.
    • Analyze security of this fix (I think there are no security implications, as no data is transferred on such pipes).
  • Test the micro-benchmark and make sure the fix brings significant performance benefit.
  • Test Python AI/ML workloads (or any other workloads known to create bunches of threads over and over again -- maybe Go apps?).

dimakuv, May 22 '23 10:05

Maybe an important historic note: I originally wanted to reimplement libos_pollable_event as a completely in-enclave shared-memory LibOS-only object. In other words, libos_pollable_event would not use PAL pipes, or anything from PAL at all (well, maybe PAL futexes for sleep/wakeup).

But on closer inspection, this turned out to be wrong/impossible. That's because these events are used in epoll (polling mechanism) alongside other events, so we would have to implement a corner case of epoll checking the LibOS-only state of such pollable events. This would be too cumbersome and would probably perform badly.

So it's easier and cleaner to outsource the whole "when to sleep on the event and when to fire the event" logic to the host side, through PAL pipes.

dimakuv, May 22 '23 11:05

Create a micro-benchmark that creates several threads many-many times.

Approach:

  • creating as many batches of threads as possible (current batch size: 100 threads) within a fixed benchmark duration (currently set to 5 s);
  • recording the CPU cycles spent at each level of the call chain libos_syscall_clone() -> get_new_thread() -> create_pollable_event() -> TLS handshake related ops (e.g., _PalStreamSecureInit(), _PalThreadCreate(thread_handshake_func), pipe_session_key());
  • calculating the proportions of cycles spent:
    • proportion1 = create_pollable_event() / get_new_thread()
    • proportion2 = create_pollable_event() / libos_syscall_clone()
    • and proportion3 = TLS handshake related ops / libos_syscall_clone()
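
The approach above could be approximated from the application side with a sketch like the following. It is not the benchmark used for the numbers in this thread: it assumes POSIX threads, measures wall-clock nanoseconds with clock_gettime() instead of CPU cycles, and times pthread_create() from the app rather than instrumenting libos_syscall_clone() inside Gramine. All names (run_batch, run_benchmark, bench_result, proportion_percent) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define BATCH_SIZE 100 /* threads per batch, as in the approach above */

struct bench_result {
    uint64_t batches;   /* number of completed batches */
    uint64_t create_ns; /* time spent inside pthread_create() calls */
    uint64_t total_ns;  /* total benchmark run time */
};

static void* noop_thread(void* arg) {
    return arg; /* the thread does no work; only its creation is measured */
}

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Creates and joins one batch, adding the time spent in pthread_create()
 * to *create_ns. Returns 0 on success or the pthread_create() error code. */
static int run_batch(uint64_t* create_ns) {
    pthread_t threads[BATCH_SIZE];
    int created = 0;
    int ret = 0;
    for (; created < BATCH_SIZE; created++) {
        uint64_t start = now_ns();
        ret = pthread_create(&threads[created], NULL, noop_thread, NULL);
        *create_ns += now_ns() - start;
        if (ret != 0)
            break;
    }
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    return ret;
}

/* Runs batches back to back until duration_ns has elapsed. */
int run_benchmark(uint64_t duration_ns, struct bench_result* res) {
    assert(res != NULL);
    res->batches = 0;
    res->create_ns = 0;
    uint64_t start = now_ns();
    do {
        if (run_batch(&res->create_ns) != 0)
            return -1;
        res->batches++;
    } while (now_ns() - start < duration_ns);
    res->total_ns = now_ns() - start;
    return 0;
}

/* Share of `whole` taken by `part`, in percent (0 if `whole` is 0). */
double proportion_percent(uint64_t part, uint64_t whole) {
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}
```

Calling run_benchmark(5ull * 1000000000ull, &res) would match the 5 s duration mentioned above; proportion_percent(res.create_ns, res.total_ns) then gives the share of time spent in thread creation.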

Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.

Test setup: commit: ede508c, release build

Some preliminary results on Gramine-SGX (average based on 40 test batches, unit: cycles):

| Handshake ops | create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 | proportion3 |
|---|---|---|---|---|---|---|
| 2,052,777 | 2,262,290 | 2,279,296 | 2,871,136 | 99.25% | 78.79% | 71.50% |

For comparison, results on Gramine-direct (average based on 156 test batches, unit: cycles):

| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 120,799 | 265,390 | 753,470 | 45.52% | 16.03% |

As a reasonably conservative estimate, the proposed optimization should reduce the thread creation overhead by >50% compared to the current Gramine-SGX numbers, given that TLS handshake related ops account for 71.50% of libos_syscall_clone() there (proportion3).

kailun-qin, May 23 '23 04:05

If the above is true, come up with a performance fix, something like this:

  • Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted.
  • Do not use TLS encryption for such typed pipes under Linux-SGX PAL.

Draft implementation: https://github.com/gramineproject/gramine/pull/1371

Test the micro-benchmark and make sure the fix brings significant performance benefit.

Test setup: commit: ad393992a30b1383da1b017e4a6ccfa7d9feb27d, release build

Optimized results w/ PR https://github.com/gramineproject/gramine/pull/1371 (average based on 291 test batches, unit: cycles):

| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 121,254 | 132,965 | 394,642 | 91.19% | 30.73% |

For comparison (statistics obtained from the same setup):

  • Results on Gramine-SGX w/o optimization (average based on 63 test batches, unit: cycles):

| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 1,590,937 | 1,601,655 | 1,851,699 | 99.33% | 85.92% |

  • Results on Gramine-direct (average based on 506 test batches, unit: cycles):

| create_pollable_event | get_new_thread | libos_syscall_clone | proportion1 | proportion2 |
|---|---|---|---|---|
| 42,446 | 96,964 | 230,311 | 43.78% | 18.43% |

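From the above stats, the fix brings a substantial benefit on Gramine-SGX: libos_syscall_clone() drops from 1,851,699 to 394,642 cycles (about a 79% reduction), and create_pollable_event() drops from 1,590,937 to 121,254 cycles (about a 92% reduction).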
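
The relative gains can be recomputed from the reported averages with a small helper. This is only arithmetic on the numbers in the tables above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage of `before` that was saved by going down to `after`. */
double reduction_percent(double before, double after) {
    assert(before > 0);
    return 100.0 * (before - after) / before;
}

/* How many times faster `after` is compared to `before`. */
double speedup(double before, double after) {
    assert(after > 0);
    return before / after;
}

/* Prints the Gramine-SGX gains using the averages (in cycles) reported
 * for the same setup, without and with the optimization. */
void print_sgx_gains(void) {
    const double clone_before = 1851699, clone_after = 394642;
    const double event_before = 1590937, event_after = 121254;

    printf("libos_syscall_clone:   %.1f%% fewer cycles, %.2fx faster\n",
           reduction_percent(clone_before, clone_after),
           speedup(clone_before, clone_after));
    printf("create_pollable_event: %.1f%% fewer cycles, %.2fx faster\n",
           reduction_percent(event_before, event_after),
           speedup(event_before, event_after));
}
```

The same helpers can be pointed at the Gramine-direct row to see how much of the gap between SGX and direct mode remains after the fix.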

kailun-qin, May 24 '23 14:05