gramine [PAL] Use non-encrypted pipes for `libos_pollable_event`

Description of the feature

Gramine has a struct libos_pollable_event:

https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/include/libos_pollable_event.h
https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c

This struct is currently used only for threads and the IPC worker:

https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/bookkeep/libos_thread.c#L62
https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_async.c#L39

This object is used purely for wait/wakeup signaling (so that threads can sleep waiting on such events, e.g. a thread may wait in epoll until this event is triggered). This object does not transfer any data.

This object is currently implemented as PAL pipes: https://github.com/gramineproject/gramine/blob/ede508c69217477c1cd6fdb3e7689da824ba4ea7/libos/src/libos_pollable_event.c#L16
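
To make the "pipe as a wait/wakeup primitive" idea concrete, here is a minimal sketch using a plain POSIX pipe and `poll()`. It is not Gramine's code: the struct and the `event_*` function names are hypothetical, and a raw host pipe stands in for the PAL pipe. The only point is that the single byte written is a dummy, so the pipe carries a signal and no data.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for a pollable event: a non-blocking host pipe. */
struct pollable_event {
    int read_fd;  /* polled by waiters */
    int write_fd; /* written by the thread that fires the event */
};

/* Create the event; returns 0 on success, -1 on error. */
static int event_create(struct pollable_event* ev) {
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    /* Make both ends non-blocking so that set/clear never sleep. */
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    ev->read_fd  = fds[0];
    ev->write_fd = fds[1];
    return 0;
}

/* Fire the event: one dummy byte makes read_fd readable. */
static int event_set(struct pollable_event* ev) {
    char c = 0;
    ssize_t n;
    do {
        n = write(ev->write_fd, &c, 1);
    } while (n < 0 && errno == EINTR);
    /* EAGAIN means the pipe is full, i.e. the event is already set. */
    return (n == 1 || (n < 0 && errno == EAGAIN)) ? 0 : -1;
}

/* Reset the event by draining all pending dummy bytes. */
static void event_clear(struct pollable_event* ev) {
    char buf[64];
    while (read(ev->read_fd, buf, sizeof(buf)) > 0)
        ;
}

/* Wait up to timeout_ms: 1 if the event is set, 0 on timeout, -1 on error. */
static int event_wait(struct pollable_event* ev, int timeout_ms) {
    struct pollfd pfd = {.fd = ev->read_fd, .events = POLLIN};
    int ret;
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);
    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Release both pipe ends. */
static void event_destroy(struct pollable_event* ev) {
    close(ev->read_fd);
    close(ev->write_fd);
}
```

Because `read_fd` is an ordinary file descriptor, it can be placed in the same `poll()`/epoll set as sockets and other handles. In Gramine the pipe is a PAL pipe rather than a raw host pipe.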

This implementation is expensive under the following conditions:

- The app creates and destroys many threads during its execution (this was noticed on some AI/ML Python workloads).
- The app runs under SGX (in this case, PAL pipes are TLS-encrypted and require an expensive TLS handshake at each thread creation).

Proposal

Create a micro-benchmark that creates several threads many-many times.
Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.
If the above is true, come up with a performance fix, something like this:
- Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted.
- Do not use TLS encryption for such typed pipes under Linux-SGX PAL.
- Analyze security of this fix (I think there are no security implications, as no data is transferred on such pipes).
Test the micro-benchmark and make sure the fix brings significant performance benefit.
Test Python AI/ML workloads (or any other workloads known to create bunches of threads over and over again -- maybe Go apps?).
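
As a rough illustration of the "new URI prefix" variant of the proposed fix, the sketch below parses a pipe URI and decides whether the TLS handshake would be needed. Everything here is an assumption made for illustration: the `pipe.plain:` prefix, the `pipe_config` struct and both function names are hypothetical and are not taken from Gramine's PAL.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical URI prefixes, for illustration only. */
#define URI_PREFIX_PIPE       "pipe:"
#define URI_PREFIX_PIPE_PLAIN "pipe.plain:"

/* Hypothetical parsed form of a pipe URI. */
struct pipe_config {
    const char* name; /* part of the URI after the prefix */
    bool encrypted;   /* false for pipes explicitly marked as unencrypted */
};

/* Parse a pipe URI; returns 0 on success, -1 if the prefix is unknown. */
static int parse_pipe_uri(const char* uri, struct pipe_config* out) {
    size_t plain_len = strlen(URI_PREFIX_PIPE_PLAIN);
    size_t pipe_len  = strlen(URI_PREFIX_PIPE);

    if (strncmp(uri, URI_PREFIX_PIPE_PLAIN, plain_len) == 0) {
        out->name      = uri + plain_len;
        out->encrypted = false;
        return 0;
    }
    if (strncmp(uri, URI_PREFIX_PIPE, pipe_len) == 0) {
        out->name      = uri + pipe_len;
        out->encrypted = true;
        return 0;
    }
    return -1;
}

/* Only the SGX PAL encrypts pipes, and only those not marked as unencrypted. */
static bool needs_tls_handshake(const struct pipe_config* cfg, bool is_sgx_pal) {
    return is_sgx_pal && cfg->encrypted;
}
```

With this shape, the default stays encrypted and a caller has to opt out explicitly, which keeps the security analysis focused on the few call sites that request an unencrypted pipe.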

May 22 '23 10:05 dimakuv

Maybe an important historic note: I originally wanted to reimplement libos_pollable_event as a completely in-enclave shared-memory LibOS-only object. In other words, libos_pollable_event would not use PAL pipes, or anything from PAL at all (well, maybe PAL futexes for sleep/wakeup).

But on further inspection, this turned out to be wrong/impossible. That's because these events are used in epoll (polling mechanism) alongside other events, so we would have to implement a corner case of epoll checking the LibOS-only state of such pollable events. This would be too cumbersome and would probably perform badly.

So it's easier and cleaner to outsource the whole "when to sleep on the event and when to fire the event" logic to the host side, through PAL pipes.

May 22 '23 11:05 dimakuv

Create a micro-benchmark that creates several threads many-many times.

Approach:

creating as many batches of threads as possible (current batch size: 100 threads) within a fixed benchmark duration (currently set to 5 s);
recording the CPU cycles spent in libos_syscall_clone() -> get_new_thread() -> create_pollable_event() -> TLS-handshake-related ops (e.g., _PalStreamSecureInit(), _PalThreadCreate(thread_handshake_func), pipe_session_key()) respectively;
calculating the proportions of cycles spent:
- proportion1 = create_pollable_event() / get_new_thread()
- proportion2 = create_pollable_event() / libos_syscall_clone()
- and proportion3 = TLS-handshake-related ops / libos_syscall_clone()
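
The application side of such a benchmark could look roughly like the sketch below. It is an assumption-laden outline, not the benchmark actually used: it only creates and joins batches of threads until a deadline passes and counts the completed batches, whereas the per-function cycle counts above were recorded inside Gramine itself. The names `run_benchmark` and `proportion_pct` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define BATCH_SIZE 100 /* threads per batch, as in the described setup */

/* Thread body: exits immediately, so thread creation dominates the cost. */
static void* thread_func(void* arg) {
    return arg;
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Create and join whole batches until duration_sec has elapsed.
 * Returns the number of completed batches, or -1 if thread creation fails. */
static long run_benchmark(double duration_sec) {
    pthread_t threads[BATCH_SIZE];
    long batches = 0;
    double deadline = now_sec() + duration_sec;

    while (now_sec() < deadline) {
        for (int i = 0; i < BATCH_SIZE; i++) {
            if (pthread_create(&threads[i], NULL, thread_func, NULL) != 0) {
                /* Join the threads that were already created, then bail out. */
                for (int j = 0; j < i; j++)
                    pthread_join(threads[j], NULL);
                return -1;
            }
        }
        for (int i = 0; i < BATCH_SIZE; i++)
            pthread_join(threads[i], NULL);
        batches++;
    }
    return batches;
}

/* Share of `part` in `whole`, in percent (how proportion1..3 are defined). */
static double proportion_pct(unsigned long part, unsigned long whole) {
    return 100.0 * (double)part / (double)whole;
}
```

Calling `run_benchmark(5.0)` would match the 5 s duration mentioned above; the number of batches it returns corresponds to the "test batches" over which the cycle counts are averaged.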

Analyze its SGX performance, make sure that the bottleneck is in the thread-creation (TLS handshake) phase.

Test setup: commit: ede508c, release build

Some preliminary results on Gramine-SGX (average based on 40 test batches, unit: cycles):

Handshake ops	create_pollable_event	get_new_thread	libos_syscall_clone	proportion1	proportion2	proportion3
2052777	2262290	2279296	2871136	99.25%	78.79%	71.50%

For comparison, results on Gramine-direct (average based on 156 test batches, unit: cycles):

create_pollable_event	get_new_thread	libos_syscall_clone	proportion1	proportion2
120799	265390	753470	45.52%	16.03%

As a reasonably conservative estimate, the proposed optimization should reduce the thread-creation overhead by >50% compared to the current stats on Gramine-SGX, given that the handshake-related ops alone account for 71.50% of libos_syscall_clone() (proportion3).

May 23 '23 04:05 kailun-qin

If the above is true, come up with a performance fix, something like this: Introduce a new type of PAL pipes, or a new argument-option, or a new URI prefix: it will mark the pipe as unencrypted. Do not use TLS encryption for such typed pipes under Linux-SGX PAL.

Draft implementation: https://github.com/gramineproject/gramine/pull/1371

Test the micro-benchmark and make sure the fix brings significant performance benefit.

Test setup: commit: ad393992a30b1383da1b017e4a6ccfa7d9feb27d, release build

Optimized results w/ PR https://github.com/gramineproject/gramine/pull/1371 (average based on 291 test batches, unit: cycles):

create_pollable_event	get_new_thread	libos_syscall_clone	proportion1	proportion2
121,254	132,965	394,642	91.19%	30.73%

For comparison (statistics obtained from the same setup):

Results on Gramine-SGX w/o optimization (average based on 63 test batches, unit: cycles):

create_pollable_event	get_new_thread	libos_syscall_clone	proportion1	proportion2
1,590,937	1,601,655	1,851,699	99.33%	85.92%

Results on Gramine-direct (average based on 506 test batches, unit: cycles):

create_pollable_event	get_new_thread	libos_syscall_clone	proportion1	proportion2
42,446	96,964	230,311	43.78%	18.43%

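From the above stats, the fix brings a substantial benefit on Gramine-SGX: libos_syscall_clone() drops from 1,851,699 to 394,642 cycles (about 4.7x faster), and create_pollable_event() drops from 1,590,937 to 121,254 cycles (about 13x faster).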

May 24 '23 14:05 kailun-qin
