RFC: SGX2 support in Gramine (Phase 1)
The SGX1 instruction set requires all enclave memory to be committed at enclave build time. It also requires the developer to predict and use maximum heap and stack sizes in the enclave build. Likewise, additional code modules cannot be dynamically loaded into the enclave environment after enclave build. This increases enclave build time and limits the enclave's ability to adapt to changing workloads.
Additionally, page protections cannot be changed for enclave memory. Executable code containing relocations must be loaded as Read, Write, and Execute (RWX) and remain that way for the life of the enclave. This also limits the capabilities of garbage collectors and dynamic translators or just-in-time (JIT) compilers within the enclave.
The SGX2 instruction set was designed to overcome these limitations. The SGX2 extensions give software the ability to dynamically add and remove pages from an enclave and to manage the attributes of enclave pages.
This RFC focuses on adding support for two key features:
- Enclave Dynamic Memory Management (EDMM), i.e. dynamic EPC page management, to dynamically allocate/deallocate heap pages.
- Dynamically relaxing/restricting EPC page permissions.
SGX2 instruction set:
SGX2 offers the instructions below to enable these features. Please refer to the Intel SDM, chapter "Introduction to Intel Software Guard Extensions", for more details.

In-Kernel Driver Support:
SGX2 in-kernel driver changes will probably be part of the 5.20 kernel, which should be out sometime in the first week of October 2022.
My current PoC is based on V4 of the submitted kernel patch series. V5 seems to be the final version and the maintainers are satisfied. Since V5 contains only a naming change (see below), the plan is to continue with V4 and, once the PR is reviewed and validated by other teams, move to V5.
V5 Changes:
The `SGX_IOC_ENCLAVE_MODIFY_TYPES` ioctl()'s struct was renamed from `struct sgx_enclave_modify_type` to `struct sgx_enclave_modify_types`.
User Level SGX2 IOCTLs exposed by in-kernel driver:
- `SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS`: With this IOCTL the user specifies a page range and the Enclave Page Cache Map (EPCM) permissions to be applied to all pages in the provided range. `ENCLS[EMODPR]` is run to restrict the EPCM permissions, followed by the `ENCLS[ETRACK]` flow, which ensures that no cached linear-to-physical address mappings to the changed pages remain.
- `SGX_IOC_ENCLAVE_MODIFY_TYPES`: This IOCTL changes the type of an enclave page from a regular (`SGX_PAGE_TYPE_REG`) page to a TCS (`SGX_PAGE_TYPE_TCS`) page, or from a regular or TCS page to a trimmed (`SGX_PAGE_TYPE_TRIM`) page, setting it up for later removal.
- `SGX_IOC_ENCLAVE_REMOVE_PAGES`: With this IOCTL the user specifies a page range to be removed. All pages in the provided range must have the `SGX_PAGE_TYPE_TRIM` page type, or the request fails with `EPERM` (Operation not permitted). Page removal can fail on any page within the provided range; this IOCTL supports partial success by returning the number of pages that were successfully removed.
High-level Flow diagrams:
Page Allocation:

The page allocation sequence diagram shows how EPC pages within the enclave's ELRANGE are dynamically allocated. The steps:
- The enclave invokes `ENCLU[EACCEPT]` on a new page, which triggers a page fault (#PF) because the page is not yet available.
- The in-kernel driver catches this #PF and issues `ENCLS[EAUG]` for the page; at this point the page becomes VALID and may be used by the enclave.
- Once the driver has `EAUG`ed the page, control returns to the untrusted PAL.
- The untrusted PAL invokes `ENCLU[ERESUME]` to return control to the enclave.
- The enclave retries the same `ENCLU[EACCEPT]`; this time the instruction succeeds and the page is dynamically allocated.
Page Deallocation (Removal):

The deallocation sequence removes an EPC page at the enclave's request. The steps:
- The enclave calls the in-kernel driver IOCTL (`SGX_IOC_ENCLAVE_MODIFY_TYPES`) to change the page's type to `PT_TRIM`.
- The kernel invokes `ENCLS[ETRACK]` to track the page's address on all CPUs and issues IPIs to flush stale TLB entries.
- The enclave issues an `ENCLU[EACCEPT]` to accept the changes to each EPC page.
- The enclave notifies the kernel to remove the EPC pages (`SGX_IOC_ENCLAVE_REMOVE_PAGES` IOCTL).
- The kernel issues `ENCLS[EREMOVE]` to complete the request.
EPC page removal is expensive due to this two-stage flow, so it needs some optimization (see the Lazy Free optimization below).
Relaxing Page Permissions:

As the name indicates, relaxing page permissions extends a page's permissions, for example from R to RW. The steps involved:
- The enclave issues `ENCLU[EMODPE]` for each page to extend the EPCM permissions associated with the EPC page.
- The enclave then calls the `mprotect` syscall to request that the OS update the page tables to match the new EPCM permissions.

Step 2 can be skipped if there is no cached linear-to-physical address translation in the TLB, but if a stale translation with more restrictive permissions exists for a page, it can lead to a #PF. To avoid this, it is better to proactively call `mprotect`, which exits the enclave and thereby clears the TLB.
As an alternative to calling `mprotect`, there is an ongoing discussion with the SGX architecture team about implementing a spurious-exception handler that can analyze and ignore such faults caused by stale TLB entries. Nothing conclusive yet.
Restricting Page Permissions:

As the name indicates, restricting page permissions limits a page's permissions, for example from RW to R. The steps involved:
- The enclave calls the in-kernel driver IOCTL (`SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS`) to restrict the EPCM permissions associated with an EPC page.
- The kernel invokes `ENCLS[EMODPR]` and then `ENCLS[ETRACK]` to track the page's address on all CPUs and issues IPIs to flush stale TLB entries.
- The enclave issues an `ENCLU[EACCEPT]` to accept the restricted page permissions for each EPC page.
Optimizations:
Based on my tests with a few benchmarks, I observed that a naive implementation of the SGX2 features adversely impacts performance. To overcome this, I profiled the code and came up with the following optimizations.
Hybrid Allocation:
As the name indicates, the user can precisely set the amount of heap to preheat by specifying a size; the remaining requests are dynamically allocated. For example, when the size is "64M", Gramine pre-faults the top 64M of heap pages and adds them to the enclave at build time; any further requests are served dynamically. This balances the negative impact of EDMM on total run time, which shifts the page-fault cost to the runtime phase.
Lazy Free:
Lazy free optimization introduces a manifest syntax that specifies the percentage of the total heap that can be freed in a lazy manner. Until this threshold is met, Gramine doesn't release any dynamically allocated memory. This optimization helps reduce the expensive enclave entries/exits associated with the dynamic freeing of EPC pages.
Implementation Steps:
1. Extend current code to store EPC page permissions. (This will be a NOP but will help when enabling EDMM.)
   - Update the `heap_vma` struct to store the page permissions for each VMA region.
   - Merge VMA regions only if they have the same permissions. If a newly requested VMA region overlaps with an existing region, split it and update the permissions only where the requested and overlapping permissions differ.
2. Introduce dynamic page permissions.
   - Add an `sgx.edmm_enable = true | false` manifest option to turn on SGX2 features.
   - Add OCALL support for `SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS`.
   - Add support for the `mprotect` syscall.
   - Add support for relaxing/restricting page permissions.
   - Enhance/introduce a new LibOS test to validate page permissions.
   - Print a warning on failure on non-SGX2 systems.
3. Introduce naive dynamic memory allocation.
   - Update the sign tool to skip the heap region from measurement if `sgx.edmm_enable_heap = true`.
   - Add support for dynamic heap allocation.
   - Add support for dynamic heap deallocation.
   - Enhance/introduce a new LibOS test to validate dynamic page allocation/deallocation.
4. Introduce the Hybrid optimization.
   - Add a `preheat_size = "size"` manifest option.
   - Exclude this size from the top of the heap, since we start allocation from the top.
5. Introduce the Lazy Free optimization.
   - Add an `edmm_lazyfree_percentage = [NUM]` manifest option, where NUM is the percentage of the total heap that can be freed in a lazy manner.
   - The idea is not to free an EPC page until we hit the threshold, but to reuse it when memory is requested. We must be careful not to `ENCLU[EACCEPT]` an already-EACCEPTed page due to the following security issue.
NOTE
`ENCLU[EACCEPT]` on an already-EACCEPTed page is forbidden due to the following security issue:
Say page A is valid at a given virtual address (VA). `ENCLU[EACCEPT]` on page A again would not itself be a problem. But knowing that the enclave issues `ENCLU[EACCEPT]` on page A's VA, an adversary could `EAUG` a new page B at the same VA. Then both pages A and B are valid at the same VA, and the adversary can switch between pages A and B depending on which data it wants the enclave to see.
Testing Plans:
Should I add a LibOS unit test to dynamically mmap, unmmap, and change permissions for EPC memory or extend our current tests?
Since the in-kernel driver changes are not yet upstreamed, we will have to maintain the code and make sure it doesn't break with recent changes to master. This requires us to set up a CI environment that applies the EDMM changes and triggers our CI tests to ensure everything works. If we see merge conflicts, I can resolve them and push the latest changes. This cycle will continue until the kernel driver is released. Working with the S3 team on this.
Previous Attempt:
Based on the out-of-tree (OOT) driver, I had some initial EDMM support in Graphene, but the OOT driver was deprecated and the effort was not pursued. Here is the GitHub link: https://github.com/gramineproject/graphene/pull/2190
Next steps (Phase 2):
- The plan is to add support for dynamic thread creation.
- A spurious-exception handler to address stale TLB entries when page permissions are relaxed?
Thanks for the great write-up! Very easy to follow.
> Should I add a LibOS unit test to dynamically mmap, unmmap, and change permissions for EPC memory or extend our current tests?
Please add a separate LibOS test (maybe several, but I would prefer just one test). This way we can just mark this particular test as "requires-EDMM" and skip it in our normal CI, and only run it in the EDMM-enlightened CI.
> Please add a separate LibOS test (maybe several, but I would prefer just one test). This way we can just mark this particular test as "requires-EDMM" and skip it in our normal CI, and only run it in the EDMM-enlightened CI.
Sure will do, it makes sense to do it this way.
> My current PoC is based on the V4 version of the submitted kernel patch series. V5 seems to be the final one and the maintainers are satisfied. Since V5 has only a naming change (see below), the plan is to continue with V4, and once the PR is reviewed and validated by other teams, I plan to move to V5.
Please rebase to V5 before submitting. I don't understand what's the point of reviewing code based on outdated upstream, just to later have to review the rebase diff...
What about the heap pool resizing? (as we discussed on the call - resizing it like std::vector from C++, but possibly with different ratios)
It should generalize/supersede the hybrid/lazy approaches.
OK, will rebase to V5. Although things look good on the driver side, it is not yet confirmed that V5 will be the last version, so I didn't want to keep rebasing unless there were user-space-related changes.
> What about the heap pool resizing?
Yes, looking into this. I will come up with an initial design and review it with maintainers.
> It should generalize/supersede the hybrid/lazy approaches
Heap pool resizing is associated with how we free the heap, but hybrid optimization is to do with pre-allocating (using EADD) memory to offset the dynamic page allocation cost. So, heap pool optimization might not help with hybrid optimization.
> Heap pool resizing is associated with how we free the heap
No, the idea is to also grow it in bigger chunks. Of course, this assumes that user allocations are usually either next to each other or not MAP_FIXED, but I think that's the case for almost all apps.
> What about the heap pool resizing? (as we discussed on the call - resizing it like std::vector from C++, but possibly with different ratios) It should generalize/supersede the hybrid/lazy approaches. No, the idea is to also grow it in bigger chunks. Ofc. this assumes that user allocations are usually either next to each other or not MAP_FIXED, but I think that's the case for almost all apps.
There are two issues here:
- `std::vector` has linear memory, where one part is completely allocated and the other is completely free, and these two parts do not overlap. This is not the case here: we have an arbitrary layout of free and allocated ranges of memory. When you free an element in `std::vector` it shifts the memory; we cannot do this and need to leave a hole. Given the above, this optimization might not work as intended. Or it might work well, I have no idea, but it's not so obvious.
- All memory map requests coming from LibOS have a fixed address, so overallocation is not so obvious. Also, requests could have different memory permissions and, given the description in this issue, it's not clear to me that changing memory permissions (on overallocated pages) is faster than just allocating the memory.
@boryspoplawski: My assumption is that in practice most LibOS allocation requests are just trying to expand heap via mmap to handle a small allocation from app's malloc().
> All memory map requests coming from LibOS have fixed address
Oh, I forgot about this, it may actually be a huge obstacle for this idea :/
Summarizing opens that were discussed offline:
- Relaxing page permissions: Relaxing page permissions can result in a #PF if there was a stale TLB entry with more restrictive permissions. To handle this, we plan to implement a spurious-exception handler and remove the `mprotect` OCALL that was proposed in the design.
- > 1. Extend current code to store EPC page permission. (This will be a NOP but will help when enabling EDMM)

  Extending the PAL to store EPC page permissions would make it more complex. To simplify, the plan is to remove VMA bookkeeping from the PAL and use the LibOS VMA subsystem, as it already stores page permissions. Please see https://github.com/gramineproject/gramine/issues/741 for more info.
- EDMM allocation: Currently the driver allocates using a page-fault-based mechanism, but in an upcoming release it plans to add `MAP_POPULATE` support to pre-allocate memory. Will this new change be backward compatible so that Gramine works seamlessly? It might not, and we might need to make some changes in Gramine to handle an `mmap` request with `MAP_POPULATE` before doing an `EACCEPT`. But given the current plans for the driver, it looks like we have no option but to go with the #PF-based approach for now. The initial/PoC version of `MAP_POPULATE` support in the driver is here: https://lore.kernel.org/all/[email protected]/
One more thing I forgot to mention is the use of the enclave_size manifest option: https://gramine.readthedocs.io/en/stable/manifest-syntax.html#enclave-size.
My current PoC and design assume we will still have this option: the user can specify a large upper-bound value but will end up using EPC memory based on the actual memory requirement.
Thanks for the proposal and summary!
> Lazy Free: Lazy free optimization introduces a manifest syntax that specifies the percentage of the total heap that can be freed in a lazy manner. Until this threshold is met, Gramine doesn't release any dynamically allocated memory. This optimization helps reduce the expensive enclave entries/exits associated with the dynamic freeing of EPC pages.
Quick question: since the main cause of the performance impact here seems to be the expensive enclave entries/exits, I am wondering whether specifying only the percentage of the total heap for lazy free is enough. Should we also consider the number of memory ranges to free?
It is actually the number of memory ranges that are freed. How it works is that the percentage is converted to a threshold (in bytes) and whenever a memory range is freed by the application, the freed size is accumulated. When the accumulated free size grows above the threshold, the memory ranges are removed from the enclave.
The reason I chose percentage is that it is easier for the end-user to tune.
@vijaydhanraj The initial EDMM support was implemented in Gramine now.
We also have a separate issue on adding optimizations (like lazy allocation) to EDMM: https://github.com/gramineproject/gramine/issues/1099
Looks like the only thing left is adding a separate issue on dynamic thread creation with EDMM. Could you create such an issue?
Let me close this issue, since it is basically completed.
@dimakuv there are two more tasks as part of this issue: 1) the hybrid optimization and 2) the lazy free optimization, which are not complete yet. But I agree we can close this; I can create another issue for EDMM optimizations, or reuse #1099 and add these optimizations to it, calling it "EDMM optimizations" instead of "lazy allocation".
Please let me know which is preferred.
I think it's better to create separate issues. I have a feeling only a subset of these three optimizations will be merged into Gramine (as the others may not yield sufficient perf gains). So we will maybe fix e.g. two of the three, but the third one we'll close as not relevant.
Created separate issues for EDMM optimization: https://github.com/gramineproject/gramine/issues/1221, https://github.com/gramineproject/gramine/issues/1222 (label: Enhancement, Priority P0)
Created an issue for dynamic thread creation feature: https://github.com/gramineproject/gramine/issues/1223 (label: Feature, Priority P0)
Thanks @vijaydhanraj. Marked all these new issues with respective labels.