
[LibOS] Child process fails during `libos_init` when all TCS are in use

Open forkthus opened this issue 5 months ago • 0 comments

Description of the problem

While investigating the Jenkins-SGX-22.04-Sanitizers build failure on PR #2131, I traced the issue to libos_init, not to the PR itself. A child process occasionally aborts with:

(host_thread.c:310:pal_thread_init) error: There are no available TCS pages left for a new thread. Please try to increase sgx.max_threads in the manifest. The current value is 4

This error is raised during this call: https://github.com/gramineproject/gramine/blob/ff71d7afea730dffd56a97af39bb6a73ee6c7662/libos/src/libos_init.c#L479

I believe the cause is as follows. At the moment the TLS-handshake thread is created in the call above, all four TCS slots allowed by the manifest may already be occupied:

  1. Main thread
  2. IPC worker thread
  3. Async worker thread
  4. TLS-handshake thread spawned inside init_ipc_worker

If the handshake helper thread from init_ipc_worker has not yet unmapped its TCS (unmap_my_tcs), connect_to_process cannot allocate a TCS for the new helper thread and the child exits with the error above.
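To make the slot accounting concrete, here is a minimal Python sketch of the scenario described above. It is a toy model, not Gramine code: TcsPool and child_init are hypothetical names, and the only values taken from this report are the limit of 4 and the four slot occupants listed above.

```python
MAX_THREADS = 4  # the sgx.max_threads value reported in the error message


class TcsPool:
    """Toy model of a fixed pool of TCS slots (hypothetical, not Gramine code)."""

    def __init__(self, size):
        self.free = size

    def acquire(self):
        """Take one slot; return False when none are left."""
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        """Give one slot back to the pool."""
        self.free += 1


def child_init(first_helper_unmapped_in_time):
    """Return True if the second handshake helper gets a TCS slot."""
    pool = TcsPool(MAX_THREADS)
    # The four occupants listed in this report fill the whole pool.
    for _ in ("main", "IPC worker", "async worker", "handshake helper #1"):
        assert pool.acquire()
    if first_helper_unmapped_in_time:
        pool.release()  # stands in for unmap_my_tcs in the first helper
    # Stands in for the helper thread that connect_to_process needs.
    return pool.acquire()


if __name__ == "__main__":
    print("first helper unmapped in time:", child_init(True))
    print("first helper still holds TCS: ", child_init(False))
```

In this model the second helper gets a slot only when the first helper has already released its own, which matches the intermittent behaviour I see.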

I logged calls to unmap_my_tcs. In failing runs, the log contains exactly one fewer unmap_my_tcs message than in successful runs at the point where connect_to_process tries to create its own TLS-handshake thread.

I think this directly reproduces the “no available TCS pages” error and confirms the thread contention described above.

Steps to reproduce

I found this issue while debugging the failure of rlimit_stack on the fork of PR #2131. However, I can reproduce the same issue on the main branch.

  1. Use the docker image provided in .ci and clone the main branch of Gramine
  2. Configure the build in SGX debug mode with ASan and UBSan enabled:
CC=clang CXX=clang++ meson setup build/ --werror --prefix=/workspace/install --buildtype=debug -Ddirect=disabled -Dsgx=enabled -Dtests=enabled -Dlibc=glibc -Dubsan=enabled -Dasan=enabled
  3. Build and install Gramine, then cd to libos/test/regression and run gramine-manifest and gramine-sgx-sign on the rlimit_stack manifest
  4. Repeat gramine-sgx rlimit_stack until it fails. I used this command to reproduce the failure (usually within a few minutes):
while gramine-sgx rlimit_stack | grep -q "TEST OK"; do echo "TEST OK found, running again..."; clear; done
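The loop above does not keep the output of the failing run. As a rough alternative, the following Python sketch reruns the test and returns the output of the first run that lacks "TEST OK". The function name run_until_failure and the max_runs cap are my own assumptions, and it expects to be started from libos/test/regression after the manifest has been generated and signed.

```python
import subprocess
import sys


def run_until_failure(cmd, marker="TEST OK", max_runs=1000):
    """Rerun cmd until its output lacks marker.

    Returns (run_number, output) for the first failing run,
    or None if all max_runs runs contained the marker.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep the error message with the output
            text=True,
        )
        if marker not in result.stdout:
            return run, result.stdout
    return None


if __name__ == "__main__":
    failure = run_until_failure(["gramine-sgx", "rlimit_stack"])
    if failure is None:
        print("no failure observed")
        sys.exit(0)
    run, output = failure
    print(f"run {run} failed:\n{output}")
    sys.exit(1)
```

Unlike the shell loop, this merges stderr into the captured output, so the "no available TCS pages" message ends up next to the rest of the failing run's log.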

Expected results

The TLS-handshake thread created by init_ipc_worker unmaps its TCS before connect_to_process creates its own handshake thread, leaving at least one TCS slot free.

Actual results

Intermittently, the first handshake thread has not yet unmapped its TCS when connect_to_process runs. All four default slots are still occupied, so connect_to_process fails to create the second handshake thread and the child exits with the error above.

Gramine commit hash

ff71d7afea730dffd56a97af39bb6a73ee6c7662 / 8e0313a5f3ad9f505f583cf6975d48d4ed1eea72

forkthus, Jul 11 '25 19:07