
[LibOS] `execve`: eliminate race on `clear_child_tid` before VMAs deallocation

Open forkthus opened this issue 5 months ago • 7 comments

Description of the changes

Fixes #2148

This PR makes execve() wait until every sibling thread’s *clear_child_tid is zeroed before deallocating their VMAs.

Implementation details:

  1. Take the g_thread_list first, as soon as the calling thread can acquire it.
  2. For each sibling thread, the calling thread checks *clear_child_tid: if it is nonzero, the calling thread invokes futex_wait() and is later awakened by release_clear_child_tid() via futex_wake().
  3. Once *clear_child_tid has been cleared for every other thread (except the main thread), the calling thread starts deallocating VMAs.
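
The wait/wake handshake in steps 2 and 3 can be sketched in plain userspace C on top of the Linux futex syscall. This is only an illustration under stated assumptions, not the code in this PR: `struct sibling`, `tid_word`, `release_tid_word()` and `wait_for_siblings()` are hypothetical names, and Gramine's LibOS uses its own internal futex implementation and thread list rather than the host syscall shown here.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for one sibling thread's *clear_child_tid word. */
struct sibling {
    uint32_t tid_word; /* nonzero until the exiting thread has cleared it */
};

/* Thin wrapper around the raw futex syscall (glibc exposes no futex() function). */
static long futex_op(uint32_t* addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Exiting-thread side (hypothetical analogue of release_clear_child_tid()):
 * zero the word first, then wake one waiter sleeping on it. */
static void release_tid_word(struct sibling* s) {
    __atomic_store_n(&s->tid_word, 0, __ATOMIC_RELEASE);
    futex_op(&s->tid_word, FUTEX_WAKE, 1);
}

/* execve side: block until every sibling's word reads zero.
 * FUTEX_WAIT sleeps only if the word still equals `seen`; if the sibling
 * cleared it in between, the call returns immediately and the loop re-checks,
 * so a wake-up cannot be lost. */
static void wait_for_siblings(struct sibling* sibs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t seen;
        while ((seen = __atomic_load_n(&sibs[i].tid_word, __ATOMIC_ACQUIRE)) != 0) {
            futex_op(&sibs[i].tid_word, FUTEX_WAIT, seen);
        }
    }
    /* Only at this point would it be safe to start deallocating VMAs. */
}
```

In a real run, each sibling thread would call `release_tid_word()` on its own exit path while the thread performing `execve()` sits in `wait_for_siblings()`. Re-checking the word in a loop also covers spurious wake-ups and interrupted waits.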

How to test this PR?

Repeating gramine-sgx exec_same [args_#1...args_#49]

Without this PR, the test usually fails within a few minutes because of the issue mentioned above, especially on the branch of PR #1795; the main branch takes longer to fail. With this PR applied, the same loop runs for hours without any failures on the branch of PR #1795.



forkthus avatar Jul 30 '25 19:07 forkthus

Jenkins, test this please

mkow avatar Aug 04 '25 10:08 mkow

Jenkins, retest this please

(All failures seem to be connectivity issues.)

ERROR: Checkout failed
[2025-08-04T11:02:59.002Z] java.io.StreamCorruptedException: invalid stream header: 636F7272
...
[2025-08-04T11:02:59.002Z] Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to penguins-3-noble
[2025-08-04T11:02:59.002Z] Caused: hudson.remoting.RequestAbortedException

[2025-08-04T11:15:36.177Z] Connecting to busybox.net (busybox.net)|140.211.167.122|:443... connected.
[2025-08-04T11:16:44.937Z] Unable to establish SSL connection.
[2025-08-04T11:16:44.937Z] download: WARNING: Hash mismatch: Expected 415fbd89e5344c96acf449d94a6f956dbed62e18e835fc83e064db33a34bd549 but received e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2025-08-04T11:16:44.937Z] download: ERROR: Failed to download 'busybox.tar.bz2' (415fbd89...)! No URLs left to try.
[2025-08-04T11:16:44.937Z] make: *** [Makefile:13: busybox.tar.bz2] Error 1

forkthus avatar Aug 04 '25 21:08 forkthus

Jenkins, retest this please

(All failures seem to be connectivity issues.)

ERROR: Checkout failed
[2025-08-04T11:02:59.002Z] java.io.StreamCorruptedException: invalid stream header: 636F7272
...
[2025-08-04T11:02:59.002Z] Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to penguins-3-noble
[2025-08-04T11:02:59.002Z] Caused: hudson.remoting.RequestAbortedException
[2025-08-04T11:15:36.177Z] Connecting to busybox.net (busybox.net)|140.211.167.122|:443... connected.
[2025-08-04T11:16:44.937Z] Unable to establish SSL connection.
[2025-08-04T11:16:44.937Z] download: WARNING: Hash mismatch: Expected 415fbd89e5344c96acf449d94a6f956dbed62e18e835fc83e064db33a34bd549 but received e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2025-08-04T11:16:44.937Z] download: ERROR: Failed to download 'busybox.tar.bz2' (415fbd89...)! No URLs left to try.
[2025-08-04T11:16:44.937Z] make: *** [Makefile:13: busybox.tar.bz2] Error 1

Could someone kindly help me trigger a Jenkins retest? My retest command doesn’t seem to work—possibly due to a permission issue. I also don’t have access to the build logs.

forkthus avatar Aug 04 '25 22:08 forkthus

Jenkins, retest this please

kailun-qin avatar Aug 05 '25 08:08 kailun-qin

Add to whitelist

donporter avatar Aug 06 '25 20:08 donporter

The deb job failed with: The repository 'http://deb.debian.org/debian bullseye-backports Release' no longer has a Release file.

This looks legit, and unrelated to this PR.

donporter avatar Aug 06 '25 21:08 donporter

Jenkins, retest this please. Just seeing if the webhook gets reactivated.

donporter avatar Aug 20 '25 19:08 donporter