# PyTorch example failing on CentOS/RHEL machines
### Description of the problem
The PyTorch example from the https://github.com/gramineproject/examples/tree/master/pytorch repository fails with a memory fault on CentOS and RHEL machines after a recent commit, "Rewrite sysfs topology support". It still passes with the previous commit 2628ef6ba7df94fcdee811641d72bf2b835ca5d9.
### Steps to reproduce
Run the command below:

```sh
gramine-sgx ./pytorch ./pytorchexample.py
```
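(This assumes the example was already built; Gramine examples typically build with `make SGX=1`.)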
### Expected results

The example runs to completion, as it does with the previous commit.
### Actual results
The error log and the manifest.sgx file are attached below, along with a passing log from the previous commit.
- pytorch_error_log.txt
- pytorch_manifest_sgx.txt
- pytorch_pass_log_prev_commit.txt
Please feel free to ping me in case you need the machine.
Here is a snippet from the error log:
```
[P1:T115:platform-python3.6] trace: ---- shim_mmap(0, 0x8002000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P1:T115:platform-python3.6] trace: ---- return from shim_mmap(...) = -12
[P1:T115:platform-python3.6] debug: memory fault at 0x00000000 (IP = 0x00000000)
[P1:T115:platform-python3.6] debug: killed by signal 11
```
So it looks like Python is trying to allocate memory for many threads (note the 115th thread) and runs out of it under SGX (-12 = -ENOMEM). This feels very much like this issue: https://github.com/gramineproject/gramine/issues/342#issuecomment-1014475710
@aniket-intelx Please try adding `loader.env.M_ARENA_MAX = "1"` to the manifest file and rerun PyTorch.
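For reference, the suggested addition would look like this in the PyTorch manifest (a minimal sketch; exact placement in the file is illustrative):

```toml
# Knob suggested in this thread: limit malloc to a single arena so
# that the many threads do not each reserve a large heap arena
# inside the enclave.
loader.env.M_ARENA_MAX = "1"
```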
Explanation: before the "Rewrite sysfs topology support" commit, Gramine required the explicit manifest option `fs.experimental__enable_sysfs_topology = true`. This option is not set in the current PyTorch manifest file, so previously no sysfs topology was visible to Python/PyTorch. With the new commit, sysfs topology is enabled by default, so PyTorch now sees it and performs thread-related tweaks such as asking for more memory. And that's why it now fails.
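For context, the old opt-in looked like this in a manifest; this line is absent from the PyTorch manifest, which is why no topology was visible before:

```toml
# Pre-"Rewrite sysfs topology support": sysfs topology was exposed
# only if this experimental option was set explicitly.
fs.experimental__enable_sysfs_topology = true
```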
@dimakuv The change is not helping; PyTorch still fails with the same memory fault at the same T115. I have sent you the machine and workspace details privately on Teams.
The example executes successfully after setting `sgx.enclave_size = "32G"`, without adding `loader.env.M_ARENA_MAX = "1"`.
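In manifest terms, that workaround amounts to the following change (a sketch; the example's current value is "16G"):

```toml
# Workaround: double the enclave size so all threads' heap
# allocations fit; works, but costs extra enclave memory.
sgx.enclave_size = "32G"
```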
@aniket-intelx Interesting. What about setting `loader.env.OMP_NUM_THREADS = "8"`, for example? This is supposed to force PyTorch to use only 8 threads (instead of 115 or more).

For this experiment, `sgx.enclave_size` should stay at "16G". If we want to change the PyTorch manifest file, I'd prefer capping the number of threads over increasing the enclave size...
Setting `loader.env.OMP_NUM_THREADS = "8"` also works for this example.
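For the record, the resulting manifest change is presumably along these lines (a sketch with an explanatory comment, not the exact diff of the fix commit):

```toml
# With sysfs topology now exposed by default, PyTorch/OpenMP spawns
# roughly one thread per visible CPU (100+ here), and their memory
# demands exhaust the 16G enclave. Cap the thread count instead of
# growing the enclave.
loader.env.OMP_NUM_THREADS = "8"
```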
Nice! @aniket-intelx Are you planning to submit a PR that fixes the PyTorch example with this setting (`OMP_NUM_THREADS`), plus a comment explaining it?
Sure.
@dimakuv Can you close this issue? The fix for it was already pushed: https://github.com/gramineproject/examples/commit/2dba8a79c1d2042291809fa174da18d076b95910
Aniket is no longer with Intel and I don't have rights to close issues on his behalf.