# PyTorch example failing on CentOS/RHEL machines
### Description of the problem
The PyTorch example from the https://github.com/gramineproject/examples/tree/master/pytorch repository fails with a memory fault on CentOS and RHEL machines after a recent commit, "Rewrite sysfs topology support". It still passes with the previous commit 2628ef6ba7df94fcdee811641d72bf2b835ca5d9.
### Steps to reproduce
Run the command below:

```sh
gramine-sgx ./pytorch ./pytorchexample.py
```
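(This assumes the example was already built; Gramine examples typically build with `make SGX=1`.)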
### Expected results

The example runs to completion, as it does with the previous commit.
### Actual results
The error log and the manifest.sgx file are attached below, along with a passing log from the previous commit.
- pytorch_error_log.txt
- pytorch_manifest_sgx.txt
- pytorch_pass_log_prev_commit.txt
Please feel free to ping me in case you need the machine.
Here is a snippet from the error log:
```
[P1:T115:platform-python3.6] trace: ---- shim_mmap(0, 0x8002000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P1:T115:platform-python3.6] trace: ---- return from shim_mmap(...) = -12
[P1:T115:platform-python3.6] debug: memory fault at 0x00000000 (IP = 0x00000000)
[P1:T115:platform-python3.6] debug: killed by signal 11
```
So it looks like Python is trying to allocate memory for many threads (note the 115th thread) and runs out of it under SGX (-12 = -ENOMEM). This feels very much like this issue: https://github.com/gramineproject/gramine/issues/342#issuecomment-1014475710
@aniket-intelx Please try adding `loader.env.M_ARENA_MAX = "1"` to the manifest file and rerun PyTorch.
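For reference, the suggested addition would look like this in the PyTorch manifest (a minimal sketch; exact placement in the file is illustrative):

```toml
# Knob suggested in this thread: limit malloc to a single arena so
# that the many threads do not each reserve a large heap arena
# inside the enclave.
loader.env.M_ARENA_MAX = "1"
```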
Explanation: before the "Rewrite sysfs topology support" commit, Gramine required the explicit manifest option `fs.experimental__enable_sysfs_topology = true`. This option is not set in the current PyTorch manifest file, so previously no sysfs topology was visible to Python/PyTorch. With the new commit, sysfs topology is enabled by default, so PyTorch now sees it and performs thread-related tweaks such as asking for more memory. And that's why it now fails.
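For context, the old opt-in looked like this in a manifest; this line is absent from the PyTorch manifest, which is why no topology was visible before:

```toml
# Pre-"Rewrite sysfs topology support": sysfs topology was exposed
# only if this experimental option was set explicitly.
fs.experimental__enable_sysfs_topology = true
```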
@dimakuv The change is not helping; PyTorch still fails with the same memory fault at the same T115. I have sent you the machine and workspace details privately on Teams.
The example executes successfully after setting `sgx.enclave_size = "32G"`, without adding `loader.env.M_ARENA_MAX = "1"`.
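In manifest terms, that workaround amounts to the following change (a sketch; the example's current value is "16G"):

```toml
# Workaround: double the enclave size so all threads' heap
# allocations fit; works, but costs extra enclave memory.
sgx.enclave_size = "32G"
```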
@aniket-intelx Interesting. What about setting `loader.env.OMP_NUM_THREADS = "8"`, for example? This is supposed to force PyTorch to use only 8 threads (instead of 115 or more).

For this experiment, `sgx.enclave_size` should stay at "16G". If we want to change the PyTorch manifest file, I'd prefer capping the number of threads over increasing the enclave size...
Setting `loader.env.OMP_NUM_THREADS = "8"` also works for this example.
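For the record, the resulting manifest change is presumably along these lines (a sketch with an explanatory comment, not the exact diff of the fix commit):

```toml
# With sysfs topology now exposed by default, PyTorch/OpenMP spawns
# roughly one thread per visible CPU (100+ here), and their memory
# demands exhaust the 16G enclave. Cap the thread count instead of
# growing the enclave.
loader.env.OMP_NUM_THREADS = "8"
```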
Nice! @aniket-intelx Are you planning to submit a PR that fixes the PyTorch example with this setting (`OMP_NUM_THREADS`), plus a comment explaining it?
Sure.
@dimakuv Can you close this issue? The fix for it was already pushed: https://github.com/gramineproject/examples/commit/2dba8a79c1d2042291809fa174da18d076b95910
Aniket is no longer with Intel and I don't have rights to close issues on his behalf.