
pytorch example failing for CentOS/RHEL machine

aniket-intelx opened this issue 3 years ago · 7 comments

Description of the problem

The pytorch example from the https://github.com/gramineproject/examples/tree/master/pytorch repository is failing with a memory fault on CentOS and RHEL machines after a recent commit, "Rewrite sysfs topology support". It continues to pass with the previous commit 2628ef6ba7df94fcdee811641d72bf2b835ca5d9.

Steps to reproduce

Run the command below:

gramine-sgx ./pytorch ./pytorchexample.py

Expected results

Actual results

The error log and manifest.sgx file are attached below, along with a passing log from the previous commit: pytorch_error_log.txt pytorch_manifest_sgx.txt pytorch_pass_log_prev_commit.txt

Please feel free to ping me in case you need the machine.

aniket-intelx avatar May 05 '22 11:05 aniket-intelx

Here is the snippet from the error log:

[P1:T115:platform-python3.6] trace: ---- shim_mmap(0, 0x8002000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x0) ...
[P1:T115:platform-python3.6] trace: ---- return from shim_mmap(...) = -12
[P1:T115:platform-python3.6] debug: memory fault at 0x00000000 (IP = 0x00000000)
[P1:T115:platform-python3.6] debug: killed by signal 11

So it looks like Python is trying to allocate memory for many threads (note the 115th thread) and runs out of it under SGX (-12 = -ENOMEM). This feels very much like this issue: https://github.com/gramineproject/gramine/issues/342#issuecomment-1014475710

@aniket-intelx Please try to add loader.env.M_ARENA_MAX = "1" in the manifest file and rerun PyTorch.
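For context, Gramine manifests use TOML, and loader.env.* keys set environment variables for the app inside the enclave. A minimal sketch of the suggested change, assuming the example's pytorch.manifest.template:

```toml
# Sketch: M_ARENA_MAX is read by glibc's malloc at startup; limiting it to
# one arena stops each of PyTorch's many threads from reserving its own
# large heap arena inside the size-limited enclave.
loader.env.M_ARENA_MAX = "1"
```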

Explanation: before the "Rewrite sysfs topology support" commit, Gramine required an explicit manifest option fs.experimental__enable_sysfs_topology = true. This option is not set in the current PyTorch manifest file, so previously, running PyTorch resulted in no sysfs topology visible to Python/PyTorch. With the new commit, sysfs topology is enabled by default, so PyTorch can see it and perform thread-related tweaks, such as spawning more threads and asking for more memory. That's why it now fails.

dimakuv avatar May 05 '22 11:05 dimakuv

@dimakuv the change does not help; PyTorch still fails with the same memory fault at the same T115. I have sent you the machine and workspace details personally on Teams.

jinengandhi-intel avatar May 05 '22 13:05 jinengandhi-intel

The example executes successfully after setting sgx.enclave_size = "32G", without adding loader.env.M_ARENA_MAX = "1".

aniket-intelx avatar May 06 '22 05:05 aniket-intelx

@aniket-intelx Interesting. What about setting loader.env.OMP_NUM_THREADS = "8", for example? This is supposed to force PyTorch to use only 8 threads (instead of 115 or more).

For this experiment, sgx.enclave_size should stay at "16G". If we want to change the PyTorch manifest file, I'd prefer fixing the number of threads rather than increasing the enclave size...
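The thread-capping alternative would look like this in the manifest (a sketch that keeps the example's original enclave size):

```toml
# Sketch: OMP_NUM_THREADS is honored by OpenMP runtimes (and thus by
# PyTorch's intra-op thread pools); capping it keeps per-thread memory
# allocations low enough to fit in the default enclave.
loader.env.OMP_NUM_THREADS = "8"
sgx.enclave_size = "16G"
```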

dimakuv avatar May 06 '22 06:05 dimakuv

Setting loader.env.OMP_NUM_THREADS = "8" also works for this example.

aniket-intelx avatar May 06 '22 06:05 aniket-intelx

Nice! @aniket-intelx Are you planning to submit a PR that fixes the PyTorch example with this setting (OMP_NUM_THREADS), plus a comment explaining it?

dimakuv avatar May 06 '22 06:05 dimakuv

Sure.

aniket-intelx avatar May 06 '22 07:05 aniket-intelx

@dimakuv can you close this issue? The fix for it was already pushed: https://github.com/gramineproject/examples/commit/2dba8a79c1d2042291809fa174da18d076b95910

Aniket is no longer with Intel and I don't have rights to close issues on his behalf.

jinengandhi-intel avatar Mar 08 '23 06:03 jinengandhi-intel