Python fails to load all site-packages/system path inside Gramine

Open anjalirai-intel opened this issue 2 years ago • 6 comments

Description of the problem

Python fails to load the PyTorch example workload when the required libraries live in different site-packages directories, even after those directories are mounted (CentOS).

Ref: https://github.com/gramineproject/examples/tree/master/pytorch

I was debugging the PyTorch example inside a CentOS Docker container. Initially it failed with the following error:

Traceback (most recent call last):
  File "./pytorchexample.py", line 4, in <module>
    from torchvision import models
  File "/usr/local/lib64/python3.6/site-packages/torchvision/__init__.py", line 4, in <module>
    from .extension import _HAS_OPS
  File "/usr/local/lib64/python3.6/site-packages/torchvision/extension.py", line 6, in <module>
    import torch
  File "/usr/local/lib64/python3.6/site-packages/torch/__init__.py", line 29, in <module>
    from .torch_version import __version__ as __version__
  File "/usr/local/lib64/python3.6/site-packages/torch/torch_version.py", line 3, in <module>
    from pkg_resources import packaging  # type: ignore[attr-defined]

ModuleNotFoundError: No module named 'pkg_resources'

I searched for pkg_resources on my system and found it in a different site-packages directory than torchvision:

torchvision: /usr/local/lib64/python3.6/site-packages
pkg_resources: /usr/lib/python3.6/site-packages

As the traceback shows, torch (imported by torchvision) requires the pkg_resources library, which is installed under /usr/lib. I mounted the pkg_resources path in manifest.template:

[[fs.mounts]]
type = "chroot"
uri = "file:/usr/lib/python3.6/site-packages/"
path = "/usr/lib/python3.6/site-packages/"

Even after mounting, the test still fails with the same error.

After debugging for a while, we noticed that Python under gramine-direct does not have the same sys.path as Python run natively. None of the /usr/lib paths are present in sys.path, so Python never looks there and fails to find libraries installed in other site-packages locations:

[intel@fc881c4f9bda sample]$ gramine-direct ./python get_path.py
['/', '/lib64/python36.zip', '/lib64/python3.6', '/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/lib64/python3.6/site-packages']

[intel@fc881c4f9bda sample]$ python3 get_path.py
['/home/intel/gramine_install/usr/lib64/python3.6/site-packages', '/home/intel/anjali/gramine/Scripts', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages']
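The get_path.py script itself is not shown in this thread. A minimal sketch that would produce output of the same shape, assuming the script does nothing more than print sys.path (the helper name get_search_path is hypothetical):

```python
# get_path.py: hypothetical reconstruction, the original script is not in the thread.
import sys


def get_search_path():
    """Return a copy of the module search path computed by this interpreter."""
    return list(sys.path)


if __name__ == "__main__":
    # Prints a Python list literal, matching the format of the outputs above.
    print(get_search_path())
```

Running the same file natively and under gramine-direct makes the two search paths directly comparable.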

Steps to reproduce

Install setuptools in a different location, such as /usr/lib/python3.6/site-packages.
Install torch (by default it is installed under /usr/local/lib64 or /home/username/local).

Expected results

sys.path should contain all the system paths:

['/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages']

Actual results

['/', '/lib64/python36.zip', '/lib64/python3.6', '/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/lib64/python3.6/site-packages']
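To make the discrepancy easier to see, here is a small sketch that compares the expected and actual lists reported in this issue. The two lists are copied from the sections above; the helper name path_diff is only illustrative.

```python
# sys.path expected by the reporter (from "Expected results").
expected = [
    '/usr/lib64/python36.zip', '/usr/lib64/python3.6',
    '/usr/lib64/python3.6/lib-dynload',
    '/usr/local/lib64/python3.6/site-packages',
    '/usr/local/lib/python3.6/site-packages',
    '/usr/lib64/python3.6/site-packages',
    '/usr/lib/python3.6/site-packages',
]

# sys.path observed under gramine-direct (from "Actual results").
actual = [
    '/', '/lib64/python36.zip', '/lib64/python3.6',
    '/lib64/python3.6/lib-dynload',
    '/usr/local/lib64/python3.6/site-packages',
    '/usr/local/lib/python3.6/site-packages',
    '/lib64/python3.6/site-packages',
]


def path_diff(expected, actual):
    """Return (missing, unexpected) entries, keeping the original order."""
    missing = [p for p in expected if p not in actual]
    unexpected = [p for p in actual if p not in expected]
    return missing, unexpected


missing, unexpected = path_diff(expected, actual)
print("missing inside Gramine:", missing)
print("only inside Gramine:   ", unexpected)
```

Only the two /usr/local entries are common to both lists. The remaining entries inside Gramine are rooted at / instead of /usr, and /usr/lib/python3.6/site-packages has no counterpart at all.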

Gramine commit hash

49413a8a5817e67f055a22edeb8804def851deb8

anjalirai-intel, May 02 '22 12:05

I can confirm this. Anjali (@anjalirai-intel) and I had a debug session, and neither of us has any clue why this discrepancy happens.

@woju ?

dimakuv, May 02 '22 13:05

loader.insecure__use_host_env = true in the manifest could be the culprit. I suspect you tweak PYTHONPATH and similar variables on the host (which is often the case with Gramine) and then forward them into Gramine, where at least some of the paths are different.

boryspoplawski, May 02 '22 13:05

Hi @boryspoplawski

I tried your suggestion and removed loader.insecure__use_host_env = true from the manifest. Here is sys.path with each setting:

loader.insecure__use_host_env = true
['/', '/home/intel/gramine_install/usr/lib64/python3.6/site-packages', '/home/intel/anjali/test/gramine/Scripts', '/', '/lib64/python36.zip', '/lib64/python3.6', '/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/lib64/python3.6/site-packages']

loader.insecure__use_host_env = false
['/', '/lib64/python36.zip', '/lib64/python3.6', '/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/lib64/python3.6/site-packages']

I still don't see /usr/lib

anjalirai-intel, May 04 '22 05:05

Maybe you can try adding loader.argv0_override="/usr/libexec/platform-python3.6" to the manifest? I think python guesses the path from argv[0].
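One way to explore the argv[0] hypothesis outside Gramine is to start the same interpreter binary with a different argv[0] and look at the resulting sys.path. The sketch below is only an experiment helper, not part of Gramine; the function name sys_path_with_argv0 is hypothetical, and whether argv[0] changes the result at all depends on the Python version and on how the distro built it.

```python
import json
import subprocess
import sys


def sys_path_with_argv0(argv0, interpreter=sys.executable):
    """Run `interpreter` with argv[0] set to `argv0`; return the child's sys.path."""
    child_code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [argv0, "-c", child_code],  # the first element becomes the child's argv[0]
        executable=interpreter,     # the binary that is actually executed
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Baseline: argv[0] equals the real interpreter path.
    print(sys_path_with_argv0(sys.executable))
```

Calling the helper with other values, for example "./python" or the "/usr/libexec/platform-python3.6" path suggested above, shows how a given Python build reacts. Some builds may print warnings or fail to start when argv[0] points at a location without a standard library next to it.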

lejunzhu, May 05 '22 06:05

Hi @lejunzhu

After adding this to the manifest file, the workload passes and sys.path contains the correct system paths:

gramine-direct ./python get_path.py
['/', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/usr/local/lib64/python3.6/site-packages', '/usr/local/lib/python3.6/site-packages', '/usr/lib64/python3.6/site-packages', '/usr/lib/python3.6/site-packages']

anjalirai-intel, May 05 '22 06:05

Quick update: @woju said that this is hard to explain because the Python build shipped by each OS distro (Ubuntu, CentOS, Arch, etc.) is slightly different, and the path-finding routines differ between them.

There seems to be no general workaround that would work for all users and all OS distros.

The only reasonable workaround seems to be specifying the loader.argv0_override manifest option, as @lejunzhu mentioned, in a specific Docker image with well-known paths (so that the paths can be hard-coded in the Gramine manifest file). @woju will hopefully join this thread with some explanations.

dimakuv, May 10 '22 14:05

@anjalirai-intel Is this something we need to keep open? It looks like the problem is non-trivial to solve and not really Gramine's fault. The workaround also seems to work (although it now requires loader.argv, since loader.argv0_override was deprecated).

dimakuv, Mar 09 '23 14:03

@dimakuv Yes, there is a workaround. If you want to close it with the workaround, that is also fine.

anjalirai-intel, Mar 11 '23 04:03