[BUG]: tensornvme installation is incomplete in official docker images

Open MEllis-github opened this issue 2 years ago • 0 comments

🐛 Describe the bug

Description

The official Docker images run the TensorNVMe install commands. However, at runtime, executing cd TensorNVMe && tensornvme check (or running the training demos that depend on tensornvme) produces ImportError: libaio.so.1: cannot open shared object file: No such file or directory.
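To confirm the symptom independently of tensornvme, one can ask the dynamic loader directly whether it can open the library. This is a minimal sketch that uses only the standard ctypes module; the helper name can_load is made up for illustration and is not part of TensorNVMe.

```python
import ctypes
import os


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False.

    Hypothetical helper for diagnosis only; not part of TensorNVMe.
    """
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object cannot be found or opened,
        # which is the same condition behind the ImportError above.
        return False


if __name__ == "__main__":
    # Show the search path the loader sees, then probe for libaio.
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", ""))
    print("libaio.so.1 loadable:", can_load("libaio.so.1"))
```

If this reports that libaio.so.1 is not loadable inside the container, the problem lies in the library install location or search path rather than in the Python package itself.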

A few key observations from the container at runtime:

  • LD_LIBRARY_PATH is not correctly updated in the image build. At runtime it is LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64.
  • Running find / -type d -iname ".tensornvme" -ls does not locate the install directory .tensornvme.
  • The last command of the image build does not include WITH_ROOT=1, so the install directory is ~/.tensornvme rather than /usr, and ~ (or $HOME) evaluates to / in the container at runtime.
  • Attempting to install it at container runtime confirms the attempted install location is /.tensornvme:
$ cd TensorNVMe 
$ pip install -v --no-cache-dir .
Using pip 21.2.4 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
Processing /workspace/TensorNVMe
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
    Running command python setup.py egg_info
    running egg_info
    creating /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info
    writing /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/PKG-INFO
    writing dependency_links to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/dependency_links.txt
    writing entry points to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/entry_points.txt
    writing requirements to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/requires.txt
    writing top-level names to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/top_level.txt
    writing manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
    reading manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (23.0)
Requirement already satisfied: click in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (8.1.3)
Requirement already satisfied: torch in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (1.12.1)
Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.9/site-packages (from torch->tensornvme==0.1.0) (4.4.0)
Building wheels for collected packages: tensornvme
  Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-tqkdcb5z
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-req-build-r7tnomax/setup.py", line 120, in <module>
      setup_dependencies()
    File "/tmp/pip-req-build-r7tnomax/setup.py", line 105, in setup_dependencies
      os.makedirs(backend_install_dir, exist_ok=True)
    File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
      mkdir(name, mode)
  PermissionError: [Errno 13] Permission denied: '/.tensornvme'
  Building wheel for tensornvme (setup.py) ... error
  ERROR: Failed building wheel for tensornvme
  Running setup.py clean for tensornvme
  Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
  running clean
  'build/lib' does not exist -- can't clean it
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.9' does not exist -- can't clean it
Failed to build tensornvme
Installing collected packages: tensornvme
  Attempting uninstall: tensornvme
    Found existing installation: tensornvme 0.1.0
    Uninstalling tensornvme-0.1.0:
      Removing file or directory /opt/conda/bin/tensornvme
      Removing file or directory /opt/conda/lib/python3.9/site-packages/tensornvme-0.1.0.dist-info/
      Removing file or directory /opt/conda/lib/python3.9/site-packages/tensornvme/
      Successfully uninstalled tensornvme-0.1.0
    Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-q5nfdluf/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.9/tensornvme
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-r7tnomax/setup.py", line 120, in <module>
        setup_dependencies()
      File "/tmp/pip-req-build-r7tnomax/setup.py", line 105, in setup_dependencies
        os.makedirs(backend_install_dir, exist_ok=True)
      File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
        mkdir(name, mode)
    PermissionError: [Errno 13] Permission denied: '/.tensornvme'
    Running setup.py install for tensornvme ... error
  Rolling back uninstall of tensornvme
  Moving to /opt/conda/bin/tensornvme
   from /tmp/pip-uninstall-0rqpd__w/tensornvme
  Moving to /opt/conda/lib/python3.9/site-packages/tensornvme-0.1.0.dist-info/
   from /opt/conda/lib/python3.9/site-packages/~ensornvme-0.1.0.dist-info
  Moving to /opt/conda/lib/python3.9/site-packages/tensornvme/
   from /opt/conda/lib/python3.9/site-packages/~ensornvme
ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-q5nfdluf/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.9/tensornvme Check the logs for full command output.
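The path in the PermissionError follows from how ~ expands when HOME is /. The sketch below is not the actual setup.py code; resolve_install_dir is a hypothetical function that only mirrors the behavior described above (WITH_ROOT=1 selects /usr, otherwise ~/.tensornvme), assuming a POSIX system.

```python
import os
from unittest import mock


def resolve_install_dir(home: str, with_root: bool = False) -> str:
    """Hypothetical stand-in for the install-dir choice described in this issue.

    Assumption: WITH_ROOT=1 selects /usr, otherwise ~/.tensornvme is used.
    This is NOT copied from TensorNVMe's setup.py.
    """
    if with_root:
        return "/usr"
    # Temporarily set HOME so expanduser resolves "~" against it.
    with mock.patch.dict(os.environ, {"HOME": home}):
        return os.path.expanduser("~/.tensornvme")


if __name__ == "__main__":
    # HOME=/ as observed in the container: the directory lands at the filesystem root.
    print(resolve_install_dir("/"))
    # A conventional home directory for comparison.
    print(resolve_install_dir("/root"))
    # With the root install option, the home directory is not involved.
    print(resolve_install_dir("/", with_root=True))
```

With HOME set to /, the resolved directory is /.tensornvme, which a non-root user cannot create. That matches the failing os.makedirs(backend_install_dir, exist_ok=True) call in the traceback.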

Environment

docker.io/hpcaitech/colossalai:0.2.7

MEllis-github • Apr 20 '23 20:04