ColossalAI
[BUG]: tensornvme installation is incomplete in official docker images
🐛 Describe the bug
Description
The official Docker images run the TensorNVMe install commands, but at runtime, executing `cd TensorNVMe && tensornvme check` (or running the training demos that depend on tensornvme) fails with `ImportError: libaio.so.1: cannot open shared object file: No such file or directory`.
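To reproduce the symptom without going through the tensornvme CLI, a minimal sketch like the one below asks the dynamic loader directly whether it can resolve `libaio.so.1`. The helper name `can_load_shared_library` is made up for illustration; it only relies on the standard `ctypes.CDLL` call, which raises `OSError` when a library cannot be found.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can resolve `name`, else False.

    Hypothetical helper for illustration; not part of tensornvme.
    """
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library is not on the loader search path
        # (or cannot be loaded for another reason).
        return False


if __name__ == "__main__":
    print("libaio.so.1 loadable:", can_load_shared_library("libaio.so.1"))
```

Inside the affected image this would be expected to report that the library is not loadable, which matches the `ImportError` above.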
A few key observations from container runtime:
- `LD_LIBRARY_PATH` is not correctly updated in the image build. At runtime it is `LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64`.
- Running `find / -type d -iname ".tensornvme" -ls` does not locate the install directory `.tensornvme`.
- The last command of the image build does not include `WITH_ROOT=1`, so the install directory is `~/.tensornvme` versus `/usr`, and `~` or `$HOME` evaluates to `/` in the container runtime.
- Attempting to install it at container runtime confirms the attempted install location is `/.tensornvme`:
$ cd TensorNVMe
$ pip install -v --no-cache-dir .
Using pip 21.2.4 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
Processing /workspace/TensorNVMe
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Running command python setup.py egg_info
running egg_info
creating /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info
writing /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/dependency_links.txt
writing entry points to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/entry_points.txt
writing requirements to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/requires.txt
writing top-level names to /tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/top_level.txt
writing manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file '/tmp/pip-pip-egg-info-ag4spu7n/tensornvme.egg-info/SOURCES.txt'
Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (23.0)
Requirement already satisfied: click in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (8.1.3)
Requirement already satisfied: torch in /opt/conda/lib/python3.9/site-packages (from tensornvme==0.1.0) (1.12.1)
Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.9/site-packages (from torch->tensornvme==0.1.0) (4.4.0)
Building wheels for collected packages: tensornvme
Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-tqkdcb5z
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-r7tnomax/setup.py", line 120, in <module>
setup_dependencies()
File "/tmp/pip-req-build-r7tnomax/setup.py", line 105, in setup_dependencies
os.makedirs(backend_install_dir, exist_ok=True)
File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.tensornvme'
Building wheel for tensornvme (setup.py) ... error
ERROR: Failed building wheel for tensornvme
Running setup.py clean for tensornvme
Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
running clean
'build/lib' does not exist -- can't clean it
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.9' does not exist -- can't clean it
Failed to build tensornvme
Installing collected packages: tensornvme
Attempting uninstall: tensornvme
Found existing installation: tensornvme 0.1.0
Uninstalling tensornvme-0.1.0:
Removing file or directory /opt/conda/bin/tensornvme
Removing file or directory /opt/conda/lib/python3.9/site-packages/tensornvme-0.1.0.dist-info/
Removing file or directory /opt/conda/lib/python3.9/site-packages/tensornvme/
Successfully uninstalled tensornvme-0.1.0
Running command /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-q5nfdluf/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.9/tensornvme
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-r7tnomax/setup.py", line 120, in <module>
setup_dependencies()
File "/tmp/pip-req-build-r7tnomax/setup.py", line 105, in setup_dependencies
os.makedirs(backend_install_dir, exist_ok=True)
File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.tensornvme'
Running setup.py install for tensornvme ... error
Rolling back uninstall of tensornvme
Moving to /opt/conda/bin/tensornvme
from /tmp/pip-uninstall-0rqpd__w/tensornvme
Moving to /opt/conda/lib/python3.9/site-packages/tensornvme-0.1.0.dist-info/
from /opt/conda/lib/python3.9/site-packages/~ensornvme-0.1.0.dist-info
Moving to /opt/conda/lib/python3.9/site-packages/tensornvme/
from /opt/conda/lib/python3.9/site-packages/~ensornvme
ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-r7tnomax/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-q5nfdluf/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.9/tensornvme Check the logs for full command output.
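The observations above can be sketched as a small check. This is an illustration under stated assumptions, not the actual logic in TensorNVMe's `setup.py`: it assumes that `WITH_ROOT=1` selects `/usr`, that the default is `<HOME>/.tensornvme`, and that the shared libraries would end up in `<install_dir>/lib`. The function names are hypothetical.

```python
import posixpath


def resolve_install_dir(env: dict) -> str:
    """Mimic the install-dir choice described in this issue.

    Assumption: WITH_ROOT=1 selects /usr, otherwise <HOME>/.tensornvme.
    """
    if env.get("WITH_ROOT") == "1":
        return "/usr"
    return posixpath.join(env.get("HOME", "/"), ".tensornvme")


def lib_dir_on_loader_path(install_dir: str, ld_library_path: str) -> bool:
    """Check whether <install_dir>/lib is listed in LD_LIBRARY_PATH.

    Assumption: the backend libraries are installed under <install_dir>/lib.
    This only inspects LD_LIBRARY_PATH, not the loader's default directories.
    """
    lib_dir = posixpath.join(install_dir, "lib")
    entries = [p for p in ld_library_path.split(":") if p]
    return lib_dir in entries


if __name__ == "__main__":
    # Values taken from the container runtime reported in this issue.
    runtime_env = {"HOME": "/"}
    runtime_ld_path = "/usr/local/nvidia/lib:/usr/local/nvidia/lib64"

    install_dir = resolve_install_dir(runtime_env)
    print(install_dir)
    print(lib_dir_on_loader_path(install_dir, runtime_ld_path))
```

With `HOME=/` and no `WITH_ROOT`, the sketch resolves to `/.tensornvme`, the same path that appears in the `PermissionError`, and that directory's `lib` subfolder is not on the reported `LD_LIBRARY_PATH`.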
Environment
docker.io/hpcaitech/colossalai:0.2.7