OpenDiT
Google Colab environment setup hits CUDA extension version mismatch
Thank you so much for the great work!!!
I'm trying to set up the environment in Google Colab for training, but I hit a CUDA extension version mismatch. As far as I can tell, my Python/PyTorch/CUDA versions match the requirements. Does anyone happen to know why? Much appreciated! The command I ran:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
It failed with the error below:
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
My environment: Python 3.10, PyTorch 2.1.0, CUDA 12.1
Full log:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
DEPRECATION: --build-option and --global-option are deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use --config-settings. Discussion can be found at https://github.com/pypa/pip/issues/11859
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option.
Processing /content/ColossalAI/OpenDiT/apex
Running command Preparing metadata (pyproject.toml)
torch.version = 2.1.0+cu121
running dist_info
creating /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info
writing /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/dependency_links.txt
writing requirements to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/requires.txt
writing top-level names to /tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/top_level.txt
writing manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-modern-metadata-6nsd1o2v/apex.egg-info/SOURCES.txt'
creating '/tmp/pip-modern-metadata-6nsd1o2v/apex-0.1.dist-info'
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: packaging>20.6 in /usr/local/lib/python3.10/dist-packages (from apex==0.1) (23.2)
Building wheels for collected packages: apex
WARNING: Ignoring --global-option when building apex using PEP 517
Running command Building wheel for apex (pyproject.toml)
torch.version = 2.1.0+cu121
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
from /usr/local/cuda/bin
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in
× Building wheel for apex (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmp6d28a180
cwd: /content/ColossalAI/OpenDiT/apex
Building wheel for apex (pyproject.toml) ... error
ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects
Hi, thanks for supporting our work!
It seems that your CUDA version does not match what apex expects. Are you using a virtual Python environment? If not, check the system CUDA version to see whether it meets apex's requirements. You can also try installing apex without checking out commit 741bdf50825a97664db08574981962d66436d16a, by directly running:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
The apex repo also has more detailed installation instructions.
Feel free to ask if you have further questions!
It seems that the CUDA version PyTorch was built with does not match your system CUDA version: the log shows torch 2.1.0+cu121, but nvcc reports release 12.2. The easy fix is to install a PyTorch build that aligns with your system CUDA version.
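If it helps, here is a minimal Python sketch for checking this kind of mismatch before starting the build. In the log above, PyTorch reports 2.1.0+cu121 while nvcc reports release 12.2. The helper names `parse_nvcc_release` and `versions_match` are made up for illustration and are not part of apex or PyTorch. The sketch only assumes that `nvcc --version` prints a "release X.Y" line, as it does in the log, and that `torch.version.cuda` holds the CUDA version of the PyTorch build.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' CUDA release found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_match(torch_cuda, nvcc_release):
    """True only when both versions are known and agree on major.minor."""
    if not torch_cuda or not nvcc_release:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]


if __name__ == "__main__":
    # CUDA version the installed PyTorch binaries were built with (None if torch is missing).
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None

    # CUDA toolkit version of whichever nvcc is first on PATH (None if nvcc is missing).
    nvcc_release = None
    nvcc_path = shutil.which("nvcc")
    if nvcc_path:
        result = subprocess.run([nvcc_path, "--version"], capture_output=True, text=True)
        nvcc_release = parse_nvcc_release(result.stdout)

    print(f"PyTorch built with CUDA: {torch_cuda}")
    print(f"nvcc release:            {nvcc_release}")
    print("match" if versions_match(torch_cuda, nvcc_release) else "MISMATCH or unknown")
```

If the two versions differ, either install a PyTorch build for the toolkit that nvcc reports, or point the build at a toolkit that matches the PyTorch build.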