DeepSpeed
[BUG] Installed CUDA version 12.1 does not match the version torch was compiled with 11.8
Hello, when I execute `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu`, it gets stuck at step 1 and raises an error:
```
raise Exception(f">- DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
Exception: >- DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8, unable to compile cuda/cpp extensions without a matching cuda version.
```
My CUDA is 12.1, and my torch was built against CUDA 11.8, installed with:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
I have also installed the CUDA toolkit for WSL as described here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
I am running this under WSL2 with Kali Linux. What should I do?
I think you should document a known-good environment: which WSL distribution (e.g. Ubuntu 20.04, Kali, or another), plus the CUDA version, torch version, and so on, along with the entire installation process. Or, if anyone has deployed it successfully, could you share the detailed steps?
You'll need a torch version that matches the CUDA version you have installed. The easiest options are to install CUDA 11.8 to match your torch-cuda version, or to pip install a torch build that matches the CUDA version you already have.
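As a sketch of the second option, assuming the system toolkit is CUDA 12.1 as in this report (the cu121 index URL is PyTorch's official wheel index; adjust it to your toolkit version):

```shell
# Check both sides of the mismatch first
nvcc --version                                        # system CUDA toolkit
python -c "import torch; print(torch.version.cuda)"   # CUDA torch was built with

# Install a torch build that matches the system toolkit (12.1 here)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```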
WSL works for me with cuda 11.8, so I'd recommend trying that if you can, or is there a reason you need cuda 12.1?
@ggyggy666 The correct solution to this problem is what @loadams shared, however I've found that CUDA 12.1 and torch compiled with 11.8 work fine together (⚠️however, we have not extensively tested this⚠️). You can disable this error by removing/commenting the Exception here: https://github.com/microsoft/DeepSpeed/blob/6fc8e33c12d89d641d55e7e7decbd29fb81b2ba2/op_builder/builder.py#L93
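For context, the check being disabled is essentially a version comparison between the installed toolkit and torch's build-time CUDA. A minimal illustrative sketch (the function name and exact rules here are simplified, not DeepSpeed's actual code):

```python
def cuda_versions_match(installed: str, torch_cuda: str) -> bool:
    """Rough sketch of the op-builder check: the CUDA major versions
    must agree, so 12.1 vs 11.8 fails because 12 != 11."""
    installed_major = int(installed.split(".")[0])
    torch_major = int(torch_cuda.split(".")[0])
    return installed_major == torch_major

print(cuda_versions_match("12.1", "11.8"))  # → False: the mismatch in this thread
print(cuda_versions_match("11.8", "11.8"))  # → True
```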
@ggyggy666 - were you able to test if that worked for you?
Thanks for your reply. I edited DeepSpeed/op_builder/builder.py and reinstalled, and it no longer reports this error. Unfortunately, another error appears: OOM, since my memory is not enough. So I have to give up testing it. I'm a student, after all.
@ggyggy666 you can try lowering the batch size in the bash script that train.py calls: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
Add --per_device_train_batch_size 1 --per_device_eval_batch_size 1 and try again. This will still require some amount of GPU memory, though. Can you share the GPU type you are trying to run on?
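As an illustrative sketch, assuming the script launches main.py via the deepspeed launcher, the change inside run_1.3b.sh amounts to appending the two flags to the existing training command (the surrounding arguments vary by version of the script, so they are elided here):

```
deepspeed main.py \
    ... existing arguments ... \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1
```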
@ggyggy666 - let us know if you get a chance to try that, or if you have other questions about getting it to run on your machine; we're happy to help with further suggestions, but we will close this issue for now. Re-open if needed.
DeepSpeed is used in a lot of projects, and you usually won't even see the source code in which you would need to remove the assertion.
By default you now get CUDA 12.1 from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local, so on new installs people will often have 12.1 present first, only to find that torch, DeepSpeed, and other dependencies have not caught up.
Maybe use a warning, not an exception.
In my case it even wanted 11.7 (in the enhuiz/vall-e repo).
@MSDNAndi we have added an override for this exception in #3436 - You can test it by setting the following environment variable: DS_SKIP_CUDA_CHECK=1
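For example, reusing the command from the original report, the override applies for just that invocation (note that mismatched toolchains may still fail later, when the extensions actually compile):

```shell
DS_SKIP_CUDA_CHECK=1 python train.py --actor-model facebook/opt-1.3b \
    --reward-model facebook/opt-350m --deployment-type single_gpu
```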
For reference: I have installed CUDA 12.5, since that is what currently gets installed from NVIDIA's repository, but the 535 driver for Linux only supports up to CUDA 12.2.
According to the documentation, CUDA on Linux should also be forward compatible (https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility), but when I override the version check I get errors:
```
[WARNING] DeepSpeed Op Builder: Installed CUDA version 12.5 does not match the version torch was compiled with 12.1. Detected DS_SKIP_CUDA_CHECK=1: Allowing this combination of CUDA, but it may result in unexpected behavior.
Using /home/ubuntuai/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[1/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
FAILED: cpu_adam_impl.o
c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
In file included from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/extension.h:9,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp:6:
/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/extension.h:9,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:12,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
```
Hi @RodriMora - that error looks to be unrelated to DeepSpeed, and if this is the full error:
```
/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
```
this would indicate that the right Python development libraries are not installed properly. Can you confirm that this is the full error you are seeing? And if so, can you try suggestions like this one?
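On Debian/Ubuntu-based systems (including WSL), Python.h comes from the CPython development headers. Assuming Python 3.10, as the paths in the log above suggest, a likely fix is:

```shell
# Install the CPython development headers that provide Python.h
sudo apt-get update
sudo apt-get install -y python3.10-dev
```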
Thanks a lot! You were right: the DeepSpeed message was just a warning, not the main error. Your suggestion fixed the problem.
Yes, this works for me. Thank god I saw your comments.