
[BUG] Installed CUDA version 12.1 does not match the version torch was compiled with 11.8

Open ggyggy666 opened this issue 2 years ago • 6 comments

Hello, when I execute "python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu", it stays at step 1 and I get an error:

raise Exception(f">- DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
Exception: >- DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8, unable to compile cuda/cpp extensions without a matching cuda version.

My CUDA is 12.1, and torch-cuda is 11.8, installed with: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

I have also installed the CUDA toolkit for WSL as follows: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local

I am using it in WSL2 with Kali Linux. What should I do?

ggyggy666 avatar Apr 14 '23 04:04 ggyggy666

I think you should document a known-working environment: the WSL Linux distribution (Ubuntu 20, Kali Linux, or another), the CUDA version, the torch version, and so on, as well as the entire installation process. Or, if anyone has deployed it successfully, could you share the detailed steps?

ggyggy666 avatar Apr 14 '23 04:04 ggyggy666

You'll need a torch version that matches the CUDA version you have installed. The easiest options are to install CUDA 11.8 to match your torch-cuda version, or to pip install a version of torch that is built with the CUDA version you have.

WSL works for me with CUDA 11.8, so I'd recommend trying that if you can. Or is there a reason you need CUDA 12.1?
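If it helps to confirm the mismatch first, here is a minimal check (assuming torch is importable and nvcc is on the PATH) that prints the two CUDA versions in play:

# Compare the CUDA version torch was built against with the toolkit
# version reported by the locally installed nvcc.
import subprocess
import torch

print("torch was compiled with CUDA:", torch.version.cuda)  # e.g. 11.8

try:
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    release = [line for line in out.stdout.splitlines() if "release" in line]
    print("installed CUDA toolkit:", release[0].strip() if release else "unknown")
except FileNotFoundError:
    print("nvcc not found on PATH - is the CUDA toolkit installed?")

If the two disagree, either install the matching toolkit or install a torch build for the toolkit you have (PyTorch publishes per-CUDA wheel indexes for pip).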

loadams avatar Apr 14 '23 15:04 loadams

@ggyggy666 The correct solution to this problem is what @loadams shared; that said, I've found that CUDA 12.1 and torch compiled with 11.8 work fine together (⚠️however, we have not extensively tested this⚠️). You can disable this error by removing/commenting out the Exception here: https://github.com/microsoft/DeepSpeed/blob/6fc8e33c12d89d641d55e7e7decbd29fb81b2ba2/op_builder/builder.py#L93
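For anyone unsure what to comment out: the guard is essentially a comparison between the system toolkit version and the CUDA version torch was built with. A simplified sketch (illustrative only, not DeepSpeed's exact code) of that kind of check:

# Simplified sketch of the version guard in op_builder/builder.py
# (illustrative only; the real check handles more cases).
import torch

def assert_no_cuda_mismatch(sys_cuda_version: str) -> None:
    torch_cuda_version = torch.version.cuda  # e.g. "11.8"
    if sys_cuda_version.split(".")[0] != torch_cuda_version.split(".")[0]:
        # Commenting out this raise is the workaround described above; the
        # extension build may still fail later if the versions truly clash.
        raise Exception(
            f">- DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} "
            f"does not match the version torch was compiled with {torch_cuda_version}, "
            "unable to compile cuda/cpp extensions without a matching cuda version."
        )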

mrwyattii avatar Apr 14 '23 17:04 mrwyattii

@ggyggy666 - were you able to test if that worked for you?

loadams avatar Apr 19 '23 15:04 loadams

@ggyggy666 - were you able to test if that worked for you?

Thanks for your reply. I revised the file DeepSpeed/op_builder/builder.py and installed it again, and it no longer reports this error. But unluckily another error appears: OOM, my memory is not enough, so I have to give up testing it. I'm a student, after all.

ggyggy666 avatar Apr 19 '23 15:04 ggyggy666

@ggyggy666 you can try lowering the batch size in the bash script that train.py calls: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh

Add --per_device_train_batch_size 1 --per_device_eval_batch_size 1 and try again. But this will still require having some amount of GPU memory. Can you share the GPU type you are trying to run on?

mrwyattii avatar Apr 20 '23 21:04 mrwyattii

@ggyggy666 - let us know if you have a chance to try that or have other questions about getting it to run on your machine; we're happy to help with other suggestions, but we will close this issue for now. Re-open if needed.

loadams avatar Apr 27 '23 15:04 loadams

DeepSpeed is used in a lot of projects, so you usually would not even see the source code in which you'd have to remove the assertion.

By default you now get CUDA 12.1 from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local, so on fresh installs folks will often have 12.1 present first, only to find that torch, DeepSpeed, and other dependencies have not caught up.

Maybe use a warning, not an exception.
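Something like this would be enough (a hypothetical sketch, not what DeepSpeed actually ships):

# Hypothetical warning-only variant of the CUDA version check.
import warnings
import torch

def warn_on_cuda_mismatch(sys_cuda_version: str) -> None:
    torch_cuda_version = torch.version.cuda  # e.g. "11.8"
    if sys_cuda_version.split(".")[0] != torch_cuda_version.split(".")[0]:
        warnings.warn(
            f"Installed CUDA version {sys_cuda_version} does not match the version "
            f"torch was compiled with ({torch_cuda_version}); continuing anyway."
        )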

In my case it even wanted 11.7 (enhuiz/vall-e repo).

MSDNAndi avatar Apr 30 '23 02:04 MSDNAndi

@MSDNAndi we have added an override for this exception in #3436. You can test it by setting the following environment variable: DS_SKIP_CUDA_CHECK=1
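If it's easier to set from Python than from the shell, a minimal sketch (assuming the variable is set before DeepSpeed JIT-builds any ops) is:

# Set the override before DeepSpeed's op builder runs its CUDA version check.
import os
os.environ["DS_SKIP_CUDA_CHECK"] = "1"

import deepspeed  # the mismatch is now reported as a warning instead of raising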

mrwyattii avatar May 03 '23 22:05 mrwyattii

For reference: I have installed CUDA 12.5, as that is what currently installs from NVIDIA's repository, but driver 535 for Linux only supports up to CUDA version 12.2:

[screenshot]

But according to the documentation, these CUDA versions on Linux should be forward compatible too: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility. However, when I override the version check I get errors:

[WARNING] DeepSpeed Op Builder: Installed CUDA version 12.5 does not match the version torch was compiled with 12.1. Detected DS_SKIP_CUDA_CHECK=1: Allowing this combination of CUDA, but it may result in unexpected behavior.
Using /home/ubuntuai/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[1/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
FAILED: cpu_adam_impl.o
c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
In file included from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/extension.h:9,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp:6:
/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
[2/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-12.5/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda-12.5/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/extension.h:9,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:12,
                 from /home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.

RodriMora avatar Jun 26 '24 09:06 RodriMora

Hi @RodriMora - that error looks to be unrelated to DeepSpeed, and if this is the full error:

/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory 12 | #include <Python.h> |          ^~~~~~~~~~ compilation terminated.

this would indicate an issue with not having the right python libs installed properly. Can you confirm that this is the full error that you are seeing? And if so, can you try suggestions like this one?
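A quick way to check whether the CPython development headers are present at all (an illustrative sketch; on Debian/Ubuntu-based distros the usual fix is installing the matching python3.x-dev package):

# Check whether Python.h exists in the include directory this interpreter reports.
import os
import sysconfig

include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")
print(include_dir)
print("Python.h found" if os.path.exists(header) else "Python.h missing - install the dev headers")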

loadams avatar Jun 26 '24 14:06 loadams

Hi @RodriMora - that error looks to be unrelated to DeepSpeed, and if this is the full error:

/home/ubuntuai/axolotl/.venv/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory 12 | #include <Python.h> |          ^~~~~~~~~~ compilation terminated.

this would indicate an issue with not having the right python libs installed properly. Can you confirm that this is the full error that you are seeing? And if so, can you try suggestions like this one?

Thanks a lot! You were right, the DeepSpeed message was just a warning and not the main error. Your suggestion fixed the problem.

RodriMora avatar Jun 26 '24 14:06 RodriMora

@ggyggy666 The correct solution to this problem is what @loadams shared; that said, I've found that CUDA 12.1 and torch compiled with 11.8 work fine together (⚠️however, we have not extensively tested this⚠️). You can disable this error by removing/commenting out the Exception here:

https://github.com/microsoft/DeepSpeed/blob/6fc8e33c12d89d641d55e7e7decbd29fb81b2ba2/op_builder/builder.py#L93

Yes, this works for me. Thank god I saw your comments.

minhkids avatar Jun 28 '24 07:06 minhkids