DeepSpeed
[BUG] Failure to compile DeepSpeed 0.9.0 with CUDA 11.7 and PyTorch 1.13.1 in Docker.
Describe the bug
I am building a Docker image via GitHub Actions and installed PyTorch 1.13.1 with CUDA 11.7. When I try to install DeepSpeed 0.9.0 with DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed --global-option="build_ext" --global-option="-j8", it fails with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
Since it only returns the nvcc error with no further information, I have no idea how to fix it.
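A rough sketch of how the same install could be re-run to surface the underlying compiler message (pip's verbose flag plus a single build job, so the real nvcc error is not hidden or interleaved; paths match the Dockerfile below):
# Hedged debugging variant: verbose output, one build job, full log captured to a file
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" \
  /opt/conda/envs/dev/bin/pip install -v --no-cache-dir deepspeed \
  --global-option="build_ext" --global-option="-j1" 2>&1 | tee /tmp/ds_build.log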
To Reproduce
Steps to reproduce the behavior:
- Build this Dockerfile https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile
- See error
Expected behavior
It would fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
ds_report output
I cannot run ds_report because the compilation fails.
Screenshots
I am building the Docker image via GitHub Actions; the build log is available at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4700207656/jobs/8334587489
System info (please complete the following information):
I am building on the GitHub Actions platform with the ubuntu-latest environment; the detailed workflow.yml can be found at https://github.com/chenyaofo/docker-image-open-builder/blob/main/.github/workflows/build.yml
Launcher context N/A
Docker context N/A
Additional context N/A
I was able to repro this using your Dockerfile, but I do see this error in the logs:
#8 439.8       g++ -pthread -B /opt/conda/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/envs/dev/lib -Wl,-rpath-link,/opt/conda/envs/dev/lib -L/opt/conda/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/envs/dev/lib -Wl,-rpath-link,/opt/conda/envs/dev/lib -L/opt/conda/envs/dev/lib build/temp.linux-x86_64-cpython-310/csrc/utils/flatten_unflatten.o -L/opt/conda/envs/dev/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/deepspeed/ops/utils_op.cpython-310-x86_64-linux-gnu.so
#8 439.8       /opt/conda/envs/dev/compiler_compat/ld: build/temp.linux-x86_64-cpython-310/csrc/utils/flatten_unflatten.o: relocation R_X86_64_TPOFF32 against hidden symbol `_ZZN8pybind116handle15inc_ref_counterEmE7counter' can not be used when making a shared object
#8 439.8       /opt/conda/envs/dev/compiler_compat/ld: failed to set dynamic section sizes: bad value
#8 439.8       collect2: error: ld returned 1 exit status
Is there a reason that you need to build the ops instead of JIT? We will want to understand what's causing this, but that might unblock you in the meantime. Or do you see the same error when using JIT?
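For reference, a minimal way to exercise the JIT path for a single op (a sketch, assuming the op_builder helpers are importable in your installed version) is:
# Triggers the JIT build of fused_adam the first time it is loaded
python -c "from deepspeed.ops.op_builder import FusedAdamBuilder; FusedAdamBuilder().load()"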
Also we do not have compatibility with triton 2.0.0 yet, so you may want to try building with pip install triton==1.0.0 to see if that resolves any issues as well.
I am building a Docker image for DeepSpeed-Chat; if I run with JIT, everything works without errors.
I am just curious why I cannot pre-compile DeepSpeed in Docker.
I also tried building with pip install triton==1.0.0; the Dockerfile is at https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile-with-triton1.0.0 and the build logs are at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4706335613/jobs/8347496571 .
But the error seems to be the same: error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
Yes, it should be fine to just use the JIT ops; pre-compiling isn't necessary. I'm not sure yet why it errors out, but I wanted to unblock you in the meantime while I continue to look.
This should be resolved now; could you try with the latest master branch?
Following your suggestion, I tried to build DeepSpeed from the latest master branch; the Dockerfile is at https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile-with-triton1.0.0-from-source . It still seems to fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1; the GitHub Actions build is at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4745398839/jobs/8427660717
I found that the build fails when using DS_BUILD_OPS=1, so I built the ops one by one. With DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 DS_BUILD_SPARSE_ATTN=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 everything builds successfully, but as soon as I add any of the *_TRANSFORMER ops (DS_BUILD_TRANSFORMER, DS_BUILD_TRANSFORMER_INFERENCE, DS_BUILD_STOCHASTIC_TRANSFORMER, etc.) the build fails, and I cannot tell why.
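A rough sketch of the per-op build that succeeds (the command line is reconstructed from the flags above; only the *_TRANSFORMER flags are left unset):
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 \
DS_BUILD_SPARSE_ATTN=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 \
pip install deepspeed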
Also, running single_gpu works fine, but running single_node fails with the following error:
[2023-05-02 10:11:48,598] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2+b0d9c4d0, git-hash=b0d9c4d0, git-branch=master
[2023-05-02 10:11:48,599] [INFO] [comm.py:616:init_distributed] Distributed backend already initialized
[2023-05-02 10:11:49,907] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2465
[2023-05-02 10:11:50,162] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2466
[2023-05-02 10:11:50,163] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--gradient_checkpointing', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = -7
NVIDIA-SMI 530.41.03    Driver Version: 531.41    CUDA Version: 12.1
GPUs: NVIDIA GeForce RTX 3070, NVIDIA GeForce RTX 3060
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.2+b0d9c4d0, b0d9c4d0, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
Following https://github.com/microsoft/DeepSpeed/issues/2632, I solved the problem where single_gpu runs fine but single_node fails. The cause is that when running inside a container, shm defaults to 64M, so the multi-process run exhausts shared memory. Adding --shm-size="16g" when running the container fixed it.
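For example (a sketch; the image name is a placeholder and the NVIDIA container runtime is assumed):
# A larger /dev/shm avoids the multi-process shared-memory exhaustion
docker run --gpus all --shm-size="16g" -it <your-deepspeed-image> bash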
@l241025097 - this looks like a different issue than @chenyaofo's above. Could you open a new ticket for it? Your build gets far enough to at least run ds_report, although it is not installing all ops, but that's a separate problem.
@chenyaofo - do you have cuda-toolkit installed? Also, since you're building on a node that isn't the one you're running on, the compute capabilities of the build node are used for the build, because we do not support cross-compilation.
Would it be possible for you to try and build your docker image on the machine with the GPU you're using?
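If you do build on the GPU machine, a quick way to confirm the compute capability to list in TORCH_CUDA_ARCH_LIST (a sketch, assuming torch is already installed there):
# Prints e.g. (8, 6) for an RTX 3070/3060
python -c "import torch; print(torch.cuda.get_device_capability(0))"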
@chenyaofo - I tried making my own dockerfile to test this, and I'm able to get the below working. I'm not familiar with the needs of your system, but I believe something in the conda setup isn't working properly with the build, since I'm also able to modify yours to work.
To modify yours to work, I've replaced the build line with this:
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed
This slows the build down slightly, but does compile properly.
Successful output is here:
[+] Building 982.1s (10/10) FINISHED
 => [internal] load .dockerignore                                                                                  0.0s
 => => transferring context: 2B                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                               0.1s
 => => transferring dockerfile: 1.68kB                                                                             0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                    0.0s
 => [1/6] FROM docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                                      0.0s
 => CACHED [2/6] RUN APT_INSTALL="apt-get install -y --no-install-recommends --no-install-suggests" &&     GIT_CL  0.0s
 => CACHED [3/6] RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_  0.0s
 => CACHED [4/6] RUN /opt/conda/bin/mamba create -n dev python=3.10 &&     CONDA_INSTALL="/opt/conda/bin/mamba in  0.0s
 => CACHED [5/6] RUN PIP_INSTALL="/opt/conda/envs/dev/bin/pip install --no-cache-dir" &&     $PIP_INSTALL torch==  0.0s
 => [6/6] RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed
A sample Dockerfile that works fine:
FROM nvidia/cuda:11.7.1-devel-ubuntu20.04
ARG DEBIAN_FRONTEND=noninteractive
SHELL [ "/bin/bash","-c" ]
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,video,utility
RUN apt update -y \
&& apt upgrade -y
RUN apt install wget -y \
&& apt install git -y \ 
&& apt install libaio-dev -y \
&& apt install libaio1 -y 
RUN apt install python3.9 -y \
&& apt install python3-pip -y \
&& apt install python-is-python3 -y
RUN pip install --upgrade pip setuptools wheel
RUN pip install ninja
RUN pip install torch torchvision torchaudio
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install git+https://github.com/microsoft/[email protected]
CMD ["bash"]
I encountered the same issue, and I built a new docker image to solve it. You can use it as follows:
docker pull jockeyyan/deepspeed:torch113_cuda117v2.0
Thanks for your solution. After I simply changed
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir git+https://github.com/microsoft/DeepSpeed@master --global-option="build_ext" --global-option="-j8"
to
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed
it works well. Maybe something is wrong with conda.
For those coming to this issue while trying to build a wheel (the install works, but it does not produce a wheel): instead of setup.py, using Python's build module seems to make it work, with CMAKE_POSITION_INDEPENDENT_CODE=ON and NVCC_PREPEND_FLAGS="--forward-unknown-opts" to get past the gcc -fPIC problems and nvcc's unknown-option errors. Note that the last supported architecture depends on the CUDA version in the development environment, e.g. CUDA 11.7 -> last supported = 8.7, CUDA 11.8 -> 8.9 (Lovelace), CUDA 12 -> 9.0 (Hopper). The commands are shown below.
DEEPSPEED_VERSION=v0.9.1
git clone https://github.com/microsoft/DeepSpeed.git \
     && cd DeepSpeed \
     && git checkout ${DEEPSPEED_VERSION} \
     && pip install --upgrade "pydantic<2.0.0" \
     && pip install build==0.10.0 \
     && CMAKE_POSITION_INDEPENDENT_CODE=ON NVCC_PREPEND_FLAGS="--forward-unknown-opts" CUDA_PATH=/usr/local/cuda TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 python -m build --wheel --no-isolation
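The resulting wheel ends up under dist/ and can then be installed in the runtime image, for example:
pip install dist/deepspeed-*.whl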