DeepSpeed
[BUG] Failure to compile DeepSpeed 0.9.0 with CUDA 11.7 and PyTorch 1.13.1 in Docker.
Describe the bug
I am building a Docker image via GitHub Actions and installed PyTorch 1.13.1 with CUDA 11.7. When I try to install DeepSpeed 0.9.0 with DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed --global-option="build_ext" --global-option="-j8", it fails with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
Since it only returns the nvcc error with no further information, I have no idea how to fix it.
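A rough sketch of how the same install could be re-run to surface the underlying compiler message (pip's verbose flag plus a single build job, so the real nvcc error is not hidden or interleaved; paths match the Dockerfile below):
# Hedged debugging variant: verbose output, one build job, full log captured to a file
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" \
  /opt/conda/envs/dev/bin/pip install -v --no-cache-dir deepspeed \
  --global-option="build_ext" --global-option="-j1" 2>&1 | tee /tmp/ds_build.log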
To Reproduce
Steps to reproduce the behavior:
- Build this Dockerfile https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile
- See error
Expected behavior
It would fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
ds_report output
I cannot run ds_report because the compilation fails.
Screenshots
I am building the Docker image via GitHub Actions; the build log is available at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4700207656/jobs/8334587489
System info (please complete the following information):
I am building on the GitHub Actions platform with the ubuntu-latest environment; the detailed workflow.yml can be found at https://github.com/chenyaofo/docker-image-open-builder/blob/main/.github/workflows/build.yml
Launcher context N/A
Docker context N/A
Additional context N/A
I was able to repro this using your Dockerfile, but I do see this error in the logs:
#8 439.8       g++ -pthread -B /opt/conda/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/envs/dev/lib -Wl,-rpath-link,/opt/conda/envs/dev/lib -L/opt/conda/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/opt/conda/envs/dev/lib -Wl,-rpath-link,/opt/conda/envs/dev/lib -L/opt/conda/envs/dev/lib build/temp.linux-x86_64-cpython-310/csrc/utils/flatten_unflatten.o -L/opt/conda/envs/dev/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/deepspeed/ops/utils_op.cpython-310-x86_64-linux-gnu.so
#8 439.8       /opt/conda/envs/dev/compiler_compat/ld: build/temp.linux-x86_64-cpython-310/csrc/utils/flatten_unflatten.o: relocation R_X86_64_TPOFF32 against hidden symbol `_ZZN8pybind116handle15inc_ref_counterEmE7counter' can not be used when making a shared object
#8 439.8       /opt/conda/envs/dev/compiler_compat/ld: failed to set dynamic section sizes: bad value
#8 439.8       collect2: error: ld returned 1 exit status
Is there a reason that you need to build the ops instead of JIT? We will want to understand what's causing this, but that might unblock you in the meantime. Or do you see the same error when using JIT?
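For reference, a minimal way to exercise the JIT path for a single op (a sketch, assuming the op_builder helpers are importable in your installed version) is:
# Triggers the JIT build of fused_adam the first time it is loaded
python -c "from deepspeed.ops.op_builder import FusedAdamBuilder; FusedAdamBuilder().load()"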
Also we do not have compatibility with triton 2.0.0 yet, so you may want to try building with pip install triton==1.0.0 to see if that resolves any issues as well.
I am building a Docker image for DeepSpeed-Chat; if I run with JIT, everything works without errors.
I am just curious why I cannot pre-compile DeepSpeed in Docker.
I also tried building with pip install triton==1.0.0; the Dockerfile is at https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile-with-triton1.0.0 and the build logs are at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4706335613/jobs/8347496571 .
But the error seems to be the same: error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
Yes, it should be fine to just use the JIT ops; pre-compiling isn't necessary. I'm not sure yet why it errors out, but I wanted to unblock you in the meantime while I continue to look.
This should be resolved now; could you try with the latest master branch?
Following your suggestion, I tried to build DeepSpeed from the latest master branch; the Dockerfile is at https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile-with-triton1.0.0-from-source . It still seems to fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1; the GitHub Actions build is at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4745398839/jobs/8427660717
I found that the build fails when using DS_BUILD_OPS=1, so I built the ops one by one. With DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 DS_BUILD_SPARSE_ATTN=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 everything builds successfully, but as soon as I add any of the *_TRANSFORMER ops (DS_BUILD_TRANSFORMER, DS_BUILD_TRANSFORMER_INFERENCE, DS_BUILD_STOCHASTIC_TRANSFORMER, etc.) the build fails, and I cannot tell why.
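A rough sketch of the per-op build that succeeds (the command line is reconstructed from the flags above; only the *_TRANSFORMER flags are left unset):
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 \
DS_BUILD_SPARSE_ATTN=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 \
pip install deepspeed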
Also, running single_gpu works fine, but running single_node fails with the following error:
[2023-05-02 10:11:48,598] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2+b0d9c4d0, git-hash=b0d9c4d0, git-branch=master
[2023-05-02 10:11:48,599] [INFO] [comm.py:616:init_distributed] Distributed backend already initialized
[2023-05-02 10:11:49,907] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2465
[2023-05-02 10:11:50,162] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2466
[2023-05-02 10:11:50,163] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--gradient_checkpointing', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = -7
NVIDIA-SMI 530.41.03    Driver Version: 531.41    CUDA Version: 12.1
GPUs: NVIDIA GeForce RTX 3070, NVIDIA GeForce RTX 3060
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.2+b0d9c4d0, b0d9c4d0, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
Following https://github.com/microsoft/DeepSpeed/issues/2632, I solved the problem where single_gpu runs fine but single_node fails. The cause is that when running inside a container, shm defaults to 64M, so the multi-process run exhausts shared memory. Adding --shm-size="16g" when running the container fixed it.
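For example (a sketch; the image name is a placeholder and the NVIDIA container runtime is assumed):
# A larger /dev/shm avoids the multi-process shared-memory exhaustion
docker run --gpus all --shm-size="16g" -it <your-deepspeed-image> bash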
@l241025097 - this looks like a different issue than @chenyaofo's above. Could you open a new ticket for it? Your build gets far enough to at least run ds_report, although it is not installing all ops, but that's a separate problem.
@chenyaofo - do you have cuda-toolkit installed? Also, since you're building on a node that isn't the one you're running on, the compute capabilities of the build node are used for the build, because we do not support cross-compilation.
Would it be possible for you to try and build your docker image on the machine with the GPU you're using?
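If you do build on the GPU machine, a quick way to confirm the compute capability to list in TORCH_CUDA_ARCH_LIST (a sketch, assuming torch is already installed there):
# Prints e.g. (8, 6) for an RTX 3070/3060
python -c "import torch; print(torch.cuda.get_device_capability(0))"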
@chenyaofo - I tried making my own dockerfile to test this, and I'm able to get the below working. I'm not familiar with the needs of your system, but I believe something in the conda setup isn't working properly with the build, since I'm also able to modify yours to work.
To modify yours to work, I've replaced the build line with this:
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed
This slows the build down slightly, but does compile properly.
Successful output is here:
[+] Building 982.1s (10/10) FINISHED
 => [internal] load .dockerignore                                                                                  0.0s
 => => transferring context: 2B                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                               0.1s
 => => transferring dockerfile: 1.68kB                                                                             0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                    0.0s
 => [1/6] FROM docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                                      0.0s
 => CACHED [2/6] RUN APT_INSTALL="apt-get install -y --no-install-recommends --no-install-suggests" &&     GIT_CL  0.0s
 => CACHED [3/6] RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_  0.0s
 => CACHED [4/6] RUN /opt/conda/bin/mamba create -n dev python=3.10 &&     CONDA_INSTALL="/opt/conda/bin/mamba in  0.0s
 => CACHED [5/6] RUN PIP_INSTALL="/opt/conda/envs/dev/bin/pip install --no-cache-dir" &&     $PIP_INSTALL torch==  0.0s
 => [6/6] RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed
A sample Dockerfile that works fine:
FROM nvidia/cuda:11.7.1-devel-ubuntu20.04
ARG DEBIAN_FRONTEND=noninteractive
SHELL [ "/bin/bash","-c" ]
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,video,utility
RUN apt update -y \
&& apt upgrade -y
RUN apt install wget -y \
&& apt install git -y \ 
&& apt install libaio-dev -y \
&& apt install libaio1 -y 
RUN apt install python3.9 -y \
&& apt install python3-pip -y \
&& apt install python-is-python3 -y
RUN pip install --upgrade pip setuptools wheel
RUN pip install ninja
RUN pip install torch torchvision torchaudio
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install git+https://github.com/microsoft/[email protected]
CMD ["bash"]
I encountered the same issue, and I built a new docker image to solve it. You can use it as follows:
docker pull jockeyyan/deepspeed:torch113_cuda117v2.0
Thanks for your solution. After I simply changed
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir git+https://github.com/microsoft/DeepSpeed@master --global-option="build_ext" --global-option="-j8"
to
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed
it works well. Maybe something is wrong with conda.
For those coming to this issue while trying to build a wheel (the install works, but it does not produce a wheel): instead of setup.py, using Python's build module seems to make it work, with CMAKE_POSITION_INDEPENDENT_CODE=ON and NVCC_PREPEND_FLAGS="--forward-unknown-opts" to get past the gcc -fPIC problems and nvcc's unknown-option errors. Note that the last supported architecture depends on the CUDA version in the development environment, e.g. CUDA 11.7 -> last supported = 8.7, CUDA 11.8 -> 8.9 (Lovelace), CUDA 12 -> 9.0 (Hopper). The commands are shown below.
DEEPSPEED_VERSION=v0.9.1
git clone https://github.com/microsoft/DeepSpeed.git \
     && cd DeepSpeed \
     && git checkout ${DEEPSPEED_VERSION} \
     && pip install --upgrade "pydantic<2.0.0" \
     && pip install build==0.10.0 \
     && CMAKE_POSITION_INDEPENDENT_CODE=ON NVCC_PREPEND_FLAGS="--forward-unknown-opts" CUDA_PATH=/usr/local/cuda TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 python -m build --wheel --no-isolation
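The resulting wheel ends up under dist/ and can then be installed in the runtime image, for example:
pip install dist/deepspeed-*.whl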