
[BUG]nvcc fatal when DS_BUILD_TRANSFORMER_INFERENCE=1

Open kuangdao opened this issue 1 year ago • 6 comments

I use PyTorch 1.12 with CUDA 11.6, with the following DeepSpeed build configuration:

DS_BUILD_FUSED_ADAM=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 DS_BUILD_TRANSFORMER=0 DS_BUILD_STOCHASTIC_TRANSFORMER=0 DS_BUILD_TRANSFORMER_INFERENCE=1 DS_BUILD_OPS=0

and install with pip3 install deepspeed --global-option="build_ext" --global-option="-j8"

The final error is: error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1

nvcc fatal : Unknown option '-Wno-reorder'

kuangdao avatar Apr 23 '23 09:04 kuangdao

What version of pip do you have installed? pip --version

mrwyattii avatar Apr 24 '23 17:04 mrwyattii

DeepSpeed 0.9.1 fails with this error; 0.9.0 works.

kuangdao avatar Apr 25 '23 02:04 kuangdao

@kuangdao you get this bug with deepspeed 0.9.1 but not with 0.9.0?

mrwyattii avatar Apr 27 '23 00:04 mrwyattii

I solved this by simply removing the additional options, i.e. DS_BUILD_OPS=1 pip install deepspeed

gfzum avatar May 06 '23 02:05 gfzum

That doesn't solve the issue, since you no longer build the extensions.

jockeyyan avatar May 06 '23 07:05 jockeyyan

I got the same error.

torch                         1.13.1+cu117
torchaudio                    0.13.1+cu117
torchvision                   0.14.1+cu117

deepspeed==0.9.2 cuda==11.7

nvcc fatal : Unknown option '-Wno-reorder'
nvcc fatal : Unknown option '-Wall'

hipudding avatar Jun 12 '23 07:06 hipudding

Same error with the latest Nvidia pytorch Docker image. It happens to me with both 0.9.0 and 0.9.4 (and presumably every version in between).

Nvidia driver version: 525.116.03. Using nvidia-container-runtime. RTX 6000 Ada GPU.

Minimal repro:

FROM nvcr.io/nvidia/pytorch:23.05-py3


RUN pip install --upgrade pip
RUN apt-get update
RUN apt-get install -y libaio-dev
ENV CUDA_HOME='/usr/local/cuda'
RUN pip install py-cpuinfo
# NOTE: do this: https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime
RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed==0.9.4 --global-option="build_ext" --global-option="-j32"

This seems to be a longstanding bug in pytorch, with many duplicate issues (https://github.com/pytorch/vision/issues/2001, https://github.com/pytorch/pytorch/issues/36378, https://github.com/pytorch/pytorch/issues/31283).

It seems to come from https://github.com/pytorch/pytorch/blob/15eed5b73ef1ebe0d1142d70bab7c20300a2aa2c/cmake/public/utils.cmake#L435

It's not clear to me why this bug doesn't affect everyone, which makes me think there is probably a workaround out there somewhere.

crclark avatar Jun 18 '23 17:06 crclark

Setting NVCC_PREPEND_FLAGS="--forward-unknown-opts" appears to be a workaround for this issue.

crclark avatar Jun 18 '23 18:06 crclark
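
For anyone landing here, the workaround in the last comment could be applied roughly as follows. This is a minimal sketch, not a confirmed fix: it assumes the installed nvcc reads the NVCC_PREPEND_FLAGS environment variable and accepts --forward-unknown-opts, which passes options nvcc does not recognize (such as -Wno-reorder and -Wall) through to the host compiler instead of aborting. The install command in the comment is the one from the repro above; substitute your own DS_BUILD_* settings.

```shell
# Prepend the flag for every nvcc invocation started from this shell,
# keeping any value that NVCC_PREPEND_FLAGS already had.
export NVCC_PREPEND_FLAGS="--forward-unknown-opts${NVCC_PREPEND_FLAGS:+ $NVCC_PREPEND_FLAGS}"

# Show the value the build will inherit.
echo "NVCC_PREPEND_FLAGS=$NVCC_PREPEND_FLAGS"

# Then rerun the failing install in the same shell, for example:
# DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed==0.9.4 --global-option="build_ext" --global-option="-j32"
```

In the Dockerfile repro, the equivalent would presumably be an ENV NVCC_PREPEND_FLAGS="--forward-unknown-opts" line placed before the RUN step that installs deepspeed.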