
[BUG] Can't compile DeepSpeed version 0.8.1+ with Cuda 11.7

Open mallorbc opened this issue 1 year ago • 16 comments

Describe the bug When I try to compile DeepSpeed from the 0.8.1 tag using docker and Cuda 11.7, the compilation fails. The docker image tries to compile DeepSpeed with the following command that has worked in the past:

DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1

To Reproduce Steps to reproduce the behavior:

  1. Use a Linux machine
  2. Use the Dockerfile here
  3. Change line 30 of the file to use tag 0.8.1 rather than 0.8.0
  4. Try building the docker image (see the sketch after this list).
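
For concreteness, a rough sketch of the reproduction; the image name is a placeholder, and the v0.8.1 tag format is an assumption based on DeepSpeed's release tags:

# Build the linked Dockerfile after bumping the DeepSpeed tag on line 30 from 0.8.0 to 0.8.1.
docker build -t deepspeed-081-repro .

# The step that fails inside the build is the prebuilt-ops install:
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1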

Expected behavior Though compilation takes over 10 minutes, it has always worked in the past with previous tags. I expect compilation to finish successfully rather than fail.

ds_report output Because the compilation fails, I cannot provide this.

Screenshots [screenshot of the compilation failure]

After this, I get terminal output that suggests it's still trying to build, but then I get another error and the image fails to build.

[screenshot of the subsequent error]

System info (please complete the following information):

  • Ubuntu 20.04
  • Two RTX 3090s
  • Using a docker image linked elsewhere in the issue

Docker context Shared Dockerfile elsewhere

Additional context I have seen similar issues with regard to Windows.

mallorbc avatar Feb 28 '23 21:02 mallorbc

Hi @mallorbc

Thanks for reporting this issue. I will try to see if I can repro this on my end. Thanks, Reza

RezaYazdaniAminabadi avatar Mar 01 '23 06:03 RezaYazdaniAminabadi

Also, can I ask which PyTorch version you are using here?

RezaYazdaniAminabadi avatar Mar 01 '23 06:03 RezaYazdaniAminabadi

@RezaYazdaniAminabadi Thanks for looking into this. Whatever the cause, I have seen many similar reports, so there seems to be either a new requirement for a successful install or an issue with 0.8.1 itself.

You can see my setup in the Dockerfile I link in the issue. To install PyTorch, I am running the following command:

pip install torch torchvision torchaudio

According to the PyTorch getting-started page, this installs 1.13.1 with CUDA 11.7. Looking at my installed packages with 0.8.0, I can confirm that is indeed the version installed.
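
For reference, an equivalent pinned install, with the versions taken from the pip list below (the stock PyPI wheels for torch 1.13.1 are built against CUDA 11.7):

pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1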

See below for my pip list and ds_report (for 0.8.0, though).


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0+bf6b9802, bf6b9802, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Package Version


accelerate 0.15.0
aiohttp 3.8.3
aiosignal 1.3.1
appdirs 1.4.4
async-timeout 4.0.2
asyncio 3.4.3
attrs 22.2.0
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.3
datasets 2.8.0
deepspeed 0.8.0+bf6b9802
deepspeed-mii 0.0.5+6116e98
dill 0.3.6
docker-pycreds 0.4.0
filelock 3.9.0
frozenlist 1.3.3
fsspec 2022.11.0
gitdb 4.0.10
GitPython 3.1.31
grpcio 1.51.3
grpcio-tools 1.51.3
hjson 3.1.0
huggingface-hub 0.11.1
idna 3.4
multidict 6.0.4
multiprocess 0.70.14
ninja 1.11.1
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
packaging 23.0
pandas 1.5.2
pathtools 0.1.2
Pillow 9.4.0
pip 20.0.2
protobuf 4.22.0
psutil 5.9.4
py-cpuinfo 9.0.0
pyarrow 10.0.1
pydantic 1.10.5
python-dateutil 2.8.2
pytz 2022.7
PyYAML 6.0
regex 2022.10.31
requests 2.28.1
responses 0.18.0
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 45.2.0
six 1.16.0
smmap 5.0.0
tokenizers 0.13.2
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tqdm 4.64.1
transformers 4.25.1
triton 1.0.0
typing-extensions 4.4.0
urllib3 1.26.14
wandb 0.13.10
wheel 0.34.2
xxhash 3.2.0
yarl 1.8.2

mallorbc avatar Mar 03 '23 05:03 mallorbc

Tried to compile with version 0.8.2 and it still does not work.

mallorbc avatar Mar 13 '23 17:03 mallorbc

I read into this more. I guess I don't have to compile it this way and can just use JIT compilation. Still an odd issue.

mallorbc avatar Mar 13 '23 18:03 mallorbc

@mallorbc - Do things work fine with JIT compilation? You're just not able to build the ops? Especially since you can build the ops on 0.8.0 but not 0.8.1 or 0.8.2?

loadams avatar Mar 13 '23 18:03 loadams

@mallorbc - Do things work fine with JIT compilation? You're just not able to build the ops? Especially since you can build the ops on 0.8.0 but not 0.8.1 or 0.8.2?

I use DeepSpeed for finetuning large language models and for inference. I can confirm that finetuning works using JIT for 0.8.2. I will be testing inference soon and will let you know if it works.
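
For context, this is roughly the kind of launch that exercises JIT compilation; the script and config names are placeholders for my own setup:

# On the first run, DeepSpeed builds the needed ops (cpu_adam, fused_adam, etc.) with ninja.
deepspeed --num_gpus=2 finetune.py --deepspeed ds_config.json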

mallorbc avatar Mar 13 '23 20:03 mallorbc

Inference seems to work as well, at least for fp16. Int8 does not work, but that has been an issue for a while. When I try to use int8, I get the following error:

Setting pad_token_id to eos_token_id:50256 for open-end generation.
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)

Free memory : 5.544067 (GigaBytes)
Total memory: 23.691101 (GigaBytes)
Requested memory: 1.375000 (GigaBytes)
Setting maximum total tokens (input + output) to 2048

!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
File "/app/server.py", line 249, in generate
    gen_text = gpt_model(prompt, do_sample=do_sample, max_length=total_max_length,min_length=total_min_length,temperature=temp_input,top_k=top_k_input,top_p=top_p_input,early_stopping=early_stopping_input,bad_words_ids=bad_word_ids,batch_size=len(prompt),num_beams=num_beams,penalty_alpha=penalty_alpha)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 210, in __call__
    return super().__call__(text_inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in __call__
    outputs = [output for output in final_iterator]
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in <listcomp>
    outputs = [output for output in final_iterator]
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 992, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 252, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 588, in _generate
    return self.module.generate(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/gptj/modeling_gptj.py", line 836, in forward
    lm_logits = self.lm_head(hidden_states).to(torch.float32)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

mallorbc avatar Mar 13 '23 21:03 mallorbc

Makes sense, thanks. I'd probably recommend opening a new issue for the int8 dtype errors, so the right folks can look at that.

But for the build issues, that's odd and something we should root cause and fix. I was able to repro with your docker image. Specifically, it looks like DS_BUILD_STOCHASTIC_TRANSFORMER, DS_BUILD_TRANSFORMER_INFERENCE, and DS_BUILD_TRANSFORMER all hit various different nvcc compilation problems.
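
For anyone narrowing this down, each extension can be prebuilt on its own using DeepSpeed's per-op build flags (the ones named above); the v0.8.1 tag format is an assumption:

# Build just one of the failing extensions to see which one trips nvcc
# (repeat with DS_BUILD_TRANSFORMER=1 or DS_BUILD_STOCHASTIC_TRANSFORMER=1):
DS_BUILD_TRANSFORMER_INFERENCE=1 pip install --force-reinstall git+https://github.com/microsoft/DeepSpeed@v0.8.1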

@RezaYazdaniAminabadi - any issues you've found or know of with this CUDA/torch version? Otherwise I'm inclined to test with another known-good one to see if there is some issue here.

loadams avatar Mar 13 '23 22:03 loadams

@loadams Thanks for looking into this issue. I am guessing you built the ops one at a time?

Is there a downside to using JIT? Perhaps it is slower than precompiling the ops?

I already have an issue open for int8 inference, which you can see here: #2956

I admit I may be mistaken that it is supposed to be a drop-in solution. If that is the case, clarity would be great.

Thanks!

mallorbc avatar Mar 14 '23 02:03 mallorbc

@mallorbc - correct, I was curious whether there was one op that was the problem, but it seems to be a host of CUDA-type issues, which is why I was suspecting the version there wasn't one of the more supported ones.

No downsides to JIT really, it could be a bit slower on the first run, but in benchmarks we've not really seen a big difference.
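
For what it's worth, a sketch of the JIT route: install a plain release with no DS_BUILD_* flags set, and each op is compiled with ninja the first time it is needed at runtime.

# Nothing is compiled at install time.
pip install deepspeed==0.8.2

# ds_report then lists the ops as not installed but still compatible;
# they will be JIT-built on first use.
ds_report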

Unfortunately I'm not sure on that bug, but they should reply quickly and get you an answer on that.

loadams avatar Mar 14 '23 14:03 loadams

Interestingly, it also occurs with CUDA 11.6 and 11.8. Linking this issue since it appears to be the same thing.

loadams avatar Mar 14 '23 15:03 loadams

@mallorbc - So it's likely this is related, and this fix works for that user: we're compiling in a different environment than the one we run in. Can you try with that branch to see if it resolves the issue? I lost my previous repro. We're working on the right solution to this but don't have one yet.

loadams avatar Mar 29 '23 21:03 loadams

Specifically, this PR is the one that added the changes causing the issues here. You can work around this for now with the sample PR above; let me know if that works, and we will work on fixing the cross-compilation issue.

loadams avatar Mar 29 '23 22:03 loadams

@loadams That makes sense as to why it may cause issues. I will try the PR with the fix when I get a chance and let you know what I find. Thanks!

mallorbc avatar Mar 30 '23 02:03 mallorbc

@mallorbc - following up, did that work for you? If so, we know cross-compilation is an outstanding issue, but I'd close this ticket for now and open an enhancement on our side for that.

loadams avatar Apr 10 '23 16:04 loadams

#3085 should be completed soon and will resolve this issue; we're also adding tests in #3277 to prevent this from regressing in the future.

loadams avatar Apr 18 '23 21:04 loadams