DeepSpeed
[BUG] Can't compile DeepSpeed version 0.8.1+ with CUDA 11.7
Describe the bug
When I try to compile DeepSpeed from the 0.8.1 tag using Docker and CUDA 11.7, the compilation fails. The Docker image tries to compile DeepSpeed with the following command, which has worked in the past:
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1
To Reproduce
Steps to reproduce the behavior:
- Use a Linux machine
- Use the Dockerfile here
- Change line 30 of the file to use tag 0.8.1 rather than 0.8.0
- Try building the Docker image (the change is sketched just below this list).
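For illustration, the change boils down to bumping the tag in the install command. This is a rough sketch rather than the literal Dockerfile contents, and the v-prefixed tag names are my assumption about how the release tags are spelled:

# before (builds fine):
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.0
# after (compilation fails):
DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1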
Expected behavior
Though compilation takes over 10 minutes, it has always worked with previous tags. I expect compilation to finish successfully rather than fail.
ds_report output
Because compilation fails, I cannot provide this.
Screenshots
After this, I get terminal output suggesting it is still trying to build, but then I get another error and the image fails to build.
System info (please complete the following information):
- Ubuntu 20.04
- Two RTX 3090s
- Using a Docker image linked elsewhere in the issue
Docker context
The Dockerfile is shared elsewhere in the issue.
Additional context
I have seen similar issues with regard to Windows.
Hi @mallorbc
Thanks for reporting this issue. I will try to see if I can repro this on my end. Thanks, Reza
Also, can I ask which PyTorch version you are using here?
@RezaYazdaniAminabadi Thanks for looking into this. Whatever the cause, I have seen many similar reports, so there seems to be either a new requirement for a successful install or a problem with 0.8.1 itself.
You can see my setup in the Dockerfile I linked in the issue. To install PyTorch, I run the following command:
pip install torch torchvision torchaudio
According to the PyTorch getting-started page, this installs 1.13.1 with CUDA 11.7. Looking at my installed packages with 0.8.0, I can confirm that this is indeed the version installed.
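If anyone wants to reproduce the exact environment, the same install can be pinned to the versions shown in the pip list below; a sketch (these versions resolve to the CUDA 11.7 wheels by default):

pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1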
See below for my pip list and ds_report output (for 0.8.0, though).
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0+bf6b9802, bf6b9802, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
Package Version
accelerate 0.15.0
aiohttp 3.8.3
aiosignal 1.3.1
appdirs 1.4.4
async-timeout 4.0.2
asyncio 3.4.3
attrs 22.2.0
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.3
datasets 2.8.0
deepspeed 0.8.0+bf6b9802
deepspeed-mii 0.0.5+6116e98
dill 0.3.6
docker-pycreds 0.4.0
filelock 3.9.0
frozenlist 1.3.3
fsspec 2022.11.0
gitdb 4.0.10
GitPython 3.1.31
grpcio 1.51.3
grpcio-tools 1.51.3
hjson 3.1.0
huggingface-hub 0.11.1
idna 3.4
multidict 6.0.4
multiprocess 0.70.14
ninja 1.11.1
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
packaging 23.0
pandas 1.5.2
pathtools 0.1.2
Pillow 9.4.0
pip 20.0.2
protobuf 4.22.0
psutil 5.9.4
py-cpuinfo 9.0.0
pyarrow 10.0.1
pydantic 1.10.5
python-dateutil 2.8.2
pytz 2022.7
PyYAML 6.0
regex 2022.10.31
requests 2.28.1
responses 0.18.0
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 45.2.0
six 1.16.0
smmap 5.0.0
tokenizers 0.13.2
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tqdm 4.64.1
transformers 4.25.1
triton 1.0.0
typing-extensions 4.4.0
urllib3 1.26.14
wandb 0.13.10
wheel 0.34.2
xxhash 3.2.0
yarl 1.8.2
Tried to compile with version 0.8.2 and it still does not work.
I read into this more. I guess I don't have to compile it this way and can just use JIT compilation instead. Still an odd issue.
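For reference, the JIT route just drops the prebuild flag, so the ops compile the first time they are used instead of at install time. A minimal sketch of what I mean (the v-prefixed tag name is an assumption):

# install without prebuilding the ops; kernels JIT-compile on first use (ninja is needed at runtime)
pip install git+https://github.com/microsoft/DeepSpeed@v0.8.2
# sanity check: ops should report as compatible even though none are pre-installed
ds_report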
@mallorbc - Do things work fine with JIT compilation? You're just not able to build the ops? Especially since you can build the ops on 0.8.0 but not 0.8.1 or 0.8.2?
I use DeepSpeed for finetuning large language models and for inference. I can confirm that finetuning works using JIT for 0.8.2. I will be testing inference soon and will let you know if it works.
Inference seems to work as well at least for fp16. Int8 does not work, but that has been an issue for a while. When I try to use int8 I get the following error:
Setting pad_token_id to eos_token_id:50256 for open-end generation.
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
Free memory : 5.544067 (GigaBytes) Total memory: 23.691101 (GigaBytes) Requested memory: 1.375000 (GigaBytes) Setting maximum total tokens (input + output) to 2048
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 12288, error: 13)
!!!! kernel execution error. (m: 4, n: 4, k: 85, error: 13)
!!!! kernel execution error. (m: 85, n: 4, k: 4, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 4096, error: 13)
!!!! kernel execution error. (m: 4096, n: 4, k: 16384, error: 13)
!!!! kernel execution error. (m: 16384, n: 4, k: 4096, error: 13)
File "/app/server.py", line 249, in generate
gen_text = gpt_model(prompt, do_sample=do_sample, max_length=total_max_length,min_length=total_min_length,temperature=temp_input,top_k=top_k_input,top_p=top_p_input,early_stopping=early_stopping_input,bad_words_ids=bad_word_ids,batch_size=len(prompt),num_beams=num_beams,penalty_alpha=penalty_alpha)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 210, in call
return super().call(text_inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in call
outputs = [output for output in final_iterator]
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1065, in cublasCreate(handle)
Makes sense, thanks. I'd probably recommend opening a new issue for the int8 dtype errors, so the right folks can look at that.
But for the build issues, that's odd and something we should root-cause and fix. I was able to repro with your Docker image. Specifically, it looks like DS_BUILD_STOCHASTIC_TRANSFORMER, DS_BUILD_TRANSFORMER_INFERENCE, and DS_BUILD_TRANSFORMER all hit different nvcc compilation problems.
@RezaYazdaniAminabadi - any issues you've found or know of with this CUDA/torch version? Otherwise I'm inclined to test with another known-good combination to see if there is some issue here.
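For anyone who wants to isolate a single op, each one can be prebuilt on its own via its DS_BUILD_* flag; a rough sketch, again assuming the v-prefixed tag name:

# prebuild only the transformer op, leaving everything else to JIT
DS_BUILD_TRANSFORMER=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1
# likewise for the other two failing ops
DS_BUILD_TRANSFORMER_INFERENCE=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1
DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1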
@loadams Thanks for looking into this issue. I am guessing you built the ops one at a time?
Is there a downside to using JIT? Perhaps it is slower than precompiling the ops?
I already have an issue open for int8 inference, which you can see here: #2956
I admit I may be mistaken in assuming it is supposed to be a drop-in solution. If that is the case, some clarification would be great.
Thanks!
@mallorbc - correct. I was curious whether a single op was the problem, but it seems to be a host of CUDA-type issues, which is why I suspected that the CUDA/torch version there wasn't one of the better-supported ones.
No real downsides to JIT; it can be a bit slower on the first run, but in benchmarks we haven't seen a big difference.
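If that first-run compile ever matters (e.g. right before serving traffic), the JIT build can be triggered once ahead of time; a sketch only, since the exact builder names and import path can differ by version:

# pre-warm: force a couple of ops to JIT-compile now instead of on the first request
python -c "from deepspeed.ops.op_builder import CPUAdamBuilder; CPUAdamBuilder().load()"
python -c "from deepspeed.ops.op_builder import InferenceBuilder; InferenceBuilder().load()"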
Unfortunately I'm not sure on that bug, but they should reply quickly and get you an answer on that.
Interestingly, it also occurs with CUDA 11.6 and 11.8. Linking this issue since it appears to be the same thing.
@mallorbc - So it's likely this is related, and this fix works for that user - the problem is that we're compiling in a different environment than the one we run in. Can you try with that branch to see if it resolves the issue? I lost my previous repro. We're working on the right solution to this but don't have one yet.
Specifically, this PR is the one that introduced the changes causing the issues here. You can work around it for now with the sample PR above; let me know if that works, and we will work on fixing the cross-compilation issue.
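As a general workaround when CUDA extensions get compiled in a different environment than the one they run in (e.g. a Docker build where no GPU is visible), the target architectures can be pinned explicitly. This is standard PyTorch-extension behavior rather than the fix in the PRs above; 8.6 below assumes the RTX 3090s from this issue:

# pin the CUDA architectures so the build does not need to query a GPU
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_OPS=1 pip install git+https://github.com/microsoft/DeepSpeed@v0.8.1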
@loadams That makes sense as to why it might cause issues. I will try the fix PR when I get a chance and let you know what I find. Thanks!
@mallorbc - following up, did that work for you? If so, we know cross-compilation is an outstanding issue, but I'd close this ticket for now and open an enhancement on our side for that.
#3085 should be completed soon and will resolve this issue; we're also adding tests in #3277 to prevent this from regressing in the future.