[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU
Describe the bug
Cannot run DeepSpeed with transformers on Ubuntu 20.04 with a single GPU. GPU: NVIDIA T4
Error:
[2022-06-08 01:32:19,612] [INFO] [engine.py:132:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
!!!! kernel execution error. (m: 7680, n: 3, k: 2560, error: 13)
Traceback (most recent call last):
File "test.py", line 19, in <module>
string = generator("DeepSpeed is", do_sample=True, min_length=50)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 175, in __call__
return super().__call__(text_inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1026, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1033, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 943, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 213, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1310, in generate
return self.sample(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1926, in sample
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 739, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 618, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 647, in forward
self.attention(input,
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 390, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 322, in forward
output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 277, in selfAttention_fp
qkv_out = qkv_func(input,
ValueError: Specified device cuda:0 does not match device of data cuda:-2
To Reproduce
Create container
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
nvidia-docker run -it -d nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
docker exec -it <your container> /bin/bash
Inside container
sudo apt-get update
apt-get install python3-dev python3-pip vim curl
pip3 install --upgrade pip
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113
pip3 install deepspeed transformers triton==1.0.0
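Before running the repro, a quick sanity check (a minimal sketch, not part of the original steps) that torch inside the container actually sees the GPU:
import torch

# Should print True, the GPU name (Tesla T4 on this machine), and the CUDA build version (11.3).
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)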
Code to run
Grab the code from the tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/
run
deepspeed --num_gpus 1 gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

# Rank/world size are provided by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Standard Hugging Face text-generation pipeline on this rank's GPU.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Swap in DeepSpeed's inference kernels (fp32, automatic kernel injection).
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
ds_report output
root@e463e31466e3:/# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Hi @lanking520, I just tried all your repro steps above and was not able to repro the stack trace. Can you confirm what transformers version you are using? I tried w. 4.19.2. I don't currently have access to a T4 to test there but I tried on A6000, A100, and V100 and all ran okay. I am trying to get a T4 to test there and will report back.
Also, just to double check, you can run fine if you remove deepspeed.init_inference, right?
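A minimal sketch of that check, assuming the same repro script with the deepspeed.init_inference call commented out:
import os
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))

# Plain Hugging Face pipeline on the GPU, no DeepSpeed kernel injection.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# generator.model = deepspeed.init_inference(...)  # intentionally left out

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)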
@jeffra Here is the pip list
root@b963730b133d:/# pip3 list
Package Version
------------------ ------------
certifi 2022.5.18.1
charset-normalizer 2.0.12
deepspeed 0.6.5
filelock 3.7.1
hjson 3.0.2
huggingface-hub 0.7.0
idna 3.3
ninja 1.10.2.3
numpy 1.22.4
packaging 21.3
pip 20.0.2
psutil 5.9.1
py-cpuinfo 8.0.0
pyparsing 3.0.9
PyYAML 6.0
regex 2022.6.2
requests 2.27.1
setuptools 45.2.0
tokenizers 0.12.1
torch 1.11.0+cu113
tqdm 4.64.0
transformers 4.19.2
triton 1.0.0
typing-extensions 4.2.0
urllib3 1.26.9
wheel 0.34.2
After commenting out the deepspeed.init_inference call, the problem is gone:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is a simple, clean and powerful app for controlling the speed of your phone, tablet or laptop.\n\nIn the past we have added more features to the simple to use, clean and powerful SpeedControl app. For the first version we'}]
Do you think this might be related to NVLink support?
@jeffra just tested an NVIDIA V100 machine with 4 GPUs onboard; the code works fine, which matches what you have tested. So I think we can narrow the issue down to NVIDIA Tesla T4 GPU compatibility.
Some interesting findings:
Output on single GPU:
[{'generated_text': 'DeepSpeed is the only company that can produce these kinds of parts, with a unique manufacturing process. Unlike other custom parts producers, they are able to produce parts at a much faster manufacturing pace and at a much more affordable cost.\n\nBy comparison'}]
Output on 4 GPUs doesn't smell quite right:
[{'generated_text': 'DeepSpeed is ( " _ p _ ( _ just l R I n de better T \'!] l Pskinned or wrong _ better _"),],"],"skinned one S "], " good"), _ ( E R–!] set (– de T'}]
@lanking520,
Thanks for trying this on multiple GPUs. I will try this on my end to see what the issue is and will let you know once I fix it. Regarding the T4 machine, could you please try this with half-precision too and see if the issue persists? This kind of error normally happens when there is a memory-allocation issue. I will look closer and let you know.
Best, Reza
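A minimal sketch of that half-precision variant, assuming the same repro script and changing only the dtype passed to deepspeed.init_inference:
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Same call as the repro, but with dtype=torch.half instead of torch.float.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)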
@RezaYazdaniAminabadi
I'm observing the same behavior on an A100-SXM4-40GB system. Using the same steps to reproduce as @lanking520, the single GPU run works but multiple GPUs produces garbage output.
[{'generated_text': 'DeepSpeed is]) _ _ second pskinned S like _ just so de l first de wrongskinned set high n!]"] M so de]," better"). S not like or E en],\u2013 n better so]," better E]," de second?" P'}]
I've tried running different versions of torch and cuda without any change in behavior.
Update: I'm also observing this behavior on V100-SXM2-16GB.
Just raised a separate issue here: https://github.com/microsoft/DeepSpeed/issues/2113 Still reproducible on 0.6.7
@jayargo did you manage to fix it by any chance?
@lanking520, I think there are two separate issues in this thread.
1. Single T4 inference w. DeepSpeed causing a kernel crash in your first post.
2. Multi-GPU inference producing garbage results for certain models.
Let's focus this issue on (1) and handle (2) in #2113.
I finally have access to a T4 and I am still unable to reproduce the original issue. I've tried the same torch/deepspeed/transformers versions as you originally reported and also the latest version of each. I am also using your same docker container and setup.
Can you confirm that the original issue is still reproducible?
root@fd8d0eb38aec:/# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000001:00:00.0 Off | Off |
| N/A 37C P8 11W / 70W | 0MiB / 16127MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@fd8d0eb38aec:/# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
root@fd8d0eb38aec:/# pip3 list
Package Version
------------------ ------------
certifi 2022.6.15
charset-normalizer 2.1.0
deepspeed 0.6.5
filelock 3.7.1
hjson 3.0.2
huggingface-hub 0.8.1
idna 3.3
ninja 1.10.2.3
numpy 1.23.1
packaging 21.3
pip 20.0.2
psutil 5.9.1
py-cpuinfo 8.0.0
pyparsing 3.0.9
PyYAML 6.0
regex 2022.7.25
requests 2.28.1
setuptools 45.2.0
tokenizers 0.12.1
torch 1.11.0+cu113
tqdm 4.64.0
transformers 4.19.2
triton 1.0.0
typing-extensions 4.3.0
urllib3 1.26.11
wheel 0.34.2
Here's the entire log of the run using your code snippet as well: https://gist.github.com/jeffra/b6966e155a57ec388444e13dd8b66402
The generated text is:
[{'generated_text': 'DeepSpeed is one of the most prominent and recognized companies in the world, with its presence on the world stage and a highly recognizable brand name. Fast forward twenty years and FastCompany.com has documented the evolution of FAST as a company, from'}]
@jeffra Just tested again with a single GPU and the error is gone. I also tested multi-GPU and am facing some NCCL issues; I think this might be because the machine does not have NVLink installed. I will close this issue and let's leave the rest in #2113.