[BUG] Cannot run DeepSpeed with transformers on NVIDIA Tesla T4 GPU
Describe the bug
Cannot run DeepSpeed with transformers on Ubuntu 20.04 with a single GPU. GPU: NVIDIA T4
Error:
[2022-06-08 01:32:19,612] [INFO] [engine.py:132:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
!!!! kernel execution error. (m: 7680, n: 3, k: 2560, error: 13)
Traceback (most recent call last):
File "test.py", line 19, in <module>
string = generator("DeepSpeed is", do_sample=True, min_length=50)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 175, in __call__
return super().__call__(text_inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1026, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1033, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 943, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 213, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, **generate_kwargs) # BS x SL
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1310, in generate
return self.sample(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1926, in sample
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 739, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 618, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 647, in forward
self.attention(input,
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 390, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 322, in forward
output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 277, in selfAttention_fp
qkv_out = qkv_func(input,
ValueError: Specified device cuda:0 does not match device of data cuda:-2
To Reproduce
Create container
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
nvidia-docker run -it -d nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
docker exec -it <your container> /bin/bash
Inside container
sudo apt-get update
apt-get install python3-dev python3-pip vim curl
pip3 install --upgrade pip
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113
pip3 install deepspeed transformers triton==1.0.0
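Before running the repro, a quick sanity check (a minimal sketch, not part of the original steps) that torch inside the container actually sees the GPU:
import torch

# Should print True, the GPU name (Tesla T4 on this machine), and the CUDA build version (11.3).
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)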
Code to run
Grab the code from the tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/
run
deepspeed --num_gpus 1 gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

# Rank/world size are provided by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Standard Hugging Face text-generation pipeline on this rank's GPU.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Swap in DeepSpeed's inference kernels (fp32, automatic kernel injection).
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
ds_report output
root@e463e31466e3:/# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Hi @lanking520, I just tried all your repro steps above and was not able to repro the stack trace. Can you confirm what transformers version you are using? I tried w. 4.19.2. I don't currently have access to a T4 to test there but I tried on A6000, A100, and V100 and all ran okay. I am trying to get a T4 to test there and will report back.
Also, just to double check, you can run fine if you remove deepspeed.init_inference, right?
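A minimal sketch of that check, assuming the same repro script with the deepspeed.init_inference call commented out:
import os
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))

# Plain Hugging Face pipeline on the GPU, no DeepSpeed kernel injection.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# generator.model = deepspeed.init_inference(...)  # intentionally left out

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)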
@jeffra Here is the pip list
root@b963730b133d:/# pip3 list
Package Version
------------------ ------------
certifi 2022.5.18.1
charset-normalizer 2.0.12
deepspeed 0.6.5
filelock 3.7.1
hjson 3.0.2
huggingface-hub 0.7.0
idna 3.3
ninja 1.10.2.3
numpy 1.22.4
packaging 21.3
pip 20.0.2
psutil 5.9.1
py-cpuinfo 8.0.0
pyparsing 3.0.9
PyYAML 6.0
regex 2022.6.2
requests 2.27.1
setuptools 45.2.0
tokenizers 0.12.1
torch 1.11.0+cu113
tqdm 4.64.0
transformers 4.19.2
triton 1.0.0
typing-extensions 4.2.0
urllib3 1.26.9
wheel 0.34.2
After commenting out the deepspeed.init_inference call, the problem is gone:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is a simple, clean and powerful app for controlling the speed of your phone, tablet or laptop.\n\nIn the past we have added more features to the simple to use, clean and powerful SpeedControl app. For the first version we'}]
Do you think this might be related to NVLink support?
@jeffra just tested an NVIDIA V100 machine with 4 GPUs onboard; the code works fine, which matches what you have tested. So I think we can narrow the issue down to NVIDIA Tesla T4 GPU compatibility.
Some interesting findings:
Output on single GPU:
[{'generated_text': 'DeepSpeed is the only company that can produce these kinds of parts, with a unique manufacturing process. Unlike other custom parts producers, they are able to produce parts at a much faster manufacturing pace and at a much more affordable cost.\n\nBy comparison'}]
Output on 4 GPUs doesn't smell quite right:
[{'generated_text': 'DeepSpeed is ( " _ p _ ( _ just l R I n de better T \'!] l Pskinned or wrong _ better _"),],"],"skinned one S "], " good"), _ ( E R–!] set (– de T'}]
@lanking520,
Thanks for trying this on multiple GPUs. I will try this on my end to see what the issue is and will let you know once I fix it. Regarding the T4 machine, could you please try this with half-precision too and see if the issue persists? This kind of error normally happens when there is a memory-allocation issue. I will look closer and let you know.
Best, Reza
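A minimal sketch of that half-precision variant, assuming the same repro script and changing only the dtype passed to deepspeed.init_inference:
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Same call as the repro, but with dtype=torch.half instead of torch.float.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)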
@RezaYazdaniAminabadi
I'm observing the same behavior on an A100-SXM4-40GB system. Using the same steps to reproduce as @lanking520, the single GPU run works but multiple GPUs produces garbage output.
[{'generated_text': 'DeepSpeed is]) _ _ second pskinned S like _ just so de l first de wrongskinned set high n!]"] M so de]," better"). S not like or E en],\u2013 n better so]," better E]," de second?" P'}]
I've tried running different versions of torch and cuda without any change in behavior.
Update: I'm also observing this behavior on V100-SXM2-16GB.
Just raised a separate issue here: https://github.com/microsoft/DeepSpeed/issues/2113 Still reproducible on 0.6.7
@jayargo did you manage to fix it by any chance?
@lanking520, I think there are two separate issues in this thread.
1. Single T4 inference w. DeepSpeed causing a kernel crash in your first post.
2. Multi-GPU inference producing garbage results for certain models.
Let's focus this issue on (1) and handle (2) in #2113.
I finally have access to a T4 and I am still unable to reproduce the original issue. I've tried the same torch/deepspeed/transformers versions as you originally reported and also the latest version of each. I am also using your same docker container and setup.
Can you confirm that the original issue is still reproducible?
root@fd8d0eb38aec:/# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000001:00:00.0 Off | Off |
| N/A 37C P8 11W / 70W | 0MiB / 16127MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@fd8d0eb38aec:/# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
root@fd8d0eb38aec:/# pip3 list
Package Version
------------------ ------------
certifi 2022.6.15
charset-normalizer 2.1.0
deepspeed 0.6.5
filelock 3.7.1
hjson 3.0.2
huggingface-hub 0.8.1
idna 3.3
ninja 1.10.2.3
numpy 1.23.1
packaging 21.3
pip 20.0.2
psutil 5.9.1
py-cpuinfo 8.0.0
pyparsing 3.0.9
PyYAML 6.0
regex 2022.7.25
requests 2.28.1
setuptools 45.2.0
tokenizers 0.12.1
torch 1.11.0+cu113
tqdm 4.64.0
transformers 4.19.2
triton 1.0.0
typing-extensions 4.3.0
urllib3 1.26.11
wheel 0.34.2
Here's the entire log of the run using your code snippet as well: https://gist.github.com/jeffra/b6966e155a57ec388444e13dd8b66402
The generated text is:
[{'generated_text': 'DeepSpeed is one of the most prominent and recognized companies in the world, with its presence on the world stage and a highly recognizable brand name. Fast forward twenty years and FastCompany.com has documented the evolution of FAST as a company, from'}]
@jeffra Just tested again with a single GPU and the error is gone. I also tested multi-GPU and am facing some NCCL issues; I think this might be because the machine does not have NVLink installed. I will close this issue and let's leave the rest in #2113.