[BUG][0.6.7] garbage output for multi-gpu with tutorial
Describe the bug
When running with 2 GPUs, I started to see garbage output being generated.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの ( ( "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}
To Reproduce
I am running on a 2-GPU instance with V100s; this is also reproducible on A100s.
Just follow this example: https://www.deepspeed.ai/tutorials/inference-tutorial/
# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation',
                     model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
Expected behavior
The output should be normal, coherent generated text.
ds_report output
# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Screenshots
DeepSpeed Transformer Inference config is {'layer_id': 29, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
[2022-07-19 23:58:00,131] [INFO] [engine.py:144:__init__] Place model to device: 1
[2022-07-19 23:58:00,153] [INFO] [engine.py:144:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの ( ( "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}]
[2022-07-19 23:58:04,674] [INFO] [launch.py:210:main] Process 811 exits successfully.
[2022-07-19 23:58:05,675] [INFO] [launch.py:210:main] Process 810 exits successfully.
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 2x V100
- Python version: 3.8
This is also reproducible with the GPT-J-6B model if you simply switch the model name.
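For reference, the only change I mean is the model id in the pipeline call; a minimal sketch, assuming the EleutherAI/gpt-j-6B checkpoint on the Hugging Face Hub:

# Hypothetical swap in the tutorial script above: GPT-J-6B instead of GPT-Neo-2.7B.
import os
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
generator = pipeline('text-generation',
                     model='EleutherAI/gpt-j-6B',  # was 'EleutherAI/gpt-neo-2.7B'
                     device=local_rank)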
I am also seeing this, but not with every model. I do see it when using the tutorial model as well, though.
Thank you for reporting this! I've verified we can repro this on our side as well, but only when using more than one GPU. There is currently a gap in our CI tests for multi-GPU runs and certain models. We'll fix this as soon as possible.
Hi @zcrypt0 and @lanking520 ,
Sorry for my delay! I just pushed a fix for this. Could you please try it and see if the issue is fixed? Thanks, Reza
@RezaYazdaniAminabadi
I installed from your PR commit.
pip install git+https://github.com/microsoft/DeepSpeed@73fc0303bf723386df95be0e55259197e540506e
With the bigscience/bloom-350m model I don't see any change in the output; it still doesn't make any sense.
In fact, with that model, I see the issue even when using --num_gpus 1.
I also tested the script that @lanking520 posted and I get the following error:
venv/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 415, in selfAttention_fp
qkv_out = qkv_func(
RuntimeError: Fail to create cublas handle.
I double-checked by reverting the DeepSpeed installation to master, and the test script still gives that error, so it's possible it's something in my environment, although other models seem to work.
@zcrypt0 I think this must be related to some issue with your CUDA driver/libraries, since you did not even get past the first phase of creating a cuBLAS handle. Could you please try reinstalling them? Thanks, Reza
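One quick way to sanity-check the CUDA stack outside of DeepSpeed (my own suggestion, assuming plain PyTorch is installed) is to run a small matmul on each GPU; that path also goes through cuBLAS, so if it fails the problem is in the driver/libraries rather than in DeepSpeed:

# Standalone cuBLAS sanity check: a plain PyTorch matmul also exercises cuBLAS.
import torch

for i in range(torch.cuda.device_count()):
    a = torch.randn(1024, 1024, device=f'cuda:{i}')
    b = torch.randn(1024, 1024, device=f'cuda:{i}')
    c = a @ b  # raises a cuBLAS error here if the CUDA libraries are broken
    print(f'cuda:{i} matmul OK, mean={c.mean().item():.4f}')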
@RezaYazdaniAminabadi I just saw #2194 and am thinking my issue may be related to that, since I ran on 1080 Tis.
I am going to test this script on a set of Ampere GPUs and see how it goes.
EDIT: I installed from master and ran the script on 2x A100s. This was the output.
[{'generated_text': 'DeepSpeed is a software house that makes software that solves very hard problems\n\n"Why we do what we do"\n\nIn most cases, Fastest.fm\'s original business plan was to monetize the\ncontent its users provided. This'}]
I was also getting junk output following the tutorial. I can confirm that after building DeepSpeed from master, the issue seems resolved for GPT-Neo 2.7B.
I am, however, having another issue with regard to memory usage. Even when I specify torch.half (or torch.float16), the model seems to use the full VRAM on both GPUs. For example, running GPT-J on dual 3090s leads to OOM issues, with usage over 24GB on each.
Also, and perhaps I am misunderstanding the use of this tool, but isn't the VRAM usage supposed to be split across the multiple GPUs? I would expect roughly 6-7GB of usage per GPU rather than 24GB on each.
I give more details in #2227; a sketch of the setup I'm running is below.
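For reference, a minimal sketch of what I'm describing, assuming the tutorial script from the top of this issue with the model swapped to the EleutherAI/gpt-j-6B checkpoint and the dtype changed to half precision, launched with deepspeed --num_gpus 2 as before:

# Hypothetical fp16 variant of the tutorial script; only the model id and dtype differ.
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-j-6B', device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,   # tensor-parallel degree (2 on dual 3090s)
                                           dtype=torch.half,     # fp16 instead of torch.float
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(generator("DeepSpeed is", do_sample=True, min_length=50))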
Closing; the original issue is resolved, and the new issue has been moved to #2227.