[BUG][0.6.7] garbage output for multi-gpu with tutorial
Describe the bug
When running with 2 GPUs, I started to see garbage output being generated.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの ( ( "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}
To Reproduce
I am running on a 2-GPU instance with V100s; this is also reproducible on A100s.
Just follow this example: https://www.deepspeed.ai/tutorials/inference-tutorial/
# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation',
                     model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
Expected behavior
The output should be normal, coherent generated text.
ds_report output
# ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
Screenshots
DeepSpeed Transformer Inference config is {'layer_id': 29, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 30, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
DeepSpeed Transformer Inference config is {'layer_id': 31, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': True, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
[2022-07-19 23:58:00,131] [INFO] [engine.py:144:__init__] Place model to device: 1
[2022-07-19 23:58:00,153] [INFO] [engine.py:144:__init__] Place model to device: 0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'DeepSpeed is����極��極\\\\\\\\\\ \n\nの ( ( "\n090 nodot\x0c �\n �$, "\xa0\n\n \n\n \\ �\n �\n\n � �\n �osa\n\n � oldaran � � �aran======\\'}]
[2022-07-19 23:58:04,674] [INFO] [launch.py:210:main] Process 811 exits successfully.
[2022-07-19 23:58:05,675] [INFO] [launch.py:210:main] Process 810 exits successfully.
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 2x V100
- Python version: 3.8
This is also reproducible with the GPT-J-6B model if you simply switch the model name.
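For reference, the only change I mean is the model id in the pipeline call; a minimal sketch, assuming the EleutherAI/gpt-j-6B checkpoint on the Hugging Face Hub:

# Hypothetical swap in the tutorial script above: GPT-J-6B instead of GPT-Neo-2.7B.
import os
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
generator = pipeline('text-generation',
                     model='EleutherAI/gpt-j-6B',  # was 'EleutherAI/gpt-neo-2.7B'
                     device=local_rank)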
I am also seeing this, but not with every model. I do see it when using the tutorial model as well, though.
Thank you for reporting this! I've verified we can repro this on our side as well, but only when using more than one GPU. There is currently a gap in our CI tests for multi-GPU runs and certain models. We'll fix this as soon as possible.
Hi @zcrypt0 and @lanking520 ,
Sorry for my delay! I just pushed a fix for this. Could you please try it and see if the issue is fixed? Thanks, Reza
@RezaYazdaniAminabadi
I installed from your PR commit.
pip install git+https://github.com/microsoft/DeepSpeed@73fc0303bf723386df95be0e55259197e540506e
With the bigscience/bloom-350m model I don't see any change in the output; it still doesn't make any sense.
In fact, with that model, I see the issue even when using --num_gpus 1.
I also tested the script that @lanking520 posted and I get the following error:
venv/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 415, in selfAttention_fp
qkv_out = qkv_func(
RuntimeError: Fail to create cublas handle.
I double-checked by reverting the DeepSpeed installation to master, and the test script still gives that error, so it's possible it's something in my environment, although other models seem to work.
@zcrypt0 I think this must be related to some issue with your CUDA driver/libraries, since you did not even get past the first phase of creating a cuBLAS handle. Could you please try reinstalling them? Thanks, Reza
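One quick way to sanity-check the CUDA stack outside of DeepSpeed (my own suggestion, assuming plain PyTorch is installed) is to run a small matmul on each GPU; that path also goes through cuBLAS, so if it fails the problem is in the driver/libraries rather than in DeepSpeed:

# Standalone cuBLAS sanity check: a plain PyTorch matmul also exercises cuBLAS.
import torch

for i in range(torch.cuda.device_count()):
    a = torch.randn(1024, 1024, device=f'cuda:{i}')
    b = torch.randn(1024, 1024, device=f'cuda:{i}')
    c = a @ b  # raises a cuBLAS error here if the CUDA libraries are broken
    print(f'cuda:{i} matmul OK, mean={c.mean().item():.4f}')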
@RezaYazdaniAminabadi I just saw #2194 and am thinking my issue may be related to that, since I ran on 1080 Tis.
I am going to test this script on a set of Ampere GPUs and see how it goes.
EDIT: I installed from master and ran the script on 2x A100s. This was the output.
[{'generated_text': 'DeepSpeed is a software house that makes software that solves very hard problems\n\n"Why we do what we do"\n\nIn most cases, Fastest.fm\'s original business plan was to monetize the\ncontent its users provided. This'}]
I was also getting junk output following the tutorial. I can confirm that after building DeepSpeed from master, the issue seems resolved for GPT-Neo 2.7B.
I am, however, having another issue with regard to memory usage. Even when I specify torch.half (or torch.float16), the model seems to use the full VRAM on both GPUs. For example, running GPT-J on dual 3090s leads to OOM issues, with usage over 24GB on each.
Also, and perhaps I am misunderstanding the use of this tool, but isn't the VRAM usage supposed to be split across the multiple GPUs? I would expect roughly 6-7GB of usage per GPU rather than 24GB on each.
I give more details in #2227; a sketch of the setup I'm running is below.
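For reference, a minimal sketch of what I'm describing, assuming the tutorial script from the top of this issue with the model swapped to the EleutherAI/gpt-j-6B checkpoint and the dtype changed to half precision, launched with deepspeed --num_gpus 2 as before:

# Hypothetical fp16 variant of the tutorial script; only the model id and dtype differ.
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-j-6B', device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,   # tensor-parallel degree (2 on dual 3090s)
                                           dtype=torch.half,     # fp16 instead of torch.float
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(generator("DeepSpeed is", do_sample=True, min_length=50))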
Closing; the original issue is resolved, and the new issue has been moved to #2227.