
[BUG] deepspeed-inference does not seem to work correctly with torch.half on Pascal GPUs

Open · wkkautas opened this issue 3 years ago · 5 comments

Describe the bug

Thanks for releasing deepspeed-inference. I'm following the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and want to run inference in half precision by setting dtype=torch.half. However, on a Tesla P40 it does not seem to work correctly and generates meaningless text such as [{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]. As a side note, when I switched to a Tesla T4 with the same environment and script, this issue was not observed (log attached under Additional context). Are Pascal GPUs not supported by deepspeed-inference?
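For reference, the two GPUs differ in CUDA compute capability: the Tesla P40 is Pascal (6.1), while the T4 is Turing (7.5). A quick check like the sketch below (not part of my original repro, using only standard torch.cuda APIs) can confirm which category a given machine falls into:

# check_gpu_capability.py -- illustrative check, not part of the original repro
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    # The Tesla P40 reports 6.1 (Pascal); the Tesla T4 reports 7.5 (Turing).
    print(f"{name}: compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        print("Pre-Volta GPU: the FP16 inference path appears to misbehave here.")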

To Reproduce

# Filename: gpt-neo-2.7b-generation-float16.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:04,866] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:35:04,866] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:35:06,221] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:35:06,221] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:35:06,221] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:35:06,221] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:35:59,433] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:35:59,434] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25583362579345703 seconds
[2022-08-08 13:36:00,342] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]
[2022-08-08 13:36:14,299] [INFO] [launch.py:318:main] Process 33 exits successfully.

Expected behavior

generated_text should be some meaningful text.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8     9W / 250W |      0MiB / 24451MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additional context

When I switched the GPU to a Tesla T4, this issue was not observed.

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:14,922] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:40:14,923] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:40:16,025] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:40:16,025] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:40:16,025] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:40:16,025] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:42:26,151] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:42:26,152] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25422143936157227 seconds
[2022-08-08 13:42:26,994] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is the result of his experiences with the U.S. Army. He served from 2000 to 2004 as a Combat Medic in Special Forces with the 2nd Platoon, 1st Sustainment Brigade. He also has served as a Fire'}]
[2022-08-08 13:42:39,178] [INFO] [launch.py:318:main] Process 29 exits successfully.

wkkautas · Aug 08 '22

This might be related to the weird output I see with bigscience/bloom-350m, because I am using a 1080 Ti in those tests.

zcrypt0 · Aug 10 '22

@zcrypt0 the 350m model is one you shouldn't use. I believe it's not fully trained and will therefore output garbage. The model page was recently updated, and HF suggests you use the 560m version instead.

mrwyattii · Aug 11 '22

@wkkautas Thanks for reporting your issue. I'll try to reproduce what you're seeing and report back.

mrwyattii · Aug 11 '22

@mrwyattii I saw weird output with other models too, including the one from the inference tutorial. I've switched over to Volta now and everything is working well.

zcrypt0 · Aug 12 '22

@wkkautas I am able to reproduce this error on a P40. I also noted that MP>1 with FP16 breaks here (but FP32 seems to work). We are working on a solution.
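Until a fix lands, one possible interim workaround (a minimal sketch based on the repro script above, not something we have validated) is to select the dtype from the GPU's compute capability so that pre-Volta cards fall back to FP32:

# gpt-neo-2.7b-generation-autodtype.py -- hypothetical variant of the repro script
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Use FP16 only on compute capability >= 7.0 (Volta and newer);
# on Pascal (e.g. the P40, 6.1) fall back to FP32, which seems unaffected.
major, _minor = torch.cuda.get_device_capability(local_rank)
dtype = torch.half if major >= 7 else torch.float

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=dtype,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)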

mrwyattii · Aug 18 '22

Just removing these #if guards seems to work: https://github.com/microsoft/DeepSpeed/blob/521d329b975de97ec0b52395f02bb32466b8dc35/csrc/transformer/inference/csrc/transform.cu#L93 Is there any reason to require a CUDA arch of 700 or higher? (The P40 would be 610.) Thank you.

wkkautas · Nov 09 '22

Hi @wkkautas,

This PR https://github.com/microsoft/DeepSpeed/pull/2574 should fix the issue you are seeing. If you have time, please try it on your end to make sure it works as expected. Thanks!

cmikeh2 · Dec 05 '22

If you are still seeing this issue, please reopen.

cmikeh2 · Dec 09 '22