
[BUG] deepspeed-inference does not seem to work correctly with torch.half on Pascal GPUs

Open · wkkautas opened this issue 3 years ago · 5 comments

Describe the bug

Thanks for releasing deepspeed-inference. I'm following the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and want to run inference in half precision by setting dtype=torch.half. However, on a Tesla P40 it does not seem to work correctly and generates meaningless text such as [{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]. As a side note, when I switched to a Tesla T4 with the same environment and script, this issue was not observed (log attached under Additional context). Are Pascal GPUs not supported by deepspeed-inference?
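For reference, the two GPUs differ in CUDA compute capability: the Tesla P40 is Pascal (6.1), while the T4 is Turing (7.5). A quick check like the sketch below (not part of my original repro, using only standard torch.cuda APIs) can confirm which category a given machine falls into:

# check_gpu_capability.py -- illustrative check, not part of the original repro
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    # The Tesla P40 reports 6.1 (Pascal); the Tesla T4 reports 7.5 (Turing).
    print(f"{name}: compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        print("Pre-Volta GPU: the FP16 inference path appears to misbehave here.")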

To Reproduce

# Filename: gpt-neo-2.7b-generation-float16.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:04,866] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:35:04,866] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:35:06,221] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:35:06,221] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:35:06,221] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:35:06,221] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:35:59,433] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:35:59,434] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25583362579345703 seconds
[2022-08-08 13:36:00,342] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]
[2022-08-08 13:36:14,299] [INFO] [launch.py:318:main] Process 33 exits successfully.

Expected behavior

generated_text should be some meaningful text.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8     9W / 250W |      0MiB / 24451MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additional context

When I switched the GPU to a Tesla T4, this issue was not observed.

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:14,922] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:40:14,923] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:40:16,025] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:40:16,025] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:40:16,025] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:40:16,025] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:42:26,151] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:42:26,152] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25422143936157227 seconds
[2022-08-08 13:42:26,994] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is the result of his experiences with the U.S. Army. He served from 2000 to 2004 as a Combat Medic in Special Forces with the 2nd Platoon, 1st Sustainment Brigade. He also has served as a Fire'}]
[2022-08-08 13:42:39,178] [INFO] [launch.py:318:main] Process 29 exits successfully.

wkkautas · Aug 08 '22

This might be related to the weird output I see with bigscience/bloom-350m, because I am using a 1080 Ti in those tests.

zcrypt0 · Aug 10 '22

@zcrypt0 the 350m model is one you shouldn't use. I believe it's not fully trained and will therefore output garbage. The model page was recently updated, and HF suggests you use the 560m version instead.

mrwyattii · Aug 11 '22

@wkkautas Thanks for reporting your issue. I'll try to reproduce what you're seeing and report back.

mrwyattii · Aug 11 '22

@mrwyattii I saw weird output with other models too, including the one from the inference tutorial. I've switched over to Volta now and everything is working well.

zcrypt0 · Aug 12 '22

@wkkautas I am able to reproduce this error on a P40. I also noted that MP>1 with FP16 breaks here (but FP32 seems to work). We are working on a solution.
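Until a fix lands, one possible interim workaround (a minimal sketch based on the repro script above, not something we have validated) is to select the dtype from the GPU's compute capability so that pre-Volta cards fall back to FP32:

# gpt-neo-2.7b-generation-autodtype.py -- hypothetical variant of the repro script
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Use FP16 only on compute capability >= 7.0 (Volta and newer);
# on Pascal (e.g. the P40, 6.1) fall back to FP32, which seems unaffected.
major, _minor = torch.cuda.get_device_capability(local_rank)
dtype = torch.half if major >= 7 else torch.float

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=dtype,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)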

mrwyattii · Aug 18 '22

Just removing these #if guards seems to work: https://github.com/microsoft/DeepSpeed/blob/521d329b975de97ec0b52395f02bb32466b8dc35/csrc/transformer/inference/csrc/transform.cu#L93 Is there any reason to require a CUDA arch of 700 or higher? (The P40 would be 610.) Thank you.

wkkautas · Nov 09 '22

Hi @wkkautas,

This PR https://github.com/microsoft/DeepSpeed/pull/2574 should fix the issue you are seeing. If you have time, please try it on your end to make sure it works as expected. Thanks!

cmikeh2 · Dec 05 '22

If you are still seeing this issue, please reopen.

cmikeh2 · Dec 09 '22