DeepSpeed
[BUG] DeepSpeed Inference reports Signal code: Integer divide-by-zero when Seq length is 4096 for GPT2
Describe the bug I am using DeepSpeed inference to run a GPT-2 model with a sequence length of 4096. It works fine for short inputs, but it fails with an integer divide-by-zero error when I give it an input with a shape like [1, 4000]. I traced the error a bit, and it comes from softmax_context_func.
To Reproduce Steps to reproduce the behavior:
- Simple inference script to reproduce
# file name: test_hg.py
import torch
import transformers
import deepspeed
import gc
from transformers import AutoConfig, AutoModelForCausalLM

deepspeed.init_distributed("nccl")
world_size = 1

# Build the GPT-2 model from the config below (n_ctx / n_positions = 4096).
config = AutoConfig.from_pretrained('./config.json')
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
model = model.eval()
torch.cuda.empty_cache()

# Wrap the model with the DeepSpeed inference engine and inject the fused kernels.
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.float16,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True)
torch.cuda.empty_cache()
gc.collect()
model = model.module

# A single input of ~4000 tokens triggers the divide-by-zero in softmax_context_func.
batch = 1
seq_length = 4000
input_ids = torch.randint(1, 1000, (batch, seq_length), dtype=torch.int64)
attention_mask = torch.ones((batch, seq_length), dtype=torch.int64)
input_ids = input_ids.cuda()
attention_mask = attention_mask.cuda()
model(input_ids=input_ids, attention_mask=attention_mask)
config.json is
{
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 1,
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 4096,
  "n_embd": 1920,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 4096,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "bfloat16",
  "use_cache": false,
  "vocab_size": 34176
}
- What packages are required and their versions: transformers 4.21.3
- How to run the script: for debugging purposes, I set world_size to 1, so the issue can be reproduced on a single GPU with the command
python test_hg.py
Expected behavior The script is expected to finish without any errors.
ds_report output
$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/mstart/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/ubuntu/dev/Mega/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+457850dc, 457850dc, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
System info (please complete the following information):
- OS: Ubuntu 18.04
- GPU count and types: one machine with 8x A100s
- DeepSpeed : 0.8.3+457850dc
- Hugging Face Transformers/Accelerate/etc. Transformers version 4.21.3
- Python version: 3.8.16
Hi all,
I found the root cause of this issue. It is caused by this line.
When the input token length is > 3840, ATTN_THREADS becomes smaller than reduce_width, and then (total_count - 1) ends up being divided by zero.
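To make the failure concrete: the grid size is computed with integer division, so once reduce_width exceeds ATTN_THREADS the divisor ATTN_THREADS / reduce_width truncates to 0 and the following division faults. A minimal standalone sketch (the values below are hypothetical, chosen only to illustrate the truncation; the real ones come from the kernel's launch configuration):

#include <cstdio>

int main() {
    // Hypothetical values for illustration only; the actual constants come
    // from the DeepSpeed attention kernel's launch configuration.
    const int ATTN_THREADS = 256;  // threads available per block (assumed)
    const int reduce_width = 512;  // grows with the sequence length (assumed)

    int divisor = ATTN_THREADS / reduce_width;  // 256 / 512 == 0 in integer math
    printf("ATTN_THREADS / reduce_width = %d\n", divisor);

    // The current scheduling expression then computes
    //   grid_num = (total_count - 1) / divisor + 1;
    // which raises SIGFPE (integer divide-by-zero) because divisor == 0.
    return 0;
}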
I am thinking about how to fix it; I haven't dived deep into the algorithm yet. A naive solution off the top of my head is below. I'd like to see whether it makes sense, and I am also looking forward to better solutions. Thanks!
int grid_num;
if (ATTN_THREADS < reduce_width) {
    // Long sequences: the reduction is wider than one block, so scale the
    // grid up instead of dividing by a quotient that truncates to zero.
    int factor = reduce_width / ATTN_THREADS;
    grid_num = (total_count - 1) * factor + 1;
} else {
    // Original path: several reductions fit within one block.
    grid_num = (total_count - 1) / (ATTN_THREADS / reduce_width) + 1;
}
dim3 grid_dim(grid_num);
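As a quick sanity check of the branching above, the scheduling arithmetic can be exercised on the host with a standalone helper (illustrative only, not part of DeepSpeed; the shapes below are made up):

#include <cstdio>

// Hypothetical host-side copy of the proposed grid sizing, so the arithmetic
// can be checked without launching a kernel.
static int compute_grid_num(int total_count, int attn_threads, int reduce_width) {
    if (attn_threads < reduce_width) {
        int factor = reduce_width / attn_threads;
        return (total_count - 1) * factor + 1;
    }
    return (total_count - 1) / (attn_threads / reduce_width) + 1;
}

int main() {
    // Short sequence: reduce_width <= ATTN_THREADS, the original formula applies.
    printf("short: %d\n", compute_grid_num(64, 256, 128));  // (63 / 2) + 1 == 32

    // Long sequence: reduce_width > ATTN_THREADS; the old expression would divide
    // by zero here, while the proposed branch scales the grid up instead.
    printf("long:  %d\n", compute_grid_num(64, 256, 512));  // (63 * 2) + 1 == 127
    return 0;
}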
@cmikeh2 Could you help check whether my fix is valid, or whether there are better solutions? Thanks!
@zhen-jia I also encountered this bug and fixed it by simply changing ATTN_THREADS to 512.
Thank you both for looking into this. I've made a PR (https://github.com/microsoft/DeepSpeed/pull/3046) to clean up this scheduling code so that it should work for our full range of supported sequence lengths. If you have an opportunity to test it on the long-sequence inputs you have, to make sure it fixes the issues you're seeing, it would be much appreciated. Thanks!
Thanks @mzusman and @cmikeh2. Will try it later.
Tested, and it works on my side. Thanks @mzusman