
[BUG] DeepSpeed Inference reports Signal code: Integer divide-by-zero when Seq length is 4096 for GPT2

zhen-jia opened this issue on Mar 09 '23

Describe the bug I am using DeepSpeed inference to run a GPT-2 model with a sequence length of 4096. It works fine for small inputs, but it fails with a divide-by-zero error when I give it an input with a shape like [1, 4000]. I traced the error a little, and it comes from softmax_context_func.

To Reproduce Steps to reproduce the behavior:

  1. Simple inference script to reproduce
# file name: test_hg.py
import torch
import transformers
import deepspeed
import gc
from transformers import AutoConfig, AutoModelForCausalLM

deepspeed.init_distributed("nccl")
world_size = 1
config = AutoConfig.from_pretrained('./config.json')
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
model = model.eval()
torch.cuda.empty_cache()


model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.float16,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True)

torch.cuda.empty_cache()
gc.collect()

model = model.module
batch = 1
seq_length = 4000

input_ids = torch.randint(1, 1000, (batch, seq_length), dtype=torch.int64)
attention_mask = torch.ones((batch, seq_length), dtype=torch.int64)

input_ids = input_ids.cuda()
attention_mask = attention_mask.cuda()


model(input_ids=input_ids, attention_mask=attention_mask)

config.json is

{
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 1,
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 4096,
  "n_embd": 1920,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 4096,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "bfloat16",
  "use_cache": false,
  "vocab_size": 34176
}

  2. What packages are required and their versions: transformers 4.21.3

  3. How to run the script: for debugging purposes I set world_size to 1, so a single GPU can reproduce it with the command python test_hg.py

Expected behavior The script is expected to finish without any error.

ds_report output Please run ds_report to give us details about your setup.

$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/mstart/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/ubuntu/dev/Mega/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+457850dc, 457850dc, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6


Screenshots: (image attached)

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: one machine with 8x A100s
  • DeepSpeed : 0.8.3+457850dc
  • Hugging Face Transformers/Accelerate/etc.: Transformers 4.21.3
  • Python version: 3.8.16

zhen-jia avatar Mar 09 '23 18:03 zhen-jia

Hi all, I found the root cause of this issue. It is caused by this line. When the input token length is > 3840, ATTN_THREADS becomes smaller than reduce_width, and then (total_count - 1) gets divided by zero.
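For illustration, here is a toy standalone sketch of that launch arithmetic; the constants are placeholder values I picked, not the actual numbers in the DeepSpeed kernel:

// Toy sketch of the grid-size arithmetic, NOT the actual DeepSpeed kernel code.
// ATTN_THREADS, reduce_width and total_count below are placeholder values.
#include <cstdio>

int main() {
    const int ATTN_THREADS = 1024;  // placeholder threads per block
    int reduce_width = 2048;        // placeholder; grows with the sequence length
    int total_count = 16;           // placeholder number of softmax rows

    // Once reduce_width exceeds ATTN_THREADS, this integer division truncates to 0 ...
    int rows_per_block = ATTN_THREADS / reduce_width;
    // ... and the next line raises the integer divide-by-zero signal.
    int grid_num = (total_count - 1) / rows_per_block + 1;
    printf("grid_num = %d\n", grid_num);
    return 0;
}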

I am thinking about how to fix it. I haven't done a deep dive into the algorithm yet, but a naive solution off the top of my head is below. I'd like to see whether it makes sense, and I'm also looking forward to better solutions. Thanks!

      int grid_num;
      if (ATTN_THREADS < reduce_width) {
          // One reduction row is wider than a block: give each row
          // reduce_width / ATTN_THREADS blocks instead of dividing by a
          // quotient that has truncated to zero.
          int factor = reduce_width / ATTN_THREADS;
          grid_num = (total_count - 1) * factor + 1;
      } else {
          // Original path: pack several rows into one block.
          grid_num = (total_count - 1) / (ATTN_THREADS / reduce_width) + 1;
      }
      dim3 grid_dim(grid_num);
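If I am reading the scheduling right, this branch just flips the mapping when one reduction row is wider than a block: instead of packing ATTN_THREADS / reduce_width rows into each block, every row gets reduce_width / ATTN_THREADS blocks of its own. With the placeholder numbers from the sketch above (total_count = 16, reduce_width = 2048, ATTN_THREADS = 1024), factor is 2 and grid_num is 31 instead of a divide-by-zero.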

zhen-jia avatar Mar 10 '23 18:03 zhen-jia

@cmikeh2 Could you help check whether my fix is valid, or whether there are better solutions? Thanks!

zhen-jia avatar Mar 10 '23 21:03 zhen-jia

@zhen-jia I also encountered this bug and fixed it by simply changing ATTN_THREADS to 512.

mzusman avatar Mar 17 '23 10:03 mzusman

Thank you both for looking into this. I've made a PR (https://github.com/microsoft/DeepSpeed/pull/3046) to clean up this scheduling code such that it should work for our full range of supported sequence lengths. If you have an opportunity to test this on the long sequence length inputs you have to make sure it fixes the issues you're seeing, it would be much appreciated. Thanks!

cmikeh2 avatar Mar 17 '23 15:03 cmikeh2

Thanks @mzusman and @cmikeh2. Will try it later.

zhen-jia avatar Mar 20 '23 23:03 zhen-jia

Tested, and it works on my side. Thanks @mzusman

zhen-jia avatar Mar 22 '23 17:03 zhen-jia