[BUG] Illegal memory access CUDA error when using long sequences
Describe the bug
Running a forward pass on a DeepSpeedTransformerInference layer with a sequence length of ~1000 tokens results in an illegal memory access CUDA error.
To Reproduce
Here is a minimal reproducible example that shows the bug:
from deepspeed.ops.transformer import DeepSpeedInferenceConfig, DeepSpeedTransformerInference
import torch
torch.cuda.set_device(0)
hidden_size = 256
heads = 8
num_layers = 12
fp16 = True
layernorm_epsilon = 1e-5
deepspeed_config = DeepSpeedInferenceConfig(
    hidden_size=hidden_size,
    intermediate_size=hidden_size * 4,
    heads=heads,
    num_hidden_layers=num_layers,
    layer_norm_eps=layernorm_epsilon,
    # encoder_decoder=False,
    fp16=fp16,
    pre_layer_norm=True,
    stochastic_mode=False,
    scale_attention=True,
    triangular_masking=True,
    local_attention=False,
    window_size=256,
)
transformer = DeepSpeedTransformerInference(config=deepspeed_config)
transformer.half()
new_state_dict = {k: 0.01 * torch.ones(*v.shape, dtype=v.dtype, device=v.device)
                  for k, v in transformer.state_dict().items()}
transformer.load_state_dict(new_state_dict)
transformer.cuda()
device = list(transformer.parameters())[0].device
batch_size = 1
seq_len = 1000
inputs = torch.ones((batch_size, seq_len, hidden_size), dtype=torch.float16, device=device)
input_mask = torch.ones(*inputs.shape[:2], dtype=bool, device=device)
output, _ = transformer(
    input=inputs,
    input_mask=input_mask)
print(f"output: \n {output}")
Running the code results in the following exception:
RuntimeError: CUDA error: an illegal memory access was encountered
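To localize the faulting kernel: CUDA launches are asynchronous, so the Python frame in the traceback may not be the call that actually faulted. A small sketch (not part of the original report) of forcing synchronous launches before rerunning the snippet above:

```python
import os

# CUDA kernel launches are asynchronous by default, so the illegal access can
# be reported several calls after the kernel that caused it. Setting this
# variable before torch initializes CUDA forces synchronous launches, making
# the traceback point at the actual failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after setting the variable
```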
Expected behavior
I was expecting to get a correct output, without the exception.
ds_report output
[2022-06-28 10:35:33,425] [WARNING] [partition_parameters.py:60:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0a0+1606899
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: a single A100 GPU
- Python version: 3.8.5
Launcher context
Launching directly using the Python interpreter.
Additional context
Maybe the bug is related to line 20 in csrc/transformer/inference/includes/custom_cuda_layers.h? It reads:
#define MAX_OUT_TOKES 1024
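One way to probe this hypothesis (a sketch, not part of the original report): bisect the smallest failing sequence length. If the crash boundary sat exactly at 1025, MAX_OUT_TOKES would be implicated; note, though, that the repro above already fails at 1000 tokens. The sketch assumes the snippet above is saved as repro.py, modified to read the sequence length from argv:

```python
import subprocess
import sys

# Each probe runs in a fresh process: after an illegal memory access the CUDA
# context is unusable, so probing repeatedly in one process would not work.
def forward_succeeds(seq_len: int) -> bool:
    # "repro.py" is assumed to be the snippet above, taking seq_len as argv[1].
    result = subprocess.run([sys.executable, "repro.py", str(seq_len)])
    return result.returncode == 0

def smallest_failing_length(lo: int = 1, hi: int = 2048) -> int:
    # Invariant: lo succeeds, hi fails; shrink the bracket by bisection.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if forward_succeeds(mid):
            lo = mid
        else:
            hi = mid
    return hi

print(smallest_failing_length())
```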
@tomeras91 I can confirm that I'm able to reproduce this error. I don't think it has anything to do with MAX_OUT_TOKES. @RezaYazdaniAminabadi could you take a look at this?
Hi @tomeras91,
Thanks for reporting this issue. I will look into it. @mrwyattii, thanks for reproducing this. Yes, I think the issue is probably somewhere else.
Thanks,
Reza
Below is a possibly related bug. I added some sample code to reproduce this error for a GPT2 model on an NVIDIA A10G. Let me know @RezaYazdaniAminabadi @cmikeh2 if you think I should rather file a new issue.
Describe the bug
After initialising a GPT2 model from Huggingface with DeepSpeed, I can run inference on short sequences. But when using long sequences of e.g. 700 tokens, I get multiple !!!! kernel execution error. (m: 700, n: 700, k: 64, error: 14) warnings, which culminate in a RuntimeError: CUDA error: an illegal memory access was encountered. I thus cannot run inference for this model.
Important: I tested old versions and found that I do not encounter that problem for deepspeed versions <= 0.6.6
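A possible temporary mitigation, assuming the regression was indeed introduced after 0.6.6, is therefore to pin the older release with `pip install deepspeed==0.6.6`.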
To Reproduce
Steps to reproduce the behavior:
- Install packages
pip install --upgrade pip
pip uninstall -y torch deepspeed transformers
pip install --upgrade torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install --upgrade deepspeed==0.7.0 transformers==4.21.1
- Run
import os
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = deepspeed.init_inference(model, replace_method='auto', replace_with_kernel_inject=True)
short_input = tokenizer("Hello, my dog is cute." * 1, return_tensors="pt").to("cuda")
long_input = tokenizer("Hello, my dog is cute." * 100, return_tensors="pt").to("cuda")
outputs = model(**short_input) # this works fine
outputs = model(**long_input) # this throws below error
- See error
!!!! kernel execution error. (batch: 12, m: 700, n: 700, k: 64, error: 14)
!!!! kernel execution error. (batch: 12, m: 64, n: 700, k: 700, error: 14)
!!!! kernel execution error. (m: 768, n: 700, k: 768, error: 14)
!!!! kernel execution error. (m: 3072, n: 700, k: 768, error: 14)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_21875/4257510811.py in <cell line: 1>()
----> 1 outputs = model(**long_input)
2 token_id = torch.argmax(outputs.logits.squeeze()[-1]).item()
3 print(tokenizer.decode(token_id), outputs.logits.mean())
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/inference/engine.py in forward(self, *inputs, **kwargs)
528 outputs = self._graph_replay(*inputs, **kwargs)
529 else:
--> 530 outputs = self.module(*inputs, **kwargs)
531 #outputs = self.module(*inputs, **kwargs)
532 return outputs
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1146 input = bw_hook.setup_input_hook(input)
1147
-> 1148 result = forward_call(*input, **kwargs)
1149 if _global_forward_hooks or self._forward_hooks:
1150 for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()):
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1056 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1057
-> 1058 transformer_outputs = self.transformer(
1059 input_ids,
1060 past_key_values=past_key_values,
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
899 )
900 else:
--> 901 outputs = block(
902 hidden_states,
903 layer_past=layer_past,
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py in forward(self, input, input_mask, attention_mask, head_mask, layer_past, get_key_value, get_present, encoder_output, enc_dec_attn_mask, encoder_hidden_states, encoder_attention_mask, use_cache, alibi, output_attentions)
840 presents = (key, value)
841 self.layer_past = presents if layer_past is None else None
--> 842 output = self.mlp(attention_output, input, inp_norm, self.attention.attn_ob)
843
844 if not self.config.pre_layer_norm:
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py in forward(self, input, residual, residual_norm, bias)
710
711 def forward(self, input, residual, residual_norm, bias):
--> 712 return DeepSpeedMLPFunction.apply(input,
713 residual,
714 residual_norm,
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py in forward(ctx, input, residual, residual_norm, bias, inter_w, inter_b, attn_nw, attn_nb, config, mp_group, output_b, output_w, q_scales, q_groups, merge_count, mlp_gemm_func, fused_gemm_gelu, vector_matmul_func, bias_residual_func)
631 config.pre_layer_norm,
632 config.mlp_after_attn)
--> 633 output = vector_matmul_func(intermediate, output_w, False)
634
635 inference_cuda_module.residual_add(
RuntimeError: CUDA error: an illegal memory access was encountered
Expected behavior
I did not expect an error to be thrown; rather, I expected the variable outputs to be filled.
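A workaround sketch (my own, not a confirmed fix): since only long sequences crash, long inputs can be routed to a plain, non-injected copy of the model until the kernel bug is fixed. The 512-token threshold below is an arbitrary assumption, chosen under the ~700 tokens that crash here:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Two copies of the model: one with DeepSpeed kernel injection for short
# inputs, one plain PyTorch fallback for long inputs.
plain_model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()
ds_model = deepspeed.init_inference(
    AutoModelForCausalLM.from_pretrained("gpt2").to("cuda"),
    replace_method="auto",
    replace_with_kernel_inject=True,
)

MAX_FAST_TOKENS = 512  # assumed safe length; the crash here appears at ~700

def forward(batch):
    # Route long sequences around the faulting injected kernels.
    if batch["input_ids"].shape[1] > MAX_FAST_TOKENS:
        with torch.no_grad():
            return plain_model(**batch)
    return ds_model(**batch)

out = forward(tokenizer("Hello, my dog is cute." * 100, return_tensors="pt").to("cuda"))
```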
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
System info (please complete the following information):
- OS: Debian (also Amazon Linux 2)
- GPU count and types: 1 GPU A10G (AWS g5.xlarge notebook instance)
- Python version: 3.8.12 (also with 3.9.12)
Launcher context
Inside a Python notebook.
Possibly related issues:
- https://github.com/microsoft/DeepSpeed/issues/2030
- https://github.com/microsoft/DeepSpeed/issues/1301
I have also encountered this error. Small inputs such as the "DeepSpeed is" prompt from the tutorial produce normal results, but significantly longer inputs lead to an illegal memory access error.
I would try version 0.6.6 or earlier as suggested by @trianxy, but I want to use long sequences with GPTJ and GPT Neo 2.7B, and those had issues until recently, as can be seen in #2233.
My build is the same as the one in that issue, just with DeepSpeed built from source shortly after the PR that fixed the issue.
FYI @mallorbc, @tomeras91, @RezaYazdaniAminabadi:
My related issue, which I detailed above, is fixed in this PR. More precisely, my issue does not appear when I install the commit 4abd455521965930d0e921de8afc0073ea7df9d1.
Thanks for that fix!
If I recall correctly, I also tried building from that PR and had issues with poor outputs for GPT Neo and GPTJ. I believe one of the branches I built fixed the memory error but still gave garbage output for long inputs; perhaps this is the one.
Perhaps I am remembering wrong, though. I will try this again later and see if it fixed anything.
Thanks!
Alright @mallorbc - let me know if you need any support with testing.
It's true that I have seen inconsistent behavior when trying different GPT architectures with different inputs, so it may be that not all cases have been fixed by the mentioned PR. I did not run many test cases with different architectures and inputs.
Hi everyone,
I am encountering the same issues with a RoBERTa-type model on which I ran 8-bit MoQ during training.
When instantiating the inference engine using:
engine = deepspeed.init_inference(deepspeed_trainer.model, dtype=torch.int8, quantization_setting=(False, 64), replace_with_kernel_inject=True)
I get the same error as the one mentioned by @tomeras91 and @trianxy. Do you know when this issue may be fixed?
This also raises a question I had: is it possible to use DeepSpeed for training (typically for QAT) and then infer using torch? I would like to use the quantized weights of my model trained using DeepSpeed to create a quantized torch instance for inference. It is still unclear to me if/how I could do so.
Thank you for your help!
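A rough sketch of one direction for the torch-inference question (untested for the MoQ int8 case): the engine wraps the original module, so the trained weights can be pulled out with state_dict() and loaded into a plain torch instance. Whether the MoQ quantization scales survive this round trip is exactly the open question:

```python
import torch

# Sketch only: extract the trained weights and load them into a plain torch
# model for inference. `deepspeed_trainer.model` is the module passed to
# init_inference above; `plain_model` is an assumed torch instance with the
# same architecture and state_dict keys. Note this recovers the tensors the
# module currently holds; it does not by itself yield an int8 torch model.
state_dict = deepspeed_trainer.model.state_dict()
torch.save(state_dict, "trained_weights.pt")

plain_model.load_state_dict(torch.load("trained_weights.pt"))
plain_model.eval()
```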
Hi @trianxy! Could you solve this issue?