DeepSpeed
[BUG] BF16 raises CUDA error on inference GPT2
Description
I'm trying to run a GPT-2 124M model using BF16 with kernel injection enabled. It keeps giving me a CUDA error that does not occur if I switch to FP16.
To Reproduce Here is my script:
from transformers import AutoModelForCausalLM
import deepspeed
import torch
if __name__ == "__main__":
    model_id = "gpt2"
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)
    ds_config = {
        "tensor_parallel": {"tp_size": 1},
        "dtype": "bf16",
        "replace_with_kernel_inject": True,
        "replace_method": "auto",
    }
    ds_model = deepspeed.init_inference(model=model, config=ds_config)
    input_ids = torch.randint(0, 50257, (5, 256))
    ds_model.module.generate(input_ids.to(ds_model.module.device))
Here is the CUDA error:
Traceback (most recent call last):
File "tmp.py", line 18, in <module>
ds_model.module.generate(input_ids.to(ds_model.module.device))
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1070, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
The script works perfectly if I switch to FP16 (see the variant below). Does DeepSpeed support BF16 inference, or is this a bug?
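For reference, here is a minimal sketch of the FP16 variant that runs without the illegal memory access on my setup; it is identical to the reproduction script except for the dtype field:

# Same reproduction script with dtype switched to "fp16";
# only the dtype entry in ds_config differs from the failing BF16 case.
from transformers import AutoModelForCausalLM
import deepspeed
import torch

if __name__ == "__main__":
    model_id = "gpt2"
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)
    ds_config = {
        "tensor_parallel": {"tp_size": 1},
        "dtype": "fp16",  # works; "bf16" triggers the CUDA error above
        "replace_with_kernel_inject": True,
        "replace_method": "auto",
    }
    ds_model = deepspeed.init_inference(model=model, config=ds_config)
    input_ids = torch.randint(0, 50257, (5, 256))
    ds_model.module.generate(input_ids.to(ds_model.module.device))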
I just opened an issue for this at #2955. I am pretty sure that bfloat16 is not currently supported; float32, float16, and int8 are (though I have had issues with int8).
Take a look at my issue and please answer some of the questions I posed there if you have any insight. Thanks!
@Wenhan-Tan we do not currently support BF16 directly with DeepSpeed inference.
@molly-smith can we add an assertion in the ds-inference config to gracefully error out if someone tries this case?
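A minimal sketch of what such a guard could look like; the constant name, function name, and placement are illustrative assumptions, not the actual DeepSpeed config code:

# Hypothetical dtype guard for the inference config; names are assumptions,
# not the actual DeepSpeed implementation.
import torch

SUPPORTED_INFERENCE_DTYPES = {torch.float32, torch.float16, torch.int8}

def validate_inference_dtype(dtype):
    # Fail fast with a clear message instead of an illegal memory access at runtime.
    if dtype not in SUPPORTED_INFERENCE_DTYPES:
        raise AssertionError(
            f"DeepSpeed inference with kernel injection does not support dtype {dtype}; "
            f"supported dtypes are {SUPPORTED_INFERENCE_DTYPES}"
        )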
@mallorbc @jeffra Thank you for clarifying!