DeepSpeed
[BUG] BF16 raises CUDA error on inference GPT2
Description
I'm trying to run a GPT-2 124M model using BF16 with kernel injection enabled. It keeps giving me a CUDA error that does not occur if I switch to FP16.
To Reproduce Here is my script:
from transformers import AutoModelForCausalLM
import deepspeed
import torch
if __name__ == "__main__":
    model_id = "gpt2"
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)
    ds_config = {
        "tensor_parallel": {"tp_size": 1},
        "dtype": "bf16",
        "replace_with_kernel_inject": True,
        "replace_method": "auto",
    }
    ds_model = deepspeed.init_inference(model=model, config=ds_config)
    input_ids = torch.randint(0, 50257, (5, 256))
    ds_model.module.generate(input_ids.to(ds_model.module.device))
Here is the CUDA error:
Traceback (most recent call last):
File "tmp.py", line 18, in <module>
ds_model.module.generate(input_ids.to(ds_model.module.device))
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1070, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wenhant/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
The script works perfectly if I switch to FP16 (see the variant below). Does DeepSpeed support BF16 inference, or is this a bug?
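For reference, here is a minimal sketch of the FP16 variant that runs without the illegal memory access on my setup; it is identical to the reproduction script except for the dtype field:

# Same reproduction script with dtype switched to "fp16";
# only the dtype entry in ds_config differs from the failing BF16 case.
from transformers import AutoModelForCausalLM
import deepspeed
import torch

if __name__ == "__main__":
    model_id = "gpt2"
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)
    ds_config = {
        "tensor_parallel": {"tp_size": 1},
        "dtype": "fp16",  # works; "bf16" triggers the CUDA error above
        "replace_with_kernel_inject": True,
        "replace_method": "auto",
    }
    ds_model = deepspeed.init_inference(model=model, config=ds_config)
    input_ids = torch.randint(0, 50257, (5, 256))
    ds_model.module.generate(input_ids.to(ds_model.module.device))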
I just opened an issue for this at #2955. I am pretty sure that bfloat16 is not currently supported; float32, float16, and int8 are (though I have had issues with int8).
Take a look at my issue and please answer some of the questions I posed there if you have any insight. Thanks!
@Wenhan-Tan we do not currently support BF16 directly with DeepSpeed inference.
@molly-smith can we add an assertion in the ds-inference config to gracefully error out if someone tries this case?
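A minimal sketch of what such a guard could look like; the constant name, function name, and placement are illustrative assumptions, not the actual DeepSpeed config code:

# Hypothetical dtype guard for the inference config; names are assumptions,
# not the actual DeepSpeed implementation.
import torch

SUPPORTED_INFERENCE_DTYPES = {torch.float32, torch.float16, torch.int8}

def validate_inference_dtype(dtype):
    # Fail fast with a clear message instead of an illegal memory access at runtime.
    if dtype not in SUPPORTED_INFERENCE_DTYPES:
        raise AssertionError(
            f"DeepSpeed inference with kernel injection does not support dtype {dtype}; "
            f"supported dtypes are {SUPPORTED_INFERENCE_DTYPES}"
        )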
@mallorbc @jeffra Thank you for clarifying!