DeepSpeed
[BUG] [0.8.1] INT8 model loading/inference issue
Describe the bug
We conducted tests on OPT / GPT-J / GPT-NeoX / BLOOM-7B in INT8; all of these models produce garbage output on DeepSpeed 0.8.1.
OPT: fails with an NCCL communication issue
GPT-NeoX 20B: produces garbage output
BLOOM-7B: shape '[1, 4, 32, 384]' is invalid for input of size 16384
How did we test? We generated INT8 checkpoints of the models and then loaded them back. Below is an example of doing the same with the DeepSpeed inference test suite:
deepspeed --num_nodes 1 \
--num_gpus 8 \
inference-test.py \
--use_kernel \
--ds_inference \
--use_meta_tensor \
--name EleutherAI/gpt-neox-20b \
--checkpoint_path /tmp/ws/gpt-neox-20b/ \
--save_mp_checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
--dtype int8
deepspeed --num_nodes 1 \
--num_gpus 8 \
inference-test.py \
--use_kernel \
--ds_inference \
--use_meta_tensor \
--name EleutherAI/gpt-neox-20b \
--checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
--dtype int8
More info here: https://github.com/microsoft/DeepSpeed/issues/2770
Creating a new issue to track the int8 checkpoint loading issue.
@HeyangQin
Hi @lanking520 @sindhuvahinis, PR https://github.com/microsoft/DeepSpeed/pull/2875 has been merged to address part of the issue. For now, INT8 saving/loading is still not fully functional due to kernel issues. For the time being, I would suggest the workaround of saving checkpoints in fp32/fp16 and then loading them with int8 to get around this issue.
Thanks for the info. Given the above context, at least INT8 inference (loading from an FP16 checkpoint) should work as expected?
@HeyangQin So I would assume a developer should follow this path:
- Load a model (e.g. GPT-NeoX 20B) and save it as a DeepSpeed-sharded checkpoint in FP16.
- Load that FP16 checkpoint and set the dtype in init_inference to torch.int8 (see the sketch after this list).
And this should work as expected. The only drawback is that the developer may still face a GPU OOM issue when converting FP16 to INT8 at runtime.
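For reference, here is a minimal Python sketch of that path. It assumes init_inference accepts the checkpoint / save_mp_checkpoint_path arguments as used here and that the save step writes a ds_inference_config.json into the target directory; the paths and mp_size are placeholders:
# Sketch of the FP16-save / INT8-load workaround described above.
# Assumptions: paths and mp_size are placeholders; the save step is expected to
# write a ds_inference_config.json that the second run can point at.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Run 1: load the model in FP16 and write a DeepSpeed-sharded FP16 checkpoint.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b",
                                             torch_dtype=torch.float16)
deepspeed.init_inference(
    model,
    mp_size=8,                     # placeholder tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/tmp/ws/sharded-gpt-neox-20b-fp16/",
)

# Run 2 (a separate process): point at the FP16 shards but request INT8 inference.
engine = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
    checkpoint="/tmp/ws/sharded-gpt-neox-20b-fp16/ds_inference_config.json",  # assumed file name
)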
@HeyangQin Loading the BLOOM model from an FP16 checkpoint and then setting dtype=int8 in init_inference does not work :(
Could you please take a look at this issue: https://github.com/microsoft/DeepSpeed/issues/2923? I found that other people are facing the same problem.
Just wanted to +1 this issue: at DeepSpeed 0.9.0, using torch.int8 in deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True) raises errors for various models. Below is some code to quickly reproduce this problem with the small models GPT-neo-125m, Bloom-560m, and gpt2:
# run on NVIDIA A10G, CUDA Version 11.7, Python 3.9
from typing import Any
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
from transformers import AutoTokenizer, AutoModelForCausalLM # v4.28.1
import torch # v1.13.1
import deepspeed # v.0.9.0
def print_next_token(model: Any) -> None:
    output = model(**inputs)
    token_id = torch.argmax(output.logits[0][-1])
    token = tokenizer.decode(token_id)
    print(f"{token=}")
architecture = "gpt2"
# architecture = "EleutherAI/gpt-neo-125m"
# architecture = "bigscience/bloom-560m"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(architecture, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(architecture, low_cpu_mem_usage=True).to(device).eval()
inputs = tokenizer("George Washington was the first US", return_tensors="pt").to(device)
print_next_token(model) # prints ' president'
engine = deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)
print_next_token(engine.module) # -> error
Errors slightly differ, depending on the model:
gpt2 and gpt-neo-125m ->
!!!! kernel execution error. (m: 768, n: 6, k: 2304, error: 13)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
bloom-560m ->
!!!! kernel execution error. (m: 1024, n: 6, k: 3072, error: 13)
RuntimeError: shape '[1, 6, 16, 192]' is invalid for input of size 6144
Also wanted to point out that when using torch.int8 in deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True), this code line is called, which skips running WeightQuantization(...).model_quantize(...), and I am not sure if this is intended and related.
CCing you @RezaYazdaniAminabadi and @jeffra since you may have worked on this piece of code in this commit.
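For context on what the skipped step is meant to do: judging from the traceback further down, model_quantize is expected to return INT8 weights together with scaling factors for the kernels. Below is a generic illustration of symmetric per-tensor INT8 weight quantization; it is only a sketch of the idea, not DeepSpeed's actual implementation:
# Generic illustration of symmetric INT8 weight quantization, i.e. the kind of
# weight + scale pair a quantization pass is expected to hand to the kernels.
# This is NOT DeepSpeed's code path, only a sketch of the idea.
import torch

def quantize_weight_int8(weight: torch.Tensor):
    # Map the largest-magnitude value onto the INT8 range [-127, 127].
    scale = weight.abs().max() / 127.0
    q_weight = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q_weight, scale

def dequantize_weight(q_weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q_weight.float() * scale

w = torch.randn(768, 2304)  # e.g. a gpt2-sized fused QKV projection weight
qw, scale = quantize_weight_int8(w)
print((w - dequantize_weight(qw, scale)).abs().max())  # max quantization error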
Same bug: kernel execution error; the error code is 13, 14, or 15.
Simply adjusting that statement so that model_quantize is actually called does not work :)
    model = deepspeed.init_inference(
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 161, in __init__
    self._convert_to_dtype(config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 524, in _convert_to_dtype
    model, self.quantization_scales = quantizer.model_quantize(self.module, self.injection_dict,
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/runtime/weight_quantizer.py", line 153, in model_quantize
    return quantized_module, torch.cat(all_scales)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
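The final RuntimeError itself is just torch.cat being handed an empty list, which suggests all_scales stayed empty, i.e. no layers were actually collected for quantization on this model (a guess based on the traceback). A minimal reproduction of that last step:
import torch

all_scales = []        # no scales were collected during quantization
torch.cat(all_scales)  # RuntimeError: torch.cat(): expected a non-empty list of Tensors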