
[BUG] [0.8.1] INT8 model loading/inference issue

Open sindhuvahinis opened this issue 1 year ago • 8 comments

Describe the bug

We tested OPT, GPT-J, GPT-NeoX, and BLOOM-7B with INT8; on DeepSpeed 0.8.1 all of these models fail or produce garbage output:

  • OPT: fails with an NCCL communication issue
  • GPT-NeoX 20B: produces garbage output
  • BLOOM-7B: shape '[1, 4, 32, 384]' is invalid for input of size 16384

How did we test? We generated INT8 checkpoints of each model and then loaded them back. Below is an example of doing the same with the DeepSpeed inference test suite:

deepspeed --num_nodes 1 \
    --num_gpus 8 \
    inference-test.py \
    --use_kernel \
    --ds_inference \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b \
    --checkpoint_path /tmp/ws/gpt-neox-20b/ \
    --save_mp_checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8

deepspeed --num_nodes 1 \
    --num_gpus 8     \
    inference-test.py     \
    --use_kernel     \
    --ds_inference     \
    --use_meta_tensor \
    --name EleutherAI/gpt-neox-20b     \
    --checkpoint_path /tmp/ws/sharded-gpt-neox-20b/ \
    --dtype int8

More info here: https://github.com/microsoft/DeepSpeed/issues/2770

Creating a new issue to track the int8 checkpoint loading issue.

sindhuvahinis avatar Feb 22 '23 21:02 sindhuvahinis

@HeyangQin

lanking520 avatar Feb 22 '23 23:02 lanking520

Hi @lanking520 @sindhuvahinis, PR https://github.com/microsoft/DeepSpeed/pull/2875 has been merged to address part of the issue. For now, INT8 saving/loading is still not fully functional due to kernel issues. I would suggest using the workaround of saving checkpoints in fp32/fp16 and then loading them with int8 to get around this for the time being.
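
For concreteness, here is a rough sketch of the first half of that workaround (saving the sharded checkpoint in FP16 instead of INT8), assuming init_inference accepts keyword arguments that mirror the CLI flags used in the repro above (mp_size, dtype, save_mp_checkpoint_path); the output path is a placeholder:

# Sketch only: shard and save the checkpoint in FP16 rather than INT8.
# Assumes init_inference kwargs analogous to the --dtype /
# --save_mp_checkpoint_path flags of inference-test.py above.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", low_cpu_mem_usage=True)

engine = deepspeed.init_inference(
    model,
    mp_size=8,                              # tensor-parallel degree, as in the repro
    dtype=torch.float16,                    # save in FP16, not INT8
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/tmp/ws/sharded-gpt-neox-20b-fp16/",  # placeholder path
)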

HeyangQin avatar Feb 28 '23 20:02 HeyangQin

Thanks for the info. Given the above context, at least INT8 inference (loading from an FP16 checkpoint) should work as expected?

lanking520 avatar Feb 28 '23 20:02 lanking520

@HeyangQin So I would assume developers should follow this path:

  • Load a model (e.g. GPT-NeoX 20B) and save it as a DeepSpeed-sharded checkpoint in FP16.
  • Load that FP16 checkpoint and set the dtype in init_inference to torch.int8.

And this should work as expected. The only drawback is that developers may still face GPU OOM issues when converting FP16 to INT8 at runtime (a rough sketch of the second step is below).
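
A rough, untested sketch of that second (INT8-loading) step, assuming init_inference's checkpoint argument can point at the descriptor JSON written next to the FP16 shards, the way the test suite's --checkpoint_path does, and using meta-tensor loading as in the repro; the file name and path are placeholders:

# Sketch only: load the FP16-sharded checkpoint back and request INT8 kernels.
# The checkpoint descriptor name below is a placeholder for whatever JSON the
# FP16 save step actually produced.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

engine = deepspeed.init_inference(
    model,
    mp_size=8,
    dtype=torch.int8,                       # ask DeepSpeed to run in INT8
    replace_with_kernel_inject=True,
    checkpoint="/tmp/ws/sharded-gpt-neox-20b-fp16/ds_inference_config.json",  # placeholder
)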

lanking520 avatar Feb 28 '23 20:02 lanking520

@HeyangQin Loading the BLOOM model from an FP16 checkpoint and then setting dtype=int8 in init_inference does not work :(

Could you please take a look at this issue: https://github.com/microsoft/DeepSpeed/issues/2923? I have found other people facing the same problem.

crazycth avatar Mar 16 '23 08:03 crazycth

Just wanted to +1 this issue: with DeepSpeed 0.9.0, using torch.int8 in

deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)

raises errors for various models. Below is some code to quickly reproduce the problem with the small models GPT-Neo-125m, BLOOM-560m, and gpt2:

# run on NVIDIA A10G, CUDA Version 11.7, Python 3.9

from typing import Any
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoTokenizer, AutoModelForCausalLM  # v4.28.1
import torch  # v1.13.1
import deepspeed  # v.0.9.0

def print_next_token(model: Any) -> None:
    output = model(**inputs)
    token_id = torch.argmax(output.logits[0][-1])
    token = tokenizer.decode(token_id)
    print(f"{token=}")

architecture = "gpt2"
# architecture = "EleutherAI/gpt-neo-125m"
# architecture = "bigscience/bloom-560m"

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(architecture, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(architecture, low_cpu_mem_usage=True).to(device).eval()
inputs = tokenizer("George Washington was the first US", return_tensors="pt").to(device)

print_next_token(model) # prints ' president'

engine = deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True)

print_next_token(engine.module) # -> error

The errors differ slightly depending on the model:

gpt2 and gpt-neo-125m -> 
!!!! kernel execution error. (m: 768, n: 6, k: 2304, error: 13) 
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

bloom-560m -> 
!!!! kernel execution error. (m: 1024, n: 6, k: 3072, error: 13) 
RuntimeError: shape '[1, 6, 16, 192]' is invalid for input of size 6144

trianxy avatar Apr 20 '23 10:04 trianxy

Also wanted to point out that when using torch.int8 in deepspeed.init_inference(model, dtype=torch.int8, replace_with_kernel_inject=True), this code line is called, which skips running WeightQuantization(...).model_quantize(...); I am not sure whether this is intended or related.
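
For what it's worth, a quick diagnostic (just a sketch, and not conclusive, since injected kernels may store weights in their own containers) is to count the parameter dtypes of the wrapped module after init_inference, using the engine from the repro above:

# Diagnostic sketch: count parameter dtypes of the injected module.
# If nothing shows up as int8, the quantization step was presumably skipped.
from collections import Counter

dtype_counts = Counter(p.dtype for p in engine.module.parameters())
print(dtype_counts)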

ccing you @RezaYazdaniAminabadi and @jeffra since you may have worked on this piece of code in this commit

trianxy avatar Apr 20 '23 10:04 trianxy

Same bug here, with the same reproduction as @trianxy above: kernel execution error, with error code 13, 14, or 15.

Moran232 avatar Jun 17 '23 06:06 Moran232

Simply adjusting the statement @trianxy pointed out above (so that model_quantize actually runs) does not work :)

model = deepspeed.init_inference(
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 161, in __init__
    self._convert_to_dtype(config)
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 524, in _convert_to_dtype
    model, self.quantization_scales = quantizer.model_quantize(self.module, self.injection_dict,
  File "/home/a/miniforge3/envs/llm_bench/lib/python3.9/site-packages/deepspeed/runtime/weight_quantizer.py", line 153, in model_quantize
    return quantized_module, torch.cat(all_scales)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

SebastianBodza avatar Aug 02 '23 15:08 SebastianBodza