
[BUG] Pythia (GPT-NeoX based) models degrade in generation quality using DeepSpeed Inference

Open · tokestermw opened this issue

Describe the bug
Hi, the generation quality of GPT-NeoX-based Pythia models degrades once they are optimized with DeepSpeed Inference.

Edit (2023/03/07): related to https://github.com/microsoft/DeepSpeed/issues/2777.

To Reproduce

from transformers import pipeline
import deepspeed
import torch

prompt = "As the final model release of GPT-2’s staged release, we’re releasing the largest version (1.5B parameters) of GPT-2 along with code and model weights to facilitate detection of outputs of GPT-2 models. While there have been larger language models released since August, we’ve continued with our original staged release plan in order to provide the community with a test case of a full staged release process. We hope that this test case will be useful to developers of future powerful models, and we’re actively continuing the conversation with the AI community on responsible publication."

gpt2_pipe = pipeline('text-generation', 'gpt2', device=0)
gpt2_pipe(prompt, max_new_tokens=50, return_full_text=False, do_sample=False)

# [{'generated_text': '\n\nWe are also pleased to announce that the GPT-2 model release is now available for download on the GPT-2 website.\n\nThe GPT-2 model release is available for download on the GPT-2 website.'}]

gpt2_pipe.model = deepspeed.init_inference(gpt2_pipe.model, replace_with_kernel_inject=True, replace_method='auto', dtype=torch.half, enable_cuda_graph=False)
gpt2_pipe(prompt, max_new_tokens=50, return_full_text=False, do_sample=False)

# [{'generated_text': '\n\nWe are also pleased to announce that the GPT-2 model release is now available for download on the GPT-2 website.\n\nThe GPT-2 model release is available for download on the GPT-2 website.'}]

pythia_pipe = pipeline('text-generation', 'EleutherAI/pythia-125m-deduped', device=0)
pythia_pipe(prompt, max_new_tokens=50, return_full_text=False, do_sample=False)

# [{'generated_text': '\n\nWe’re also working on a new model release for GPT-2, which will be released in the next few weeks. We’re working on a new model release for GPT-2, which will be released in the next'}]

pythia_pipe.model = deepspeed.init_inference(pythia_pipe.model, replace_with_kernel_inject=True, replace_method='auto', dtype=torch.half, enable_cuda_graph=False)
pythia_pipe(prompt, max_new_tokens=50, return_full_text=False, do_sample=False)

# [{'generated_text': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nof the GTS of the G. The GTS of the G. The G. The G. The G. The G. The G. The G. The'}]

Expected behavior
Generation results should be identical with and without DeepSpeed for Pythia models.
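One way to localize the divergence (a minimal sketch, not part of the original report; the prompt is illustrative) is to compare the logits of a single forward pass between the plain fp16 model and the kernel-injected one:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/pythia-125m-deduped'
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer('The boiling point of water is', return_tensors='pt').to('cuda')

# Baseline logits from the unmodified fp16 model.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half).to('cuda').eval()
with torch.no_grad():
    hf_logits = model(**inputs).logits.clone()

# init_inference mutates the module in place, so run it after the baseline pass.
ds_model = deepspeed.init_inference(model, dtype=torch.half, replace_with_kernel_inject=True)
with torch.no_grad():
    ds_logits = ds_model(**inputs).logits

# fp16 kernel fusion tolerates small numeric drift; a large gap points at a kernel bug.
print('max abs diff:', (hf_logits - ds_logits).abs().max().item())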

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/venv/lib/python3.7/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/venv/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 1x A10G (AWS g5.xlarge)
  • transformers: 4.26.1
  • Python version: 3.7

Docker context: 11.7.1-cudnn8-devel-ubuntu20.04

tokestermw · Feb 19 '23

Hi @satpalsr, I've updated to DeepSpeed 0.8.2, but I'm getting the same results:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
[2023-03-17 06:03:09,070] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-17 06:03:09,071] [WARNING] [config_utils.py:77:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-03-17 06:03:09,071] [INFO] [logging.py:77:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Out[1]: [{'generated_text': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nof the GTS of the G. The GTS of the G. The G. The G. The G. The G. The G. The G. The'}]

ds_report info:

async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/venv/lib/python3.7/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/root/venv/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.8.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

tokestermw · Mar 17 '23

Also running into this issue.

Yard1 · Apr 18 '23

Also running into this issue.

Code to reproduce below (run with `deepspeed --num_gpus=1`). DeepSpeed version is 0.9.2.

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"

model = (
    AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
    )
    .eval()
    .to("cuda")
)


tokenizer_loading_params = {"padding_side": "left"}

tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_loading_params)
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    ["What is the boiling point of water?", "My name is Lewis and I like to"],
    [
        "The best restaurant in San Francisco is",
        "Q: What are the steps needed to bake a cake?",
    ],
]

hf_result = []
hf_input = []
for batch in prompts:
    batch_input = tokenizer(
        batch, return_tensors="pt", truncation=True, padding=True
    ).to("cuda")
    hf_input.append(batch_input)
    tokens = model.generate(
        **batch_input,
        num_beams=1,
        max_new_tokens=32,
        early_stopping=False,
        repetition_penalty=2.0,
    )
    hf_result.extend(tokenizer.batch_decode(tokens, skip_special_tokens=True))

default_ds_config = {
    "dtype": torch.float16,
    "tensor_parallel": {
        "enabled": True,
        "tp_size": 1,
    },
    "replace_with_kernel_inject": True,
}

# Deepspeed
ds_result = []
ds_input = []
deepspeed.init_distributed()
ds_model = deepspeed.init_inference(model=model, config=default_ds_config).eval()

for batch in prompts:
    batch_input = tokenizer(
        batch, return_tensors="pt", truncation=True, padding=True
    ).to("cuda")
    ds_input.append(batch_input)
    tokens = ds_model.generate(
        **batch_input,
        num_beams=1,
        max_new_tokens=32,
        early_stopping=False,
        repetition_penalty=2.0,
    )
    ds_result.extend(tokenizer.batch_decode(tokens, skip_special_tokens=True))

assert len(hf_input) == len(ds_input)
for hf_batch_input, ds_batch_input in zip(hf_input, ds_input):
    assert torch.equal(hf_batch_input["input_ids"], ds_batch_input["input_ids"])
    assert torch.equal(
        hf_batch_input["attention_mask"], ds_batch_input["attention_mask"]
    )
print("HuggingFace result:")
print(hf_result)
print("Deepspeed result:")
print(ds_result)
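
A small follow-up check, not in the original snippet, that flags any prompt whose DeepSpeed completion diverges from the HuggingFace baseline:

# Illustrative addition: diff the two result lists prompt by prompt.
for prompt, hf_out, ds_out in zip(sum(prompts, []), hf_result, ds_result):
    if hf_out != ds_out:
        print(f"MISMATCH for {prompt!r}:")
        print(f"  HF: {hf_out!r}")
        print(f"  DS: {ds_out!r}")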

brevity2021 · May 15 '23

Looks like version 0.9.4 works :) Closing.

Guessing the LLaMA support work fixed the GPT-NeoX-type models: https://github.com/microsoft/DeepSpeed/pull/3425
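
For anyone pinning versions, a minimal guard (an illustrative sketch, assuming the fix indeed landed in 0.9.4 as this thread suggests) can fail fast on older installs:

import deepspeed
from packaging import version

# Assumption from this thread (not independently verified): kernel injection
# for GPT-NeoX/Pythia models degrades generations before DeepSpeed 0.9.4.
assert version.parse(deepspeed.__version__) >= version.parse("0.9.4"), (
    f"DeepSpeed {deepspeed.__version__} may degrade Pythia generations; upgrade to >= 0.9.4"
)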

tokestermw · Jun 12 '23