
[BUG] GPT-J InferenceEngine outputs diverging from base GPT-J

Open joehoover opened this issue 2 years ago • 5 comments

Describe the bug The GPT-J InferenceEngine returns low-quality outputs that diverge from the base model.

UPDATE

I rolled back to the versions we have in production and confirmed that GPT-J and the GPT-J InferenceEngine return identical outputs. I have not tested any later versions, so at the moment I can't say when the bug was introduced (or where its cause lies).

Specifically, I observed identical outputs when running roughly the same code as below in a container with:

# Base container
pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel

transformers==4.15 
deepspeed==0.5.10

To Reproduce
Steps to reproduce the behavior:

  1. Install packages:
pip3 install torch==1.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install transformers==4.20.1
pip install deepspeed==0.6.5
  2. Run the following:
import os
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Get local gpu rank from torch.distributed/deepspeed launcher
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

pipe(
    "All happy families are alike, but ",
    do_sample=False,
)

# Note: depending on your device, you might need to clear the GPU cache before initializing the inference engine.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.float16,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
pipe(
    "All happy families are alike, but ",
    do_sample=False,
)

Expected behavior With do_sample=False, these outputs should be identical. Further, the repetition of newlines seems pathological. Testing across various other inputs, I noticed similarly degraded outputs from the inference engine.
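
For a check that is stricter than eyeballing the strings, the text returned by each pipeline call above can be captured and compared; a minimal sketch, reusing the names from the script (the variable names are just illustrative):

# Minimal sketch: capture the text from each pipe() call above and compare.
# base_text comes from the call made before deepspeed.init_inference,
# ds_text from the call made after; with do_sample=False they should match.
base_text = pipe("All happy families are alike, but ", do_sample=False)[0]["generated_text"]
# ... re-run after rebuilding `pipe` around the DeepSpeed engine ...
ds_text = pipe("All happy families are alike, but ", do_sample=False)[0]["generated_text"]
assert base_text == ds_text, "greedy outputs diverge between base model and InferenceEngine"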

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 10.2

System info (please complete the following information):

  • OS: AWS SageMaker Notebook instance
  • GPU count and types: 1 Nvidia 16GB T4 GPU.
  • Python version: 3.8

Launcher context No launcher, just running in a notebook.

Additional context I believe that this exact code was working until recently; however, I'm not sure that I ever explicitly tested output equality. I can say that output quality definitely seems degraded.

UPDATE:

I believe I've isolated the surface-level cause of the divergence.

The initial behavior I observed replicates reliably when low_cpu_mem_usage=True is used to load the pretrained model. E.g.:

from transformers import AutoTokenizer, pipeline, GPTJForCausalLM
import torch
import deepspeed

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True  # low cpu usage set true here
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", revision="float16")

model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.float16,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

pipe(
    "All happy families are alike, but ",
    do_sample=False,
)

reliably returns: [{'generated_text': 'All happy families are alike, but --\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\nw\nw\n'}]

However, a model loaded with low_cpu_mem_usage=False reliably returns: [{'generated_text': 'All happy families are alike, but \nevery unhappy family is unhappy in its own way.\n\n—LEWIS CARROLL\n\n## **THE BEGINNING**\n\nLet the conversation begin...\n\nFollow the Penguin'}]

UPDATE 2:

I also observe divergent outputs when I export the model with torch (e.g., torch.save()) and load it with torch.load().
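
For reference, the round trip I'm describing is roughly the following (the path is just a placeholder):

import torch

# Roughly the export/reload flow described above; the path is a placeholder.
torch.save(model, "/tmp/gpt-j-fp16.pt")   # serialize the loaded HF model object
model = torch.load("/tmp/gpt-j-fp16.pt")  # reload it before deepspeed.init_inference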

joehoover avatar Jun 23 '22 19:06 joehoover

Hi @joehoover ,

I ran this with the master branch and I see the following output: [{'generated_text': 'All happy families are alike, but each unhappy family is unhappy in its own way.\n\n—Leo Tolstoy\n\nI’m not sure if I’m a happy person. I’m not sure if I'}]

I also see the same output when running several times with do_sample=False. Can you please test this on your side with the latest deepspeed branch and see if the issue persists? Thanks, Reza
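
If it helps, one way to pick up the latest code without cloning is to install straight from GitHub, which builds the current master branch:

pip install git+https://github.com/microsoft/DeepSpeed.git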

RezaYazdaniAminabadi avatar Jun 23 '22 23:06 RezaYazdaniAminabadi

Hey @RezaYazdaniAminabadi , thanks for looking into this.

Something seems weird...are you using the fp16 weights? The issue I described above aside, I'm getting consistently different results from you even from the baseline model.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained(model_dir)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

pipe(
    "All happy families are alike, but ",
    do_sample=False,
)

Returns:

[{'generated_text': 'All happy families are alike, but \nevery unhappy family is unhappy in its own way.\n\n—LEWIS CARROLL\n\n## THE BEGINNING\n\nLet the conversation begin...\n\nFollow the Penguin'}]

I also installed DeepSpeed from main and observed the same output as above.

I'm going to run some of the GPT-J Transformers unit tests and see if they pass. I'll follow up here and let you know what I find.

UPDATE:

I did some more testing @RezaYazdaniAminabadi and it looks like you're dropping the trailing " " I had in my prompt. With the base model, "All happy families are alike, but" yields the output you observed, but "All happy families are alike, but " yields the output I observed.
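
For anyone comparing, the trailing space changes the tokenization of the prompt, which is enough to change a greedy generation; a quick way to see the difference, using the tokenizer loaded above:

# The trailing space produces a different token sequence for the prompt.
a = tokenizer("All happy families are alike, but")["input_ids"]
b = tokenizer("All happy families are alike, but ")["input_ids"]
print(a == b)  # False: the two prompts tokenize differently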

I also confirmed that I can pass several of the equality assertions in GPT-J's unit tests.

import numpy as np
import torch
from transformers import AutoTokenizer, GPTJForCausalLM

torch_device = "cuda:0"
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16)
model.to(torch_device)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", revision="float16")

input_ids = torch.tensor([[464, 3290]], dtype=torch.long, device=torch_device)  # The dog
expected_output_ids = np.array([464, 3290, 318, 257, 582, 338, 1266, 1545, 13, 632, 318, 257, 9112, 15185, 11, 290, 340, 318, 257, 1545])


output_ids = model.generate(input_ids, do_sample=False)[0].cpu().numpy()
np.array_equal(expected_output_ids, output_ids)
torch.manual_seed(0)
torch_device="cuda"
tokenized = tokenizer("Today is a nice day and", return_tensors="pt", return_token_type_ids=True)
input_ids = tokenized.input_ids.to(torch_device)
output_ids = model.generate(input_ids, do_sample=True)
output_str = tokenizer.decode(output_ids[0], skip_special_tokens=True)


if torch_device == "cuda":
    EXPECTED_OUTPUT_STR = (
        "Today is a nice day and I've already been enjoying it. I walked to work with my wife"
    )
else:
    EXPECTED_OUTPUT_STR = "Today is a nice day and one of those days that feels a bit more alive. I am ready"
    
output_str

# Note, there is a substring match in output_str, but more tokens are generated than are specified in the expected string.
# perhaps due to a change in default behavior?

"Today is a nice day and I've already been enjoying it. I walked to work with my wife. It was good to just talk to each other. We also had another great dinner date with friends for our anniversary.\n\nToday was a relaxing"

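If the extra tokens are just down to a longer default generation length in this transformers version, pinning the length should make the comparison exact (the value below is illustrative, not taken from the original unit test):

# Guess: pin the generation length so the output matches the expected string exactly.
output_ids = model.generate(input_ids, do_sample=True, max_new_tokens=16)  # 16 is illustrative
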
UPDATE 2:

I just failed to replicate the issue I originally noted -- my InferenceEngine output is identical to the output from the base model. I'll try to isolate the issue and report back.

UPDATE 3:

I believe I've isolated the surface-level cause of the divergence.

The initial behavior I observed replicates reliably when low_cpu_mem_usage=True is used to load the pretrained model. E.g.:

from transformers import AutoTokenizer, pipeline, GPTJForCausalLM
import torch
import deepspeed

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True  # low cpu usage set true here
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", revision="float16")

model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.float16,
                                 replace_method='auto',
                                 replace_with_kernel_inject=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

pipe(
    "All happy families are alike, but ",
    do_sample=False,
)

reliably returns: [{'generated_text': 'All happy families are alike, but --\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\nw\nw\n'}]

However, a model loaded with low_cpu_mem_usage=False reliably returns: [{'generated_text': 'All happy families are alike, but \nevery unhappy family is unhappy in its own way.\n\n—LEWIS CARROLL\n\n## **THE BEGINNING**\n\nLet the conversation begin...\n\nFollow the Penguin'}]
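
One way to narrow this down further, assuming enough CPU RAM for two fp16 copies of GPT-J, would be to load the checkpoint both ways and diff the state dicts before DeepSpeed touches the model; a minimal sketch:

import torch
from transformers import GPTJForCausalLM

# Minimal sketch: check whether the two loading paths even yield identical weights,
# before any DeepSpeed call. Requires enough CPU RAM for two fp16 copies.
m_fast = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
m_slow = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=False
)

mismatched = [
    name
    for (name, p), (_, q) in zip(m_fast.state_dict().items(), m_slow.state_dict().items())
    if not torch.equal(p, q)
]
print(mismatched or "state dicts are identical")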

joehoover avatar Jun 24 '22 14:06 joehoover

Hi, I also ran into a similar problem: GPT-J with DeepSpeed generates worse output than the original model. The difference in my case is that with the prompt above ("All happy families are alike, but ") or any other prompt with a short token length, I always get exactly the same outputs from the original model and the DeepSpeed one; but once I increase the length of the prompt, the divergence appears and the single-character repetition begins.

My environment is:
python==3.8.11, pytorch==1.11.0, transformers==4.16.0, deepspeed==0.6.5, CUDA=11.3, GPU=A30

PanQiWei avatar Jun 28 '22 16:06 PanQiWei

@PanQiWei , could you share a reproducible code chunk? I'd like to see if we're getting the same outputs and if I can reproduce your observations.

Also, just to confirm, you observe this specifically when you increase the length of the prompt, but not when you generate more tokens, given a prompt?

UPDATE:

I was able to reproduce @PanQiWei 's observation. I have not attempted to determine when divergences start occurring, but the following 930 token prompt yields this output from a GPT-J InferenceEngine with greedy search: [{'generated_text': '\n.....\n.\n.\n...\n.\n..\n.\n\nThem.\n\nThe\n..\n\n.\n\n..\n.\n\n.\n.\n\n.\n'}]

However, the vanilla model yields: [{'generated_text': '\n\nBraddocks was a very nice boy, but he was a little too nice. He was a little too nice to be a good friend. He was too nice to be a good friend to Cohn, who was too nice to be a'}]

text = (
"""Robert Cohn was once middleweight boxing champion of Princeton. Do not think that I am very much impressed by that as a boxing title, but it meant a lot to Cohn. He cared nothing for boxing, in fact he disliked it, but he learned it painfully and thoroughly to counteract the feeling of inferiority and shyness he had felt on being treated as a Jew at Princeton. There was a certain inner comfort in knowing he could knock down anybody who was snooty to him, although, being very shy and a thoroughly nice boy, he never fought except in the gym. He was Spider Kelly’s star pupil. Spider Kelly taught all his young gentlemen to box like featherweights, no matter whether they weighed one hundred and five or two hundred and five pounds. But it seemed to fit Cohn. He was really very fast. He was so good that Spider promptly overmatched him and got his nose permanently flattened. This increased Cohn’s distaste for boxing, but it gave him a certain satisfaction of some strange sort, and it certainly improved his nose. In his last year at Princeton he read too much and took to wearing spectacles. I never met any one of his class who remembered him. They did not even remember that he was middleweight boxing champion.

I mistrust all frank and simple people, especially when their stories hold together, and I always had a suspicion that perhaps Robert Cohn had never been middleweight boxing champion, and that perhaps a horse had stepped on his face, or that maybe his mother had been frightened or seen something, or that he had, maybe, bumped into something as a young child, but I finally had somebody verify the story from Spider Kelly. Spider Kelly not only remembered Cohn. He had often wondered what had become of him.

Robert Cohn was a member, through his father, of one of the richest Jewish families in New York, and through his mother of one of the oldest. At the military school where he prepped for Princeton, and played a very good end on the football team, no one had made him race-conscious. No one had ever made him feel he was a Jew, and hence any different from anybody else, until he went to Princeton. He was a nice boy, a friendly boy, and very shy, and it made him bitter. He took it out in boxing, and he came out of Princeton with painful self-consciousness and the flattened nose, and was married by the first girl who was nice to him. He was married five years, had three children, lost most of the fifty thousand dollars his father left him, the balance of the estate having gone to his mother, hardened into a rather unattractive mould under domestic unhappiness with a rich wife; and just when he had made up his mind to leave his wife she left him and went off with a miniature-painter. As he had been thinking for months about leaving his wife and had not done it because it would be too cruel to deprive her of himself, her departure was a very healthful shock.

The divorce was arranged and Robert Cohn went out to the Coast. In California he fell among literary people and, as he still had a little of the fifty thousand left, in a short time he was backing a review of the Arts. The review commenced publication in Carmel, California, and finished in Provincetown, Massachusetts. By that time Cohn, who had been regarded purely as an angel, and whose name had appeared on the editorial page merely as a member of the advisory board, had become the sole editor. It was his money and he discovered he liked the authority of editing. He was sorry when the magazine became too expensive and he had to give it up.

By that time, though, he had other things to worry about. He had been taken in hand by a lady who hoped to rise with the magazine. She was very forceful, and Cohn never had a chance of not being taken in hand. Also he was sure that he loved her. When this lady saw that the magazine was not going to rise, she became a little disgusted with Cohn and decided that she might as well get what there was to get while there was still something available, so she urged that they go to Europe, where Cohn could write. They came to Europe, where the lady had been educated, and stayed three years. During these three years, the first spent in travel, the last two in Paris, Robert Cohn had two friends, Braddocks and myself. Braddocks was his literary friend. I was his tennis friend."""
)

pipe(text, do_sample=False, max_new_tokens=50)
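
For reference, the prompt length can be sanity-checked with the tokenizer loaded earlier (the exact count may vary slightly by tokenizer version):

# Quick sanity check on the prompt length quoted above.
print(len(tokenizer(text)["input_ids"]))  # roughly 930 tokens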

joehoover avatar Jun 28 '22 20:06 joehoover

I also encountered similar results given long prompts, example output is the lovely string: ​ ­ � re  A p first _- long m ex

PyxAI avatar Jul 13 '22 22:07 PyxAI

@joehoover, @PyxAI, @PanQiWei: the issue related to low_cpu_mem_usage has been root-caused, and the fix is here: https://github.com/microsoft/DeepSpeed/pull/2489. Could you please try it on your side and confirm whether the issue is fixed?
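
For anyone who wants to try the fix before it lands in a release, the PR branch can be fetched via GitHub's pull-request refs and installed from source, roughly:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
git fetch origin pull/2489/head:pr-2489 && git checkout pr-2489
pip install .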

lokoppakmsft avatar Nov 08 '22 19:11 lokoppakmsft