[BUG] DeepSpeed non-deterministic inference with HF GPT2 when `replace_with_kernel_inject=True`

Describe the bug

https://github.com/microsoft/DeepSpeed/issues/1950 describes a bug in which running inference twice on the same input produces different outputs. It was supposedly fixed in version 0.6.5, but I am encountering a similar bug (with Hugging Face's GPT2, on an NVIDIA A10G) in every DeepSpeed version from 0.6.3 onward when running long sequences. My current workaround is to use version 0.6.1.

Note: When the sequence is too short, this bug does not appear. When the sequence is too long, I instead hit another open bug (https://github.com/microsoft/DeepSpeed/issues/2062) that prevents inference altogether.

Perhaps related bug: https://github.com/microsoft/DeepSpeed/issues/2229

To Reproduce

Install packages

!pip uninstall -y torch deepspeed transformers
!pip install --upgrade pip
!pip install --upgrade torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install --upgrade deepspeed==0.7.0 transformers==4.21.1

Run code

import os
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = deepspeed.init_inference(model, dtype=torch.half, replace_method='auto', replace_with_kernel_inject=True)

long_sequence = "asdfjk **[][] 890 889288 =-0=- 888***&*&#*$&*(#$ &*#$ &*( *(&))  lf  ds890342888977889888***&*&#*$&*(#$ &*#$ &*( *(&))890234908 fdS 809d890342888977889888***&*&#*$&*(#$ &*#$ &*( *(&))fs 8903428889&*(#$ &*#$ &*( *(&)))"
complex_input = tokenizer(long_sequence, return_tensors="pt").to("cuda")

for _ in range(3):
    outputs = model(**complex_input)
    token_id = torch.argmax(outputs.logits.squeeze()[-1]).item()
    print(tokenizer.decode(token_id), outputs.logits.mean().item())  # we should always see the same output, but we don't

Observe that the output of the last print statement is different each time, although the input is always the same. The last time I ran it, I got, for example:

 Season -125.25
sp -170.25
 A -82.25

Expected behavior

I expected to see the same output each time, i.e.

 Season -125.25
 Season -125.25
 Season -125.25
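The printed mean and argmax token only summarize each run. To compare runs more strictly, a small helper like the following could be used. This is a minimal sketch, not part of the original report: `max_deviations` and `is_deterministic` are hypothetical names, and the helper assumes the model output has been flattened into a plain sequence of floats.

```python
from typing import Callable, List, Sequence


def max_deviations(run: Callable[[], Sequence[float]], repeats: int = 3) -> List[float]:
    """Call `run` `repeats` times and return, for each later run, its maximum
    absolute difference from the first run. All zeros means exact agreement."""
    if repeats < 2:
        raise ValueError("need at least two runs to compare")
    reference = list(run())
    deviations = []
    for _ in range(repeats - 1):
        current = list(run())
        if len(current) != len(reference):
            raise ValueError("runs returned outputs of different lengths")
        deviations.append(
            max((abs(a - b) for a, b in zip(reference, current)), default=0.0)
        )
    return deviations


def is_deterministic(run: Callable[[], Sequence[float]], repeats: int = 3, tol: float = 0.0) -> bool:
    """Return True if every repeated run stays within `tol` of the first run."""
    return all(d <= tol for d in max_deviations(run, repeats))
```

With the reproduction script above, `run` could be something like `lambda: model(**complex_input).logits[0, -1].float().tolist()`, which compares the last-position logits across runs. A deterministic model should give deviations of exactly 0.0.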

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu116
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

System info (please complete the following information):

OS: Amazon Linux 2
GPU count and types: one NVidia A10G (AWS g5.xlarge)
Python version: 3.8.12

Launcher context

inside a Python notebook

Aug 19 '22 19:08 trianxy

Hi @trianxy ,

I think I know where this issue is coming from: the max-tokens limit was reduced to 128 here. We have a PR to fix this and will merge it soon. Thanks, Reza

Aug 19 '22 23:08 RezaYazdaniAminabadi

Okay, I have verified that changing MAX_OUT_TOKES to a large enough number of tokens makes the problem go away. We will merge the PR soon to resolve this issue. cc: @cmikeh2

Aug 20 '22 01:08 RezaYazdaniAminabadi

Hi @trianxy,

I'm sorry for the lack of updates on this, but with the latest master (which should be released as 0.7.5 in the next few days) I believe the issue you're observing here is fixed. Would you mind testing this on your end to verify?

Thanks!

Nov 12 '22 17:11 cmikeh2

Thank you @cmikeh2 for coming back to me on that. I think the above issue can be closed, because it is fixed in versions 0.7.5+f2710bbe BUT ALSO in 0.7.4.

Does the fact that it already works in 0.7.4 raise any red flags for you that we might be missing something?

I am happy to do additional tests.

Nov 13 '22 20:11 trianxy

0.7.4 did include some fixes designed to address this and related issues, but it also introduced a couple of regressions elsewhere, which made it somewhat unpredictable where things did and didn't work, particularly with long sequence lengths. 0.7.5 (just released) should have squashed all of that and work consistently.

Nov 14 '22 20:11 cmikeh2
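For anyone pinning versions based on this thread, the reports above can be summarized as a small version check. This is only a sketch of what was said here: `parse_version` and `classify` are hypothetical helpers, 0.6.2 is never mentioned in the thread, and the releases between 0.7.0 and 0.7.4 were not individually confirmed, so the "reported affected" range is an assumption drawn from "every version from 0.6.3 onward".

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading major.minor.patch numbers, ignoring suffixes
    such as '+f2710bbe'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def classify(version: str) -> str:
    """Map a DeepSpeed version string to the status reported in this thread."""
    v = parse_version(version)
    if v >= (0, 7, 5):
        return "reported fixed"
    if v == (0, 7, 4):
        return "fixed for this case, regressions reported elsewhere"
    if v >= (0, 6, 3):
        # Assumption: not every release in this range was tested in the thread.
        return "reported affected"
    return "not reported affected"
```

In a live environment this could be called as `classify(deepspeed.__version__)`.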
