[BUG] Incorrect logits on Bloom models
**Describe the bug**
DeepSpeed (DS) optimized BLOOM model inference produces incorrect logits, hurting overall model accuracy. The numerical differences on the final logits for the phrase "This is test", compared to pure Hugging Face (HF), are as follows:
| | avg L2 | rel L2 | avg L1 | max elem. diff |
|---|---|---|---|---|
| bloom-560m DS fp16 vs HF fp16 | 750.240 | 0.00762 | 16.892 | 110.968 |
| bloom-560m DS fp32 vs HF fp32 | 4.1275 | 0.00004 | 1.3686 | 10.054 |
| bloom-560m HF fp16 vs HF fp32 | 0.006396 | 6.50344e-08 | 0.0602 | 0.28771 |
Essentially, the numerical difference between the FP32 DS and HF versions is worse (more different) than the difference between the FP16 and FP32 versions of the same model run through pure HF/PyTorch.
Here the following formulas are used:
- avg L2 diff is computed as $\frac{1}{N}\sum_i (x_i-y_i)^2$
- rel L2 diff as $\frac{\sum_i (x_i-y_i)^2}{\sum_i y_i^2}$
- avg L1 diff is computed as $\frac{1}{N}\sum_i |x_i-y_i|$

where $x$ is the output of model 1 (DS or HF), $y$ is the output of model 2 (ground truth), and $N$ is the number of logit entries.
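For reference, the same metrics written as standalone PyTorch functions (a sketch; the function names are mine, and `x`/`y` are float32 logit tensors of equal shape as defined above):

```python
import torch

def avg_l2(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # (1/N) * sum_i (x_i - y_i)^2
    return (x - y).pow(2).mean()

def rel_l2(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # sum_i (x_i - y_i)^2 / sum_i y_i^2
    return (x - y).pow(2).sum() / y.pow(2).sum()

def avg_l1(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # (1/N) * sum_i |x_i - y_i|
    return (x - y).abs().mean()
```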
This bug might be related to issue #2729; please merge the issues if appropriate.
**To Reproduce**
To reproduce the behavior, run the following script with different options. The script simply runs the DeepSpeed version of the model and compares its output against the pure Hugging Face version.
Script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed
import torch
from deepspeed.module_inject.replace_policy import BLOOMLayerPolicy
import transformers


def simple_output_comparision(model_id, dtype1, dtype2, use_dp=False, use_policy=True):
    device0 = "cuda:0"
    device1 = "cuda:1"
    # Model 1: optionally wrapped with DeepSpeed inference
    model1 = AutoModelForCausalLM.from_pretrained(model_id).to(device0, dtype=dtype1)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if use_dp:
        model1 = deepspeed.init_inference(
            model1,
            mp_size=1,
            dtype=dtype1,
            replace_method='auto',
            injection_policy={transformers.models.bloom.modeling_bloom.BloomBlock: BLOOMLayerPolicy} if use_policy else {},
            replace_with_kernel_inject=True)
    # Model 2: pure Hugging Face reference on a second GPU
    model2 = AutoModelForCausalLM.from_pretrained(model_id).to(device1, dtype=dtype2)

    test_input = """This is test """
    encodings = tokenizer(test_input, return_tensors="pt")
    input_ids_m1 = encodings.input_ids.to(device0)
    output_logits_m1 = model1(input_ids_m1)['logits'].cpu().to(dtype=torch.float32)
    input_ids_m2 = input_ids_m1.to(device1)
    output_logits_m2 = model2(input_ids_m2)['logits'].cpu().to(dtype=torch.float32)

    # Metrics as defined above (model 2 is treated as ground truth)
    relative_avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum() / output_logits_m2.pow(2).sum()
    avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).mean()
    avg_l1 = (output_logits_m1 - output_logits_m2).abs().mean()
    max_l1 = (output_logits_m1 - output_logits_m2).abs().max()
    print(f"L2^2: avg_l2={avg_l2:.8f} avg_relative_l2={relative_avg_l2:.8f} (in scientific notation: {relative_avg_l2:.5e})")
    print(f"L1: avg_l1={avg_l1:.8f} max_elementwise_l1={max_l1:.8f}")


# Baseline: HF fp16 vs HF fp32 (no DeepSpeed)
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float16, dtype2=torch.float32, use_dp=False, use_policy=False)
```
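For the DS rows in the table above, the corresponding calls should look roughly like this (a sketch; whether the injection policy was passed explicitly for those runs is an assumption, and the baseline HF fp16 vs HF fp32 comparison is already the last call in the script):

```python
# Hypothetical calls matching the first two table rows: model 1 is wrapped with
# deepspeed.init_inference (use_dp=True), model 2 stays pure Hugging Face.
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float16,
                          dtype2=torch.float16, use_dp=True)   # DS fp16 vs HF fp16
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float32,
                          dtype2=torch.float32, use_dp=True)   # DS fp32 vs HF fp32
```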
**Expected behavior**
We expect numerically close outputs, say a relative L2 difference of around $10^{-5}$ or less for the FP32 comparison (ideally 0.0). Instead, we see a huge discrepancy.
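Concretely, an acceptance check along the following lines (a sketch; the $10^{-5}$ threshold is just the expectation stated above, and `output_logits_m1`/`output_logits_m2` are the tensors from the script) currently fails for the DS fp32 vs HF fp32 case:

```python
# Sketch of the tolerance we would consider acceptable for the FP32 comparison;
# the 1e-5 threshold reflects the expectation above, not a DeepSpeed guarantee.
rel_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum() / output_logits_m2.pow(2).sum()
assert rel_l2 < 1e-5, f"rel L2 {rel_l2:.3e} exceeds the expected 1e-5 tolerance"
```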
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.1+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.5+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```
**System info (please complete the following information):**
- OS: Ubuntu
- GPU: A10G x8
- Python 3.8.10
Oops, I added the wrong "compression" label; it should be "inference".
https://github.com/microsoft/DeepSpeed/pull/2851 should fix this issue.