[BUG] Incorrect logits on Bloom models
**Describe the bug**
DeepSpeed (DS) optimized BLOOM model inference produces incorrect logits, hurting overall model accuracy. The numerical differences on the final logits for the phrase "This is test", compared to pure Hugging Face (HF), are as follows:
| | avg L2 | rel L2 | avg L1 | max elem. diff |
|---|---|---|---|---|
| bloom-560m DS fp16 vs HF fp16 | 750.240 | 0.00762 | 16.892 | 110.968 |
| bloom-560m DS fp32 vs HF fp32 | 4.1275 | 0.00004 | 1.3686 | 10.054 |
| bloom-560m HF fp16 vs HF fp32 | 0.006396 | 6.50344e-08 | 0.0602 | 0.28771 |
Essentially, the numerical difference between the FP32 DS and HF versions is worse (more different) than the difference between the FP16 and FP32 versions of the same model run through pure HF/PyTorch.
Here the following formulas are used:
- avg L2 diff is computed as $\frac{1}{N}\sum_i (x_i-y_i)^2$
- rel L2 diff as $\frac{\sum_i (x_i-y_i)^2}{\sum_i y_i^2}$
- avg L1 diff is computed as $\frac{1}{N}\sum_i |x_i-y_i|$

where $x$ is the output of model 1 (DS or HF), $y$ is the output of model 2 (ground truth), and $N$ is the number of logit entries.
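For reference, the same metrics written as standalone PyTorch functions (a sketch; the function names are mine, and `x`/`y` are float32 logit tensors of equal shape as defined above):

```python
import torch

def avg_l2(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # (1/N) * sum_i (x_i - y_i)^2
    return (x - y).pow(2).mean()

def rel_l2(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # sum_i (x_i - y_i)^2 / sum_i y_i^2
    return (x - y).pow(2).sum() / y.pow(2).sum()

def avg_l1(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # (1/N) * sum_i |x_i - y_i|
    return (x - y).abs().mean()
```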
This bug might be related to issue #2729; please merge the issues if appropriate.
**To Reproduce**
To reproduce the behavior, run the following script with different options. The script simply runs the DeepSpeed version of the model and compares its output against the pure Hugging Face version.
Script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed
import torch
from deepspeed.module_inject.replace_policy import BLOOMLayerPolicy
import transformers


def simple_output_comparision(model_id, dtype1, dtype2, use_dp=False, use_policy=True):
    device0 = "cuda:0"
    device1 = "cuda:1"
    # Model 1: optionally wrapped with DeepSpeed inference
    model1 = AutoModelForCausalLM.from_pretrained(model_id).to(device0, dtype=dtype1)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if use_dp:
        model1 = deepspeed.init_inference(
            model1,
            mp_size=1,
            dtype=dtype1,
            replace_method='auto',
            injection_policy={transformers.models.bloom.modeling_bloom.BloomBlock: BLOOMLayerPolicy} if use_policy else {},
            replace_with_kernel_inject=True)
    # Model 2: pure Hugging Face reference on a second GPU
    model2 = AutoModelForCausalLM.from_pretrained(model_id).to(device1, dtype=dtype2)

    test_input = """This is test """
    encodings = tokenizer(test_input, return_tensors="pt")
    input_ids_m1 = encodings.input_ids.to(device0)
    output_logits_m1 = model1(input_ids_m1)['logits'].cpu().to(dtype=torch.float32)
    input_ids_m2 = input_ids_m1.to(device1)
    output_logits_m2 = model2(input_ids_m2)['logits'].cpu().to(dtype=torch.float32)

    # Metrics as defined above (model 2 is treated as ground truth)
    relative_avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum() / output_logits_m2.pow(2).sum()
    avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).mean()
    avg_l1 = (output_logits_m1 - output_logits_m2).abs().mean()
    max_l1 = (output_logits_m1 - output_logits_m2).abs().max()
    print(f"L2^2: avg_l2={avg_l2:.8f} avg_relative_l2={relative_avg_l2:.8f} (in scientific notation: {relative_avg_l2:.5e})")
    print(f"L1: avg_l1={avg_l1:.8f} max_elementwise_l1={max_l1:.8f}")


# Baseline: HF fp16 vs HF fp32 (no DeepSpeed)
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float16, dtype2=torch.float32, use_dp=False, use_policy=False)
```
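For the DS rows in the table above, the corresponding calls should look roughly like this (a sketch; whether the injection policy was passed explicitly for those runs is an assumption, and the baseline HF fp16 vs HF fp32 comparison is already the last call in the script):

```python
# Hypothetical calls matching the first two table rows: model 1 is wrapped with
# deepspeed.init_inference (use_dp=True), model 2 stays pure Hugging Face.
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float16,
                          dtype2=torch.float16, use_dp=True)   # DS fp16 vs HF fp16
simple_output_comparision("bigscience/bloom-560m", dtype1=torch.float32,
                          dtype2=torch.float32, use_dp=True)   # DS fp32 vs HF fp32
```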
**Expected behavior**
We expect numerically close outputs, say a relative L2 difference of around $10^{-5}$ or less for the FP32 comparison (ideally 0.0). Instead, we see a huge discrepancy.
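Concretely, an acceptance check along the following lines (a sketch; the $10^{-5}$ threshold is just the expectation stated above, and `output_logits_m1`/`output_logits_m2` are the tensors from the script) currently fails for the DS fp32 vs HF fp32 case:

```python
# Sketch of the tolerance we would consider acceptable for the FP32 comparison;
# the 1e-5 threshold reflects the expectation above, not a DeepSpeed guarantee.
rel_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum() / output_logits_m2.pow(2).sum()
assert rel_l2 < 1e-5, f"rel L2 {rel_l2:.3e} exceeds the expected 1e-5 tolerance"
```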
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.1+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.5+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```
**System info (please complete the following information):**
- OS: Ubuntu
- GPU: A10G x8
- Python 3.8.10
Oops, I added the wrong "compression" label; it should be "inference".
https://github.com/microsoft/DeepSpeed/pull/2851 should fix this issue.