DeepSpeed
[BUG] transformer kernel: results misaligned with huggingface due to padding
Describe the bug
When the input is padded (part of the attention mask is set to -10000), the output of the DeepSpeed transformer kernel differs from the output of Hugging Face's BertLayer.
To Reproduce
ds_layer = DeepSpeedTransformerLayer(ds_config, all_weight, all_bias).cuda()
bert_layer = BertLayer(bert_config).cuda()
data = torch.rand((batch_size, seq_length, hidden_size), dtype=torch.float32).cuda()
mask = torch.ones((batch_size, 1, 1, seq_length), dtype=torch.float32) * -10000
mask[:, :, :, : seq_length // 2] = 0.0
# mask[:, :, :, : seq_length] = 0.0  # with no padding, the outputs are basically the same
mask = mask.cuda()
if fp16:
    data = data.half()
    ds_layer = ds_layer.half()
    bert_layer = bert_layer.half()
ds_output = ds_layer(data, mask)
bert_output = bert_layer(data, mask)
max_diff = torch.max(torch.abs(bert_output[0] - ds_output))
mean_diff = torch.mean(torch.abs(bert_output[0] - ds_output))
print(f"max_diff: {max_diff}")
print(f"mean_diff: {mean_diff}")
output:
max_diff: 0.09228515625
mean_diff: 0.0198822021484375
Expected behavior
The results should match Hugging Face's BertLayer when the input is padded.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [YES] ...... [NO]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/anaconda3/lib/python3.7/site-packages/torch']
torch version .................... 1.10.0a0+git36449ea
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/usr/local/anaconda3/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.8, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1
The config is:
ds_config = DeepSpeedTransformerConfig(
    batch_size=1,
    hidden_size=768,
    intermediate_size=768*4,
    heads=12,
    hidden_dropout_ratio=0,
    attn_dropout_ratio=0,
    num_hidden_layers=12,
    initializer_range=0.02,
    local_rank=0,
    layer_norm_eps=1e-12,
    fp16=fp16,
    training=True,
    pre_layer_norm=False,
    seq_length=128
)
Because padding changes the results, the output also differs whenever seq_length is not a multiple of 16, even if the original data has no padding: the DeepSpeed transformer kernel pads the input to a multiple of 16.
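To make the padding arithmetic concrete, here is a minimal sketch in plain Python. It assumes, as described above, that the kernel rounds the sequence length up to the next multiple of 16. The helper names `padded_length` and `additive_mask` are hypothetical and not part of DeepSpeed, and the sketch makes no claim about how the kernel masks the positions it adds internally.

```python
def padded_length(seq_length, multiple=16):
    """Round seq_length up to the next multiple of `multiple` (assumed to be 16)."""
    return ((seq_length + multiple - 1) // multiple) * multiple


def additive_mask(valid_length, total_length, masked_value=-10000.0):
    """Additive attention mask: 0.0 for real tokens, masked_value for padding."""
    return [0.0] * valid_length + [masked_value] * (total_length - valid_length)


# Sequence lengths used in this thread, plus one just past a multiple of 16.
for n in (7, 128, 129):
    print(n, "->", padded_length(n))
```

Under this assumption, seq_length=7 would be padded to 16, so nine positions would be padding, while seq_length=128 would need no extra padding.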
Could somebody help with this issue?
Hi @dancingpipi
I will look into this and will send a fix soon.
Best, Reza
Thank you very much for your attention to this issue. Looking forward to your fix.
@RezaYazdaniAminabadi If I may ask, is there any progress on this issue?
@dancingpipi, sorry, I was sick. I am back now and will take care of this. I will let you know once I have prepared the PR.
@RezaYazdaniAminabadi Take care, health first.
Hi @dancingpipi
I finally got some time to test this on my side. I modified your script a bit so that the transformer kernel is injected without passing the weights and biases to it; the injection automatically copies all the tensors needed to run the module with the transformer kernel. Here is the script:
import torch
import deepspeed
from deepspeed.ops import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from transformers import BertLayer, BertConfig
fp16 = False
ds_config = DeepSpeedTransformerConfig(
    batch_size=1,
    hidden_size=768,
    intermediate_size=768*4,
    heads=12,
    hidden_dropout_ratio=0,
    attn_dropout_ratio=0,
    num_hidden_layers=12,
    initializer_range=0.02,
    local_rank=0,
    layer_norm_eps=1e-12,
    fp16=fp16,
    training=True,
    pre_layer_norm=False
)
bert_config = BertConfig(hidden_size=ds_config.hidden_size,
                         num_hidden_layers=ds_config.num_hidden_layers,
                         num_attention_heads=ds_config.heads,
                         batch_size=ds_config.batch_size,
                         intermediate_size=ds_config.intermediate_size,
                         hidden_act="gelu",
                         hidden_dropout_prob=ds_config.hidden_dropout_ratio,
                         attention_probs_dropout_prob=ds_config.attn_dropout_ratio,
                         max_position_embeddings=128,
                         type_vocab_size=2,
                         initializer_range=ds_config.initializer_range,
                         fp16=ds_config.fp16)
bert_layer = BertLayer(bert_config).cuda()
ds_layer = deepspeed.module_inject.replace_transformer_layer(BertLayer, bert_layer, config=bert_config, training=True, fp16=fp16)
seq_length=7
data = torch.rand((1, seq_length, ds_config.hidden_size), dtype=torch.float32).cuda()
mask = torch.ones((1, 1, 1, seq_length), dtype=torch.float32) * -10000
mask[:, :, :, : seq_length // 2] = 0.0
# mask[:, :, :, : seq_length] = 0.0  # with no padding, the outputs are basically the same
mask = mask.cuda()
if fp16:
    data = data.half()
    mask = mask.half()
    ds_layer = ds_layer.half()
    bert_layer = bert_layer.half()
ds_output = ds_layer(data, mask)
bert_output = bert_layer(data, mask)
max_diff = torch.max(torch.abs(bert_output[0] - ds_output[0]))
mean_diff = torch.mean(torch.abs(bert_output[0] - ds_output[0]))
print(f"max_diff: {max_diff}")
print(f"mean_diff: {mean_diff}")
After running the test with both FP32 and FP16, I see very close results between the baseline and DeepSpeed:
FP32:
layer #0 is created with date type [float].
max_diff: 0.0007888078689575195
mean_diff: 0.0001782989565981552
FP16:
layer #0 is created with date type [half].
max_diff: 0.00390625
mean_diff: 0.0005054473876953125
I have made a small PR so that the injection can work with this test. Could you please try this on your side too? Thanks, Reza
@RezaYazdaniAminabadi Thanks a lot! I found that the diff was caused by my not converting the mask to fp16. Sorry for wasting your time.
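Since the cause turned out to be an fp32 mask passed to fp16 layers, a small check before the forward pass can catch this kind of mismatch. The sketch below is only an illustration: `dtype_mismatches` is a hypothetical helper, not a DeepSpeed or PyTorch API, and it relies only on each object exposing a `.dtype` attribute, as `torch.Tensor` does. The stand-in objects are there solely so the sketch runs without a GPU.

```python
from types import SimpleNamespace


def dtype_mismatches(reference, **others):
    """Return {name: dtype} for each input whose dtype differs from reference.dtype."""
    return {name: t.dtype for name, t in others.items() if t.dtype != reference.dtype}


# Stand-ins for tensors; real torch tensors expose .dtype the same way.
data = SimpleNamespace(dtype="float16")
mask = SimpleNamespace(dtype="float32")  # the mask was never converted with .half()

bad = dtype_mismatches(data, mask=mask)
print(bad)  # lists the mask, because its dtype differs from the data's
```

In the reproduction script, calling `dtype_mismatches(data, mask=mask)` right before `ds_layer(data, mask)` would flag the mask whenever `fp16` is True and `mask.half()` has been omitted.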