
[BUG] transformer kernel: results misaligned with huggingface due to padding

Open · dancingpipi opened this issue May 19 '22 · 9 comments

Describe the bug When padding is applied, the output of the transformer kernel differs from the output of Hugging Face's BertLayer.

To Reproduce

# ds_config, bert_config, all_weight, all_bias, batch_size, seq_length,
# hidden_size and fp16 are defined elsewhere (ds_config is shown in a comment below)
ds_layer = DeepSpeedTransformerLayer(ds_config, all_weight, all_bias).cuda()
bert_layer = BertLayer(bert_config).cuda()

data = torch.rand((batch_size, seq_length, hidden_size), dtype=torch.float32).cuda()
# additive attention mask: 0.0 for valid positions, -10000 for masked/padded ones
mask = torch.ones((batch_size, 1, 1, seq_length), dtype=torch.float32) * -10000
mask[:, :, :, : seq_length // 2] = 0.0
# mask[:, :, :, : seq_length] = 0.0    # with no masked positions, the outputs are basically the same
mask = mask.cuda()

if fp16:
    data = data.half()
    ds_layer = ds_layer.half()
    bert_layer = bert_layer.half()

ds_output = ds_layer(data, mask)
bert_output = bert_layer(data, mask)

max_diff = torch.max(torch.abs(bert_output[0] - ds_output))
mean_diff = torch.mean(torch.abs(bert_output[0] - ds_output))
print(f"max_diff: {max_diff}")
print(f"mean_diff: {mean_diff}")

output:
max_diff: 0.09228515625
mean_diff: 0.0198822021484375

Expected behavior The results should be aligned with Hugging Face when padding is applied.
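
As a rough way to express "aligned" programmatically, here is a sketch using the variables from the repro above; the tolerance values are my own assumptions, not figures from DeepSpeed or Hugging Face:

# illustrative tolerance check; atol values are assumptions, not documented guarantees
tol = 1e-2 if fp16 else 1e-3
print("aligned:", torch.allclose(bert_output[0], ds_output, atol=tol))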

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [YES] ...... [NO]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/anaconda3/lib/python3.7/site-packages/torch']
torch version .................... 1.10.0a0+git36449ea
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/usr/local/anaconda3/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.8, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1

dancingpipi · May 19 '22 06:05

The config is:

ds_config = DeepSpeedTransformerConfig(
        batch_size=1,
        hidden_size=768,
        intermediate_size=768*4,
        heads=12,
        hidden_dropout_ratio=0,
        attn_dropout_ratio=0,
        num_hidden_layers=12,
        initializer_range=0.02,
        local_rank=0,
        layer_norm_eps=1e-12,
        fp16=fp16,
        training=True,
        pre_layer_norm=False,
        seq_length=128
        )

Since padding leads to different results, the outputs also differ whenever seq_length is not a multiple of 16, even if the original data has no padding, because the DeepSpeed transformer kernel pads the input to a multiple of 16. A minimal sketch of what this means, and of how one could compare only the unmasked positions so that the kernel's internal padding does not skew the diff, is shown below (my own illustration, not part of the DeepSpeed API):
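
import math
import torch

def padded_seq_length(seq_length, multiple=16):
    # illustrative only: mirrors the "pad to a multiple of 16" behaviour described above
    return math.ceil(seq_length / multiple) * multiple

def masked_diff(ref, out, mask):
    # mask uses the additive convention: 0.0 for valid positions, -10000 for masked ones
    valid = mask[0, 0, 0] == 0.0                      # (seq_length,) boolean
    diff = torch.abs(ref[:, valid, :] - out[:, valid, :])
    return diff.max().item(), diff.mean().item()

# e.g. seq_length=100 would be processed internally as padded_seq_length(100) == 112
max_diff, mean_diff = masked_diff(bert_output[0], ds_output, mask)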

dancingpipi · May 19 '22 06:05

Could somebody help with this issue?

dancingpipi · May 30 '22 08:05

Hi @dancingpipi

I will look into this and will send a fix soon.

Best, Reza

RezaYazdaniAminabadi · Jun 06 '22 16:06

> Hi @dancingpipi
>
> I will look into this and will send a fix soon.
>
> Best, Reza

Thank you very much for your attention to this issue; I'm looking forward to your fix.

dancingpipi · Jun 07 '22 03:06

@RezaYazdaniAminabadi If I may ask, is there any progress on this issue?

dancingpipi · Jun 20 '22 10:06

@dancingpipi, sorry, I was sick; now I am back and will take care of this. I will let you know once I prepare the PR.

RezaYazdaniAminabadi · Jun 22 '22 17:06

@RezaYazdaniAminabadi Take care, health first.

dancingpipi · Jun 23 '22 02:06

Hi @dancingpipi

I finally got some time to test this on my side. I modified your script a bit so that the transformer kernel is injected without passing the weights and biases to it explicitly; the injection automatically copies all the tensors needed to run the module with the transformer kernel. Here is the script:


import torch
import deepspeed
from deepspeed.ops import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
from transformers import BertLayer, BertConfig

fp16 = False
ds_config = DeepSpeedTransformerConfig(
        batch_size=1,
        hidden_size=768,
        intermediate_size=768*4,
        heads=12,
        hidden_dropout_ratio=0,
        attn_dropout_ratio=0,
        num_hidden_layers=12,
        initializer_range=0.02,
        local_rank=0,
        layer_norm_eps=1e-12,
        fp16=fp16,
        training=True,
        pre_layer_norm=False
        )
bert_config = BertConfig(hidden_size=ds_config.hidden_size,
                         num_hidden_layers=ds_config.num_hidden_layers,
                         num_attention_heads=ds_config.heads,
                         batch_size=ds_config.batch_size,
                         intermediate_size=ds_config.intermediate_size,
                         hidden_act="gelu",
                         hidden_dropout_prob=ds_config.hidden_dropout_ratio,
                         attention_probs_dropout_prob=ds_config.attn_dropout_ratio,
                         max_position_embeddings=128,
                         type_vocab_size=2,
                         initializer_range=ds_config.initializer_range,
                         fp16=ds_config.fp16)

bert_layer = BertLayer(bert_config).cuda()

ds_layer = deepspeed.module_inject.replace_transformer_layer(BertLayer, bert_layer, config=bert_config, training=True, fp16=fp16)

seq_length=7
data = torch.rand((1, seq_length, ds_config.hidden_size), dtype=torch.float32).cuda()
mask = torch.ones((1, 1, 1, seq_length), dtype=torch.float32) * -10000
mask[:, :, :, : seq_length // 2] = 0.0 
# mask[:, :, :, : seq_length] = 0.0    # with no masked positions, the outputs are basically the same
mask = mask.cuda()

if fp16:
    data = data.half()
    mask = mask.half()
    ds_layer = ds_layer.half()
    bert_layer = bert_layer.half()

ds_output = ds_layer(data, mask)
bert_output = bert_layer(data, mask)

max_diff = torch.max(torch.abs(bert_output[0] - ds_output[0]))
mean_diff = torch.mean(torch.abs(bert_output[0] - ds_output[0]))
print(f"max_diff: {max_diff}")
print(f"mean_diff: {mean_diff}")

After running the test with both FP32 and FP16, I see very close results between baseline and DeepSpeed:


FP32:
layer #0 is created with date type [float].                                                                                                                                
max_diff: 0.0007888078689575195                                                                                                                                            
mean_diff: 0.0001782989565981552 
FP16:
layer #0 is created with date type [half].                                                                                                                                 
max_diff: 0.00390625                                                                                                                                                       
mean_diff: 0.0005054473876953125    

I have made a small PR so that the injection can work with this test. Could you please try this on your side too? Thanks, Reza

RezaYazdaniAminabadi · Jun 28 '22 16:06

@RezaYazdaniAminabadi Thanks a lot! I found that the cause of the diff was that I didn't cast the mask to fp16. Sorry for taking up your time.
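
For anyone hitting the same mismatch, the relevant part of the fp16 path (as in the script above) is to cast the mask together with the data and the layers:

if fp16:
    data = data.half()
    mask = mask.half()   # this cast was missing in the original repro
    ds_layer = ds_layer.half()
    bert_layer = bert_layer.half()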

dancingpipi · Jul 01 '22 02:07