Converted BART model is slower than the original one during inference
Hi there, I have a project that uses Facebook BART for news summarization. To make inference faster, we are trying to convert part of the model to TensorRT and then integrate it back into the original model. Via this repo, I have successfully converted the Facebook BART decoder layers to a TensorRT model and integrated it. However, the total inference time for the generated tokens of the new BART model (i.e. the model integrated with the converted TensorRT decoder layers) is 2 times slower than the original one. So I tried to find out why, and I found that the new BART model itself is faster than the original one; see the code below. line1 is faster than before after switching to the new BART model, but it becomes much slower after line2.
line1: outputs = self(model_inputs, return_dict=True)
line2: next_token_logits = outputs.logits[:, -1, :]
line3: next_token_logits = self.adjust_logits_during_generation(
line4:     next_token_logits, cur_len=cur_len, max_length=max_length)
Below you can find the speed comparison of the new BART model and the original one (corresponding to the results of code line1 above).
Below you can find the speed comparison of the new BART model and the original one (corresponding to the results of the code after line2 above).
Does anyone know why it becomes slow after line1 in the code above?
Hi, could you please provide the model, data and test script?
Hi grimoire, I made some investigation into this problem and found that there are some GPU-to-CPU operations during inference (i.e. during beam search) that make the whole process slow, for example tensor.tolist or boolean checks in if statements (like if tensor: ...). I have pushed a demo project here: demo project. My env is as below:
Cuda compilation tools, release 10.0, V10.0.130
tensorrt.__version__: 7.0.0.11
torch.__version__: 1.3.0
You can test by running python issue.py directly, and you should see output as below. As you can see, if the TensorRT model is run directly it is faster than the original model, but if it is followed by some GPU-to-CPU operation it becomes much slower than the original model. How should I solve this problem?
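To illustrate the kind of implicit GPU-to-CPU synchronization I mean, here is a minimal self-contained sketch (the linear layer and the shapes are just stand-ins, not part of the demo project):

import time
import torch

x = torch.randn(10, 64, 1024, device='cuda')
linear = torch.nn.Linear(1024, 1024).cuda().eval()

torch.cuda.synchronize()

# without a host access, the forward call only launches kernels and returns immediately
start = time.time()
with torch.no_grad():
    y = linear(x)
print('launch only      :', time.time() - start)

# the first op that needs the value on the CPU (tolist / item / `if tensor:`)
# copies to the host and blocks until all queued kernels have finished,
# so this line absorbs the apparent inference time
start = time.time()
_ = bool((y > 0).any())
print('first host access:', time.time() - start)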
If you need more information, please let me know.
Ok, I will let you know when I find something. I am not a pro at NLP, so this might take some time.
And... it is better to add torch.cuda.synchronize() when you measure inference time on GPU.
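For reference, a minimal sketch of what I mean (model and inputs here are just placeholders):

import time
import torch

def timed_forward(model, inputs, n_warmup=5, n_runs=100):
    # warm-up so one-time setup cost is excluded from the measurement
    with torch.no_grad():
        for _ in range(n_warmup):
            model(inputs)

    torch.cuda.synchronize()  # make sure nothing is still queued
    start = time.time()
    with torch.no_grad():
        for _ in range(n_runs):
            model(inputs)
    torch.cuda.synchronize()  # wait for all launched kernels to finish
    return (time.time() - start) / n_runs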
Ok, thanks for your suggestion. I will try it and do more investigation in the meantime.
I just updated the demo project repository and added the method I used to convert to the TensorRT model; it is in the issue.py file with the method name decoderlayer_convertor_dynamic. I tested with torch.cuda.synchronize() again, and the result is as below; it looks like the converted model is slower than the original one.
I have also converted with another repo called torch2trt, but with no dynamic shape supported; the converted TensorRT model with a fixed input shape is 6 times faster, but that does not work for me, because the decoding process has to deal with dynamic shapes.
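For reference, the plain TensorRT Python API handles dynamic shapes through optimization profiles; a rough sketch of what I mean by dynamic shape support (TensorRT 7.x; the ONNX file name, input name and min/opt/max shapes are just placeholders, not the actual decoder-layer inputs):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('decoder_layer.onnx', 'rb') as f:  # placeholder ONNX export of one layer
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # more workspace lets TensorRT try more tactics

# one min/opt/max range per dynamic input; the name and shapes are placeholders
profile = builder.create_optimization_profile()
profile.set_shape('hidden_states',
                  min=(1, 1, 1024), opt=(10, 64, 1024), max=(10, 128, 1024))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)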
Are you using this repo to accelerate the decoder?
FP16 mode can boost speed by 2.7 times, but the results are different. I want to dig into it. Where is the forward() of the decoder?
I use this repo to accelerate each decoder layer separately instead of the decoder; each decoder has 12 decoder layers.
The decoder layer is in the file trt_issue/transformers/models/bart/modeling_bart.py under
class BartDecoderLayer(nn.Module)
If you need to check the decoder, it is in class BartDecoder(BartPretrainedModel) in the same file.
I have tried to use FP16 to accelerate, but the error from the lower precision is unacceptable; I am not sure why FP16 makes such a big difference in the outputs.
The error with FP32 is as below.
When changing to FP16, the error is as below, which leads to a totally different output.
The output error is calculated via
print(torch.max(torch.abs(y - y_trt)))
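For reference, a baseline for how much error fp16 by itself introduces (without TensorRT) can be obtained by running the same module in half precision in plain PyTorch; a minimal sketch with a hypothetical stand-in layer:

import torch
import torch.nn as nn

# hypothetical stand-in for a single projection inside a decoder layer
layer = nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(10, 64, 1024, device='cuda')

with torch.no_grad():
    y_fp32 = layer(x)
    y_fp16 = layer.half()(x.half()).float()  # same weights, half precision

print(torch.max(torch.abs(y_fp32 - y_fp16)))  # expected order of magnitude of fp16 error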
The difference between the fp16 and fp32 outputs might be inevitable. The significand precision of fp16 is 10 bits and the exponent is 5 bits. That limits the precision of the model. I tested the fp16 converter on a model with a single fc layer:
def forward(self, hidden_states):
    self.q_proj.bias = None
    query_states = self.q_proj(hidden_states)  # * self.scaling
    query_states = query_states * 10  # enlarge the diff for visualization
    return query_states
q_proj is the linear layer that comes from BartAttention. The output is multiplied by 10 to enlarge the diff (for visualization). fp32 (PyTorch) and fp16 results below:
max_diff: tensor(0.5802, device='cuda:0')
torch max: tensor(144.1834, device='cuda:0') torch_min: tensor(-145.8305, device='cuda:0')
trt_max: tensor(144.3750, device='cuda:0') trt_min: tensor(-145.7500, device='cuda:0')
The 5-bit exponent cannot give enough precision. The precision is OK in most CV tasks, because the intermediate values are very small (I guess).
Still working on it.
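A quick way to see this limit (fp16 representation alone, no TensorRT involved): around magnitude ~144 the 10-bit significand only leaves steps of 0.125 between representable values, and accumulation inside a matmul makes the per-element error larger still. For example:

import torch

x = torch.tensor([144.1834, -145.8305])
print(x.half())                        # tensor([ 144.1250, -145.8750], dtype=torch.float16)
print(torch.finfo(torch.float16).eps)  # 0.0009765625, the relative step of the 10-bit significand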
Hello grimoire, do you have any progress on this issue? I have been investigating these days, but I am still stuck.
Sorry, I still do not have any solution.
And, by the way, have you tried to increase max_workspace_size? Some tactics need more workspace to accelerate.
Hello grimoire, I made some tests; increasing the workspace size gains some acceleration, but very little. Please check the results below.

I set max_workspace_size=1<<30 and measure the time as below:
num_test = 100
raw_times = []
tensorrt_times = []

# raw test
## warmup
if first_token:
    y = decoder_layer(
        decoder_layer_hidden_states,
        encoder_hidden_states=de_encoder_hidden_states,
        encoder_attention_mask=de_encoder_layer_attention_mask)
else:
    y = decoder_layer(
        decoder_layer_hidden_states,
        encoder_hidden_states=de_encoder_hidden_states,
        encoder_attention_mask=de_encoder_layer_attention_mask,
        attention_mask=decoder_layer_attention_mask)

for _ in range(num_test):
    start = time.time()
    torch.cuda.synchronize(device)
    with torch.no_grad():
        if first_token:
            y = decoder_layer(
                decoder_layer_hidden_states,
                encoder_hidden_states=de_encoder_hidden_states,
                encoder_attention_mask=de_encoder_layer_attention_mask)
        else:
            y = decoder_layer(
                decoder_layer_hidden_states,
                encoder_hidden_states=de_encoder_hidden_states,
                encoder_attention_mask=de_encoder_layer_attention_mask,
                attention_mask=decoder_layer_attention_mask)
    torch.cuda.synchronize(device)
    end = time.time()
    raw_time = end - start
    raw_times.append(raw_time)

# trt_test
## warmup
if first_token:
    y_trt = decoder_layer_tensorrt(decoder_layer_hidden_states,
                                   de_encoder_hidden_states,
                                   de_encoder_layer_attention_mask)
else:
    y_trt = decoder_layer_tensorrt(decoder_layer_hidden_states,
                                   de_encoder_hidden_states,
                                   de_encoder_layer_attention_mask,
                                   decoder_layer_attention_mask)

for _ in range(num_test):
    start = time.time()
    torch.cuda.synchronize(device)
    with torch.no_grad():
        if first_token:
            y_trt = decoder_layer_tensorrt(
                decoder_layer_hidden_states, de_encoder_hidden_states,
                de_encoder_layer_attention_mask)
        else:
            y_trt = decoder_layer_tensorrt(
                decoder_layer_hidden_states, de_encoder_hidden_states,
                de_encoder_layer_attention_mask,
                decoder_layer_attention_mask)
    torch.cuda.synchronize(device)
    end = time.time()
    tensorrt_time = end - start
    tensorrt_times.append(tensorrt_time)

# per-iteration speedup: > 1 means the TensorRT layer is faster
times = [raw_time / tensorrt_time
         for raw_time, tensorrt_time in zip(raw_times, tensorrt_times)]
This gives me:
(base) amirstan@grimoire:~/space/tmp/trt_issue$ python decoder_test.py
### load model
### decoderlayer_convertor_dynamic
### convert model
### begin test
raw model time is 0.0041730904579162596
tesnorrt model time is 0.003737857341766357
coresponding tensorrt model is 1.1178166823058437 times faster
diff: tensor(1.5676e-05, device='cuda:0')
tensor(3.7592, device='cuda:0') tensor(-6.1854, device='cuda:0')
tensor(3.7592, device='cuda:0') tensor(-6.1854, device='cuda:0')
(base) amirstan@grimoire:~/space/tmp/trt_issue$ python decoder_test.py
### load model
### decoderlayer_convertor_dynamic
### convert model
### begin test
raw model time is 0.0041653323173522945
tesnorrt model time is 0.0043683481216430665
coresponding tensorrt model is 0.9556406460135142 times faster
diff: tensor(3.2518e-05, device='cuda:0')
tensor(4.2255, device='cuda:0') tensor(-6.7358, device='cuda:0')
tensor(4.2255, device='cuda:0') tensor(-6.7358, device='cuda:0')
nearly 1:1, as expected.
And this is the log of trtexec:
[02/09/2021-01:05:49] [I] Host Latency
[02/09/2021-01:05:49] [I] min: 4.35699 ms (end to end 4.40656 ms)
[02/09/2021-01:05:49] [I] max: 8.0246 ms (end to end 8.38818 ms)
[02/09/2021-01:05:49] [I] mean: 4.47638 ms (end to end 6.95823 ms)
[02/09/2021-01:05:49] [I] median: 4.4086 ms (end to end 6.90234 ms)
[02/09/2021-01:05:49] [I] percentile: 5.11047 ms at 99% (end to end 8.21118 ms at 99%)
[02/09/2021-01:05:49] [I] throughput: 0 qps
[02/09/2021-01:05:49] [I] walltime: 2.95702 s
[02/09/2021-01:05:49] [I] Enqueue Time
[02/09/2021-01:05:49] [I] min: 1.23148 ms
[02/09/2021-01:05:49] [I] max: 7.44507 ms
[02/09/2021-01:05:49] [I] median: 1.854 ms
[02/09/2021-01:05:49] [I] GPU Compute
[02/09/2021-01:05:49] [I] min: 3.48407 ms
[02/09/2021-01:05:49] [I] max: 6.95587 ms
[02/09/2021-01:05:49] [I] mean: 3.59472 ms
[02/09/2021-01:05:49] [I] median: 3.52832 ms
[02/09/2021-01:05:49] [I] percentile: 4.20325 ms at 99%
[02/09/2021-01:05:49] [I] total compute time: 2.94048 s
&&&& PASSED TensorRT.trtexec # trtexec --explicitBatch --shapes=input_0:10x64x1024,input_1:10x128x1024,input_2:10x1x64x128,input_3:10x1x64x64 --loadEngine=tmp/trt_issue/tmp.engine --plugins=amirstan_plugin/build/lib/libamirstan_plugin.so --verbose