
how to speed up inference with dynamic shape inputs

yang9112 opened this issue on Sep 21 '22 · 7 comments

When I run BERT inference with TRT, I found it hard to change the input shape through the execution context.

Problem: I want to set a different shape for each input, but the following code is too slow. If I instead set the shape once before the loop and give the context the max shape, it always runs at the max shape, which slows down inference.

ENV: CUDA 11.4 + TensorRT 8.4.3.1


with open(module_name, 'rb') as f, trt.Runtime(G_LOGGER) as runtime:
    self.engine = runtime.deserialize_cuda_engine(f.read())

origin_inputshape = (1, args.max_sequence_len)
self.inputs, self.outputs, self.bindings, self.stream = common.allocate_buffers(self.engine)

# Texts with different lengths.
for text in texts:
    encode_inputs = self.tokenizer.encode_plus(text,
                                               add_special_tokens=True,
                                               max_length=max_seq_length,
                                               padding='max_length',
                                               return_attention_mask=True,
                                               truncation=True,
                                               return_tensors="np")
    self.inputs[0].host = encode_inputs['input_ids']
    self.inputs[1].host = encode_inputs['attention_mask']
    self.inputs[2].host = encode_inputs['token_type_ids']

    # I want to set a different shape for each input, but creating the
    # execution context and setting the binding shapes inside the loop is
    # too slow. If I instead set the max shape once before the loop, it
    # always runs at the max shape, which is also slow.
    input_shape = encode_inputs['input_ids'].shape
    context = self.engine.create_execution_context()
    context.active_optimization_profile = 0
    context.set_binding_shape(0, input_shape)  # input_ids
    context.set_binding_shape(1, input_shape)  # attention_mask
    context.set_binding_shape(2, input_shape)  # token_type_ids

    outputs = common.do_inference_v2(context, bindings=self.bindings,
                                     inputs=self.inputs, outputs=self.outputs,
                                     stream=self.stream)

yang9112 commented on Sep 21 '22

That's because when dynamic shapes are enabled and you specify a new binding shape for a context, TRT has to run shape inference at the first inference to deduce the shapes of all layers. If you repeat the inference with the same shape, you should observe the inference time dropping back to normal.

zerollzeng commented on Sep 22 '22
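Concretely, that suggests hoisting create_execution_context out of the loop and only calling set_binding_shape per text. A minimal sketch along those lines, reusing the names from the snippet above (padding='max_length' is dropped so the shapes actually vary):

context = self.engine.create_execution_context()  # create once, not per text
context.active_optimization_profile = 0

for text in texts:
    encode_inputs = self.tokenizer.encode_plus(text,
                                               add_special_tokens=True,
                                               truncation=True,
                                               max_length=max_seq_length,
                                               return_attention_mask=True,
                                               return_tensors="np")
    self.inputs[0].host = encode_inputs['input_ids']
    self.inputs[1].host = encode_inputs['attention_mask']
    self.inputs[2].host = encode_inputs['token_type_ids']

    # Cheap per-iteration calls; the first run after a *new* shape pays a
    # one-time shape-inference cost inside TRT, repeat runs are fast.
    input_shape = encode_inputs['input_ids'].shape
    for i in range(3):  # input_ids, attention_mask, token_type_ids
        context.set_binding_shape(i, input_shape)

    outputs = common.do_inference_v2(context, bindings=self.bindings,
                                     inputs=self.inputs, outputs=self.outputs,
                                     stream=self.stream)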


Question: when dynamic shape is enabled, how can I use a fixed context shape (1x512) to speed up inference on short text?

yang9112 commented on Sep 23 '22

Just use 1x512 as the opt shape when building the engine.

zerollzeng commented on Sep 23 '22
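For reference, the opt shape is set at build time through an optimization profile. A minimal sketch, assuming the model comes from an ONNX file ('model.onnx' is a hypothetical name) with the three BERT inputs named as in the code above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('model.onnx', 'rb') as f:  # hypothetical input file
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
for name in ('input_ids', 'attention_mask', 'token_type_ids'):
    # set_shape(name, min, opt, max): min 1x1, opt 1x512, max 1x512
    profile.set_shape(name, (1, 1), (1, 512), (1, 512))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open('pytorch_model_fp16.trt', 'wb') as f:
    f.write(serialized)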


It doesn't seem to work:

| context shape | min/opt/max shape | input length | TPS (samples/second) |
| --- | --- | --- | --- |
| 1x512 | 1x1 / 1x512 / 1x512 | 155 | 110 |
| 1x256 | 1x1 / 1x512 / 1x512 | 155 | 223 |
import pycuda.driver as cuda
import tensorrt as trt
import pycuda.autoinit
from scipy.special import softmax
from transformers import BertTokenizerFast
import common
from common import HostDeviceMem

module_name = 'pytorch_model_fp16.trt'
G_LOGGER = trt.Logger(trt.Logger.WARNING)

def allocate_buffer(engine, batch_size, input_shape, output_shape):
    # Allocate device buffers for all bindings; `batch_size` is unused here.
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        # Size the input buffers from the requested input_shape rather than
        # the engine's (dynamic) binding shape.
        input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        if engine.binding_is_input(binding):
            if input_shape[0] == 1:
                input_nbytes = 2 * input_nbytes  # extra headroom for batch-1 inputs
            device_mem = cuda.mem_alloc(input_nbytes)
            inputs.append(HostDeviceMem(None, device_mem))
        else:
            host_mem = cuda.pagelocked_empty(output_shape[0] * output_shape[1], dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            outputs.append(HostDeviceMem(host_mem, device_mem))
        bindings.append(int(device_mem))
    return inputs, outputs, bindings, stream


def main():
    max_seq_length = 512

    with open(module_name, 'rb') as f, trt.Runtime(G_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        origin_inputshape = (1, max_seq_length)
        inputs, outputs, bindings, stream = allocate_buffer(engine, None, 
                                                            input_shape=origin_inputshape, 
                                                            output_shape=(1,2))
        # Set to 1x256 or 1x512 to produce the two rows in the table above.
        input_shape = origin_inputshape
        context = engine.create_execution_context()
        context.active_optimization_profile = 0
        context.set_binding_shape(0, (input_shape))
        context.set_binding_shape(1, (input_shape))
        context.set_binding_shape(2, (input_shape))

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

    text = '一博| 愿你在岁月中更加平湖静月,衣襟生花,你好你坏我们都在'
    encode_inputs = tokenizer.encode_plus(text,
                                            add_special_tokens=True,
                                            max_length=max_seq_length,
                                            padding='max_length',
                                            return_attention_mask=True,
                                            truncation=True,
                                            return_tensors="np")

    from tqdm import tqdm

    for i in tqdm(range(1000)):
        inputs[0].host = encode_inputs['input_ids']
        inputs[1].host = encode_inputs['attention_mask']
        inputs[2].host = encode_inputs['token_type_ids']
        results = common.do_inference_v2(context, bindings=bindings, 
                                        inputs=inputs, outputs=outputs, stream=stream)
        logits = results[0]
        probabilities = softmax(logits, axis=0)


if __name__ == '__main__':
    main()

yang9112 commented on Sep 23 '22

When dynamic shapes are enabled, TRT selects kernel tactics that have the best performance while being valid for all input shapes between the min shape and the max shape. So the best choice is to set the opt shape to the most frequently used shape in your scenario, so that TRT has the best overall performance.

zerollzeng commented on Sep 23 '22


I have tested TRT with the code above, but I found no performance difference between opt shapes (1x256 and 1x512) on the same short text.

When dynamic shape is enabled, why doesn't shorter text get better performance?

If kernel tactics are selected based on the context shape, should I define multiple contexts in one engine and select a different context for different text lengths?

yang9112 commented on Sep 23 '22

your "shorter text shape" != opt shape right? trt only make sure the kernel is able to run with the "shorter text shape" but don't guarantee its performance. only optimize performance for the opt shape.

zerollzeng commented on Sep 24 '22
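On the earlier question about multiple contexts: a single engine can carry several optimization profiles, each with its own opt shape, and each execution context can pin a different profile. A hedged sketch, reusing the builder/config setup from the build example earlier (the bucket sizes and the pick_context helper are illustrative, not from the thread):

# Build time: one profile per sequence-length bucket.
for opt_len in (128, 512):
    profile = builder.create_optimization_profile()
    for name in ('input_ids', 'attention_mask', 'token_type_ids'):
        profile.set_shape(name, (1, 1), (1, opt_len), (1, opt_len))
    config.add_optimization_profile(profile)

# Run time: one context per profile; route each text by its token count.
contexts = []
for i in range(engine.num_optimization_profiles):
    ctx = engine.create_execution_context()
    ctx.active_optimization_profile = i
    contexts.append(ctx)

def pick_context(seq_len):
    # Illustrative router: profile 0 covers up to 128 tokens.
    return contexts[0] if seq_len <= 128 else contexts[1]

# Caveat: with multiple profiles every profile has its own binding slots,
# i.e. binding index = profile_index * (engine.num_bindings //
# engine.num_optimization_profiles) + index_within_profile, so the device
# buffers must be bound per profile.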

your "shorter text shape" != opt shape right? trt only make sure the kernel is able to run with the "shorter text shape" but don't guarantee its performance. only optimize performance for the opt shape.

So the context shape does not need to be the same as the input shape?

Another question: when the context shape (binding shape) is set smaller, engine inference gets faster. Is the input being truncated because the context shape is smaller?

| context shape | input buffer | input shape | TPS |
| --- | --- | --- | --- |
| 1x128 | 1x512 | 1x155 | 300 |
| 1x256 | 1x512 | 1x155 | 240 |
| 1x512 | 1x512 | 1x155 | 130 |

yang9112 commented on Sep 26 '22

Please refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes

> Is the input being truncated because the context shape is smaller?

Yes, it always uses the binding shape as the input shape, e.g. if you set the binding shape to 1x256 but you have an input buffer of 1x100000, it will only take the first 1x256 elements as input and run the inference.

I feel like you have some misunderstanding of TRT's dynamic shapes; I would suggest reading our developer guide first :)

zerollzeng commented on Sep 26 '22
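To make the truncation behavior concrete, here is a small sketch against the benchmark code above. The 1x512 buffers stay allocated; only the binding shape shrinks, and the slice in the final comment is the equivalent explicit form:

seq_len = 256  # pretend the real text is 256 tokens long
for i in range(3):
    context.set_binding_shape(i, (1, seq_len))

# Host buffers still hold 1x512 padded arrays; with the binding shape set
# to 1x256, TRT only consumes the first 256 values of each input.
inputs[0].host = encode_inputs['input_ids']        # shape (1, 512)
inputs[1].host = encode_inputs['attention_mask']
inputs[2].host = encode_inputs['token_type_ids']
results = common.do_inference_v2(context, bindings=bindings,
                                 inputs=inputs, outputs=outputs, stream=stream)
# Equivalent explicit form: inputs[0].host = encode_inputs['input_ids'][:, :256]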


thx

yang9112 commented on Sep 26 '22