TensorRT
how to speed up inference with dynamic shape inputs
When I do BERT inference with TensorRT, I find it hard to change the input shape of the execution context.
Problem: I want to set a different shape for each input, but the following code is too slow. If I instead set the shape once before the loop and use the max shape for the context, every input is run at the max shape, which also slows down inference.
ENV: CUDA 11.4 + TensorRT 8.4.3.1
with open(module_name, 'rb') as f, trt.Runtime(G_LOGGER) as runtime:
    self.engine = runtime.deserialize_cuda_engine(f.read())
origin_inputshape = (1, args.max_sequence_len)
self.inputs, self.outputs, self.bindings, self.stream = common.allocate_buffers(self.engine)

# Texts with different lengths.
for text in texts:
    encode_inputs = self.tokenizer.encode_plus(text,
                                               add_special_tokens=True,
                                               max_length=max_seq_length,
                                               padding='max_length',
                                               return_attention_mask=True,
                                               truncation=True,
                                               return_tensors="np")
    self.inputs[0].host = encode_inputs['input_ids']
    self.inputs[1].host = encode_inputs['attention_mask']
    self.inputs[2].host = encode_inputs['token_type_ids']

    # I want to set a different shape for each input, but creating the context and
    # setting the binding shapes inside the loop like this is too slow. If I instead
    # set the shape once before the loop using the max shape, every input runs at
    # the max shape, which also slows things down.
    input_shape = np.shape(encode_inputs['input_ids'])
    context = self.engine.create_execution_context()
    context.active_optimization_profile = 0
    context.set_binding_shape(0, input_shape)  # input_ids
    context.set_binding_shape(1, input_shape)  # attention_mask
    context.set_binding_shape(2, input_shape)  # token_type_ids

    outputs = common.do_inference_v2(context, bindings=self.bindings,
                                     inputs=self.inputs, outputs=self.outputs, stream=self.stream)
That's because when dynamic shapes are enabled and you specify a new binding shape for a context, TRT has to run shape inference at the first execution to deduce the shapes of all layers. If you repeat the inference with the same shape, you should observe the inference time dropping back to normal.
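One way to act on that observation, sketched against the snippet from the question (names such as `self.engine`, `self.inputs`, and `common.do_inference_v2` come from there; tokenizing without `padding='max_length'` is an assumption): create the execution context once, and only update the binding shapes when the shape actually changes, so the per-shape setup cost is paid once per distinct shape rather than once per sample.

```python
# Sketch only: reuse one execution context and re-set binding shapes
# only when the input shape actually changes.
context = self.engine.create_execution_context()
context.active_optimization_profile = 0
last_shape = None

for text in texts:
    enc = self.tokenizer.encode_plus(text,
                                     add_special_tokens=True,
                                     max_length=max_seq_length,
                                     truncation=True,
                                     return_attention_mask=True,
                                     return_tensors="np")  # note: no padding to max_length
    shape = enc['input_ids'].shape                          # (1, actual_seq_len)

    if shape != last_shape:
        # Only the first inference after a shape change pays TRT's
        # shape-inference cost.
        context.set_binding_shape(0, shape)  # input_ids
        context.set_binding_shape(1, shape)  # attention_mask
        context.set_binding_shape(2, shape)  # token_type_ids
        last_shape = shape

    self.inputs[0].host = enc['input_ids']
    self.inputs[1].host = enc['attention_mask']
    self.inputs[2].host = enc['token_type_ids']
    outputs = common.do_inference_v2(context, bindings=self.bindings,
                                     inputs=self.inputs, outputs=self.outputs,
                                     stream=self.stream)
```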
Question: when dynamic shape is enabled, how can I use a fixed context shape (1x512) to speed up inference on short texts?
Just use 1x512 as the opt shape when building the engine.
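For reference, a minimal sketch of how that opt shape would be set when building the engine from ONNX. The thread does not show the build script, so the ONNX file name, the input tensor names, and the FP16 flag here are assumptions:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# 'bert.onnx' and the input names below are assumptions for this sketch.
with open('bert.onnx', 'rb') as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # assumed, since the engine file is fp16

profile = builder.create_optimization_profile()
for name in ('input_ids', 'attention_mask', 'token_type_ids'):
    # min 1x1, opt 1x512, max 1x512 as suggested above.
    profile.set_shape(name, min=(1, 1), opt=(1, 512), max=(1, 512))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open('pytorch_model_fp16.trt', 'wb') as f:
    f.write(serialized)
```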
It doesn't seem to work:
| context shape | min/opt/max shape | input length (tokens) | TPS (samples / second) |
|---|---|---|---|
| 1x512 | 1x1/1x512/1x512 | 155 | 110 |
| 1x256 | 1x1/1x512/1x512 | 155 | 223 |
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
from scipy.special import softmax
from transformers import BertTokenizerFast
from tqdm import tqdm

import common
from common import HostDeviceMem

module_name = 'pytorch_model_fp16.trt'
G_LOGGER = trt.Logger(trt.Logger.WARNING)


def allocate_buffer(engine, batch_size, input_shape, output_shape):
    """Allocate host/device buffers for all bindings of the engine."""
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        if engine.binding_is_input(binding):
            if input_shape[0] == 1:
                input_nbytes = 2 * input_nbytes
            device_mem = cuda.mem_alloc(input_nbytes)
            # Host buffer is assigned later, just before inference.
            inputs.append(HostDeviceMem(None, device_mem))
        else:
            host_mem = cuda.pagelocked_empty(output_shape[0] * output_shape[1], dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            outputs.append(HostDeviceMem(host_mem, device_mem))
        bindings.append(int(device_mem))
    return inputs, outputs, bindings, stream


def main():
    max_seq_length = 512
    with open(module_name, 'rb') as f, trt.Runtime(G_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    origin_inputshape = (1, max_seq_length)
    inputs, outputs, bindings, stream = allocate_buffer(engine, None,
                                                        input_shape=origin_inputshape,
                                                        output_shape=(1, 2))

    # Binding shape under test: set to 1x256 or 1x512.
    input_shape = origin_inputshape
    context = engine.create_execution_context()
    context.active_optimization_profile = 0
    context.set_binding_shape(0, input_shape)  # input_ids
    context.set_binding_shape(1, input_shape)  # attention_mask
    context.set_binding_shape(2, input_shape)  # token_type_ids

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
    text = '一博| 愿你在岁月中更加平湖静月,衣襟生花,你好你坏我们都在'
    encode_inputs = tokenizer.encode_plus(text,
                                          add_special_tokens=True,
                                          max_length=max_seq_length,
                                          padding='max_length',
                                          return_attention_mask=True,
                                          truncation=True,
                                          return_tensors="np")

    for i in tqdm(range(1000)):
        inputs[0].host = encode_inputs['input_ids']
        inputs[1].host = encode_inputs['attention_mask']
        inputs[2].host = encode_inputs['token_type_ids']
        results = common.do_inference_v2(context, bindings=bindings,
                                         inputs=inputs, outputs=outputs, stream=stream)
        logits = results[0]
        probabilities = softmax(logits, axis=0)


if __name__ == '__main__':
    main()
When dynamic shapes are enabled, TRT selects kernel tactics that have the best performance while still being valid for every input shape between the min shape and the max shape. So the best choice is to set the opt shape to the most frequently used shape in your scenario, so that TRT has the best overall performance.
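For reference, the min/opt/max shapes a profile was built with can be inspected from the deserialized engine. A small sketch against the engine file from the script above:

```python
import tensorrt as trt

G_LOGGER = trt.Logger(trt.Logger.WARNING)

with open('pytorch_model_fp16.trt', 'rb') as f, trt.Runtime(G_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# For each optimization profile, print the min/opt/max shape of every input binding.
for profile_idx in range(engine.num_optimization_profiles):
    for binding_idx in range(engine.num_bindings):
        if engine.binding_is_input(binding_idx):
            min_shape, opt_shape, max_shape = engine.get_profile_shape(profile_idx, binding_idx)
            print(profile_idx, engine.get_binding_name(binding_idx),
                  min_shape, opt_shape, max_shape)
```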
I have tested TRT with the code above, but I found no performance difference between opt shapes (1x256 and 1x512) for the same short text.
When dynamic shape is enabled, why doesn't the shorter text get better performance?
If the kernel tactics are selected based on the context shape, should I define multiple contexts in one engine and select a different context for different text lengths?
your "shorter text shape" != opt shape right? trt only make sure the kernel is able to run with the "shorter text shape" but don't guarantee its performance. only optimize performance for the opt shape.
your "shorter text shape" != opt shape right? trt only make sure the kernel is able to run with the "shorter text shape" but don't guarantee its performance. only optimize performance for the opt shape.
So the context shape does not need to be the same as the input shape?
Another question: when the context shape (binding shape) is set smaller, the engine inference gets faster. Is the input being truncated because the context shape is smaller?
| context (binding) shape | input buffer shape | actual input length | TPS (samples / second) |
|---|---|---|---|
| 1x128 | 1x512 | 1x155 | 300 |
| 1x256 | 1x512 | 1x155 | 240 |
| 1x512 | 1x512 | 1x155 | 130 |
Please refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
Is the input being truncated because the context shape is smaller?
Yes, TRT always uses the binding shape as the input shape, e.g. if you set the binding shape to 1x256 but your input buffer is 1x100000, it will only take the first 1x256 as input and run the inference on that.
I feel like you have some incorrect understanding of TRT's dynamic shapes; I would suggest reading our developer guide first :)
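Putting the two answers together, a sketch of what the short-text path could look like, reusing `tokenizer`, `context`, `inputs`, `outputs`, `bindings`, `stream`, and `common.do_inference_v2` from the script earlier (and assuming the engine was built with min 1x1 and max 1x512): tokenize without padding to 512, set the binding shape to the actual token count, and TRT will read only that many tokens from the input buffers.

```python
# Sketch under the assumptions stated above.
enc = tokenizer.encode_plus(text,
                            add_special_tokens=True,
                            return_attention_mask=True,
                            truncation=True,
                            max_length=512,
                            return_tensors="np")  # no padding to max_length
seq_len = enc['input_ids'].shape[1]                # actual token count, e.g. 155
shape = (1, seq_len)

# The binding shape, not the buffer size, decides how much data TRT reads:
# with shape 1x155 only the first 155 tokens of each input buffer are used.
context.set_binding_shape(0, shape)  # input_ids
context.set_binding_shape(1, shape)  # attention_mask
context.set_binding_shape(2, shape)  # token_type_ids

inputs[0].host = enc['input_ids']
inputs[1].host = enc['attention_mask']
inputs[2].host = enc['token_type_ids']
results = common.do_inference_v2(context, bindings=bindings,
                                 inputs=inputs, outputs=outputs, stream=stream)
```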
thx