Concurrent inference failure with TensorRT 8.6.1 when running the open_clip visual model TensorRT engine on an A100 GPU
Description
I compiled the image part of the open_clip model (a PyTorch model, https://github.com/mlfoundations/open_clip) in a Python environment using TensorRT 8.6.1 and obtained an engine. Then I developed a service that loads the TensorRT engine, accepts HTTP POST requests, runs inference, and returns the results. The service is written in Python, not C++. Here is what I observed:
- When I send only one request at a time, the service works normally and the model infers and returns correct results.
- When I send 5 requests at the same time (concurrent requests using a Python process pool), the model errors out. From what I've read, it seems that a TensorRT engine is not thread-safe during concurrent inference. What should I do to make the model support concurrent requests? The client script used for the concurrent test is below:
# post_request_process.py
# Sends 10 identical POST requests to the service through a pool of 5 worker processes.
from multiprocessing import Pool
from typing import Dict, List

import json

import requests
from tqdm import tqdm


def post_req(param_dict: Dict):
    url = param_dict['url']
    json_data = param_dict["json_data"]
    headers = param_dict["headers"]
    res = requests.post(url, headers=headers, json=json_data).text
    return res


def multi_process(url: str, json_data: Dict, headers: Dict[str, str]):
    param_list = [{"url": url, "json_data": json_data, "headers": headers}] * 10
    pool = Pool(processes=5)
    tqdm_kwargs = dict(total=len(param_list), desc='cal video total time')
    res_list: List = []
    for res in tqdm(pool.imap_unordered(post_req, param_list), **tqdm_kwargs):
        res_list.append(res)
    pool.close()
    pool.join()
    return res_list


url = 'http://192.168.0.198:8001'
with open('/home/dengxiaoyu/PycharmProjects/rxzn/eas_demo/open_clip_trt_img/tests/img.json', 'r') as file:
    data = json.load(file)
headers = {"content-type": "application/json"}
res_list = multi_process(url, data, headers)
error_cnt = 0
for i, res in enumerate(res_list):
    print(f"res={res}")
Environment
TensorRT Version: 8.6.1
NVIDIA GPU: A100 (80 GB)
NVIDIA Driver Version: 535.54.03
CUDA Version: 12.2
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.13.1+cu116
Baremetal or Container (if so, version):
Relevant Files
Model link: https://github.com/mlfoundations/open_clip
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?: No
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
What is your TRT infer code?
> What is your TRT infer code?

My inference code is below. I didn't set up any multi-process or multi-threaded inference operations in the code:
import open_clip

from openclip_trt.tensorrt_utils import TensorRTModel

# TensorRT engine exported from the open_clip ViT-bigG-14 text encoder
txt_trt_model_path = "/media/star/8T/model/clip/open_clip/tensorrt-8.6.1/a100/trt/python/batch_dynamic/ViT-bigG-14.txt.fp32.trt.engine"
txt_trt_model = TensorRTModel(txt_trt_model_path)
texts = [
    "NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference",
    "It is designed to work in a complementary fashion with training frameworks such as TensorFlow, PyTorch, and MXNet.",
    "It focuses specifically on running an already-trained network quickly and efficiently on NVIDIA hardware.",
    "hello world",
    "Xi Jinping Thought on Culture reveals the outstanding characteristics of Chinese civilization and discusses the theories, principles and philosophy of cultural exchanges.",
    "According to Xi Jinping Thought on Culture, civilizational exchanges can transcend barriers and conflicts, and inter-civilizational interactions can boost the harmonious development of civilizations",
    "No civilization can exist independently, or by refusing to interact with other civilizations",
    "The coexistence of and exchanges between civilizations are the norm, with all civilizations moving toward a harmonious future.",
    "Marxism reveals the characteristics of human civilization.",
    "Science and technology play a fundamental role in transforming agriculture and enhancing food security",
    "Sun said small-scale farming is common in both China and many African countries",
    "The academy cooperates with 23 African countries and nine international organizations",
    "By helping to build biogas facilities and conduct technology demonstrations in countries such as Tanzania, Mauritania and Angola, the academy has supported the adoption of renewable energy sources and promoted resource efficiency in agricultural production.",
    "Rather than looking to other regions with different contexts, it would be more beneficial for African nations to glean insights and experience from China's journey, given the shared historical challenges and the success China has achieved"
]
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
texts_token = tokenizer(texts).cuda()
trt_text_features = txt_trt_model(inputs={'text': texts_token})['unnorm_text_features']
The TensorRTModel class:
import os
from typing import Dict, List

import tensorrt as trt
import torch

# TRT_LOGGER, get_binding_idxs, get_output_tensors and track_infer_time are
# defined elsewhere in openclip_trt.tensorrt_utils.


class TensorRTModel(object):
    def __init__(self, engine_path):
        print(f'load engine_path is {engine_path}')
        self.engine = self.load_engine(engine_path)
        assert self.engine
        profile_index = 0
        # a single execution context is created here and reused by every __call__
        self.context = self.engine.create_execution_context()
        self.context.set_optimization_profile_async(
            profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
        )
        self.input_binding_idxs, self.output_binding_idxs = get_binding_idxs(self.engine, profile_index)

    def load_engine(self, engine_file_path):
        assert os.path.exists(engine_file_path)
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
        # with open(engine_file_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.ERROR)) as runtime:
        #     engine = runtime.deserialize_cuda_engine(f.read())
        #     return engine

    def __call__(self, inputs, time_buffer=None):
        input_tensors: List[torch.Tensor] = list()
        for i in range(self.context.engine.num_bindings):
            if not self.context.engine.binding_is_input(index=i):
                continue
            tensor_name = self.context.engine.get_binding_name(i)
            assert tensor_name in inputs, f"input not provided: {tensor_name}"
            tensor = inputs[tensor_name]
            assert isinstance(tensor, torch.Tensor), f"unexpected tensor class: {type(tensor)}"
            assert tensor.device.type == "cuda", f"unexpected device type (trt only works on CUDA): {tensor.device.type}"
            # warning: small changes in output if int64 is used instead of int32
            if tensor.dtype in [torch.int64, torch.long]:
                # logging.warning(f"using {tensor.dtype} instead of int32 for {tensor_name}, will be cast to int32")
                tensor = tensor.type(torch.int32)
            input_tensors.append(tensor)
        # calculate input shapes, bind them, and allocate GPU memory for the outputs
        outputs: Dict[str, torch.Tensor] = get_output_tensors(
            self.context, input_tensors, self.input_binding_idxs, self.output_binding_idxs
        )
        bindings = [int(i.data_ptr()) for i in input_tensors + list(outputs.values())]
        if time_buffer is None:
            self.context.execute_v2(bindings=bindings)
        else:
            with track_infer_time(time_buffer):
                self.context.execute_v2(bindings=bindings)
        torch.cuda.current_stream().synchronize()  # sync all CUDA ops
        return outputs
If all of your requests are sent to a single process (including the TRT inference), there is no problem.
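For illustration, a minimal sketch of that pattern is below: the engine is loaded once in a single process and requests are handled one at a time, so the engine and its execution context are never used concurrently. The HTTP framework (Python's built-in http.server), the engine path, and the "image" input name are assumptions for the sketch, not the author's actual service code.

```python
# single_process_server.py -- hypothetical sketch, not the author's actual service code.
# http.server handles requests one at a time in a single process, so the engine
# and its execution context are never touched concurrently.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import torch

from openclip_trt.tensorrt_utils import TensorRTModel  # wrapper class shown above

ENGINE_PATH = "/path/to/open_clip_image.fp32.trt.engine"  # placeholder path
TRT_MODEL = TensorRTModel(ENGINE_PATH)  # loaded exactly once, owned by this process


class InferHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        # "image" is an assumed input binding name; query the engine for the real one.
        image = torch.tensor(payload["image"], dtype=torch.float32).cuda()
        outputs = TRT_MODEL(inputs={"image": image})
        result = {k: v.cpu().tolist() for k, v in outputs.items()}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8001), InferHandler).serve_forever()
```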
@AmazDeng I think you didn't show how you hook the TensorRTModel to your Python multiprocessing pool. The issue is that a single TensorRT engine is not supposed to be held by multiple processes. Without knowing much about your use case, I think you can do one of the following:
- On the service side, make sure there is a single process that owns the TRT engine. I'm not entirely sure how to do that with a Python REST API service though.
- On the service side, create a new execution context every time you handle a request (see the sketch below).
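A rough sketch of the second suggestion, assuming the engine is deserialized once at startup and every request builds and discards its own execution context; `run_inference` is a placeholder for the binding/execute logic already shown in `TensorRTModel.__call__`:

```python
# per_request_context.py -- hypothetical sketch of the second option above.
# The ICudaEngine is shared and effectively read-only; each request gets its own
# IExecutionContext, which is the object that must not be shared across requests.
import threading

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)


def load_engine(path: str) -> trt.ICudaEngine:
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


ENGINE = load_engine("/path/to/model.trt.engine")  # deserialized exactly once
CREATE_LOCK = threading.Lock()  # conservatively serialize context creation


def handle_request(run_inference):
    with CREATE_LOCK:
        context = ENGINE.create_execution_context()  # fresh context per request
    try:
        # run_inference is a placeholder: bind inputs/outputs on this context and
        # call execute_v2 / execute_async_v3, as in TensorRTModel.__call__.
        return run_inference(context)
    finally:
        del context  # drop the per-request context and its device memory
```

Creating a context per request allocates its activation memory every time, so for higher request rates a small pool of pre-created contexts (one per worker) is usually cheaper.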
@LeoZDong
Does TensorRT support multithreaded inference? Note that I mean multithreading, not multiprocessing.
> @LeoZDong
> Does TensorRT support multithreaded inference? Note that I mean multithreading, not multiprocessing.
Yes. Look at the execute_async_v3 method (https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html#tensorrt.IExecutionContext.execute_async_v3). You will need to pass in a CUDA stream handle for each thread.
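For reference, a minimal sketch of that approach on TensorRT 8.6 could look like the following: one shared engine, plus a private execution context and CUDA stream per thread. The engine path, tensor names, and output dtype are assumptions and would need to be adapted to the actual model:

```python
# multithreaded_trt.py -- hypothetical sketch of multithreaded inference with
# execute_async_v3: one shared ICudaEngine, one IExecutionContext and one CUDA
# stream per thread (kept in thread-local storage).
import threading

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)

with open("/path/to/model.trt.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    ENGINE = runtime.deserialize_cuda_engine(f.read())  # shared, read-only

_local = threading.local()  # per-thread context and stream


def _thread_state():
    if not hasattr(_local, "context"):
        _local.context = ENGINE.create_execution_context()
        # if the engine has several optimization profiles, also call
        # context.set_optimization_profile_async(...) here
        _local.stream = torch.cuda.Stream()
    return _local.context, _local.stream


def infer(inputs: dict) -> dict:
    """inputs: mapping of input tensor name -> CUDA torch.Tensor."""
    context, stream = _thread_state()
    outputs = {}
    names = [ENGINE.get_tensor_name(i) for i in range(ENGINE.num_io_tensors)]
    with torch.cuda.stream(stream):
        # 1) bind inputs and set their shapes first so output shapes can be resolved
        for name in names:
            if ENGINE.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                tensor = inputs[name]
                context.set_input_shape(name, tuple(tensor.shape))
                context.set_tensor_address(name, tensor.data_ptr())
        # 2) allocate and bind outputs on this thread's stream
        for name in names:
            if ENGINE.get_tensor_mode(name) == trt.TensorIOMode.OUTPUT:
                shape = tuple(context.get_tensor_shape(name))
                out = torch.empty(shape, dtype=torch.float32, device="cuda")  # assumed dtype
                context.set_tensor_address(name, out.data_ptr())
                outputs[name] = out
        context.execute_async_v3(stream_handle=stream.cuda_stream)
    stream.synchronize()
    return outputs
```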