
How to strictly limit the maximum GPU memory usage and clear the GPU memory cache?

EmmaThompson123 opened this issue 1 year ago · 3 comments

My use case is deploying model inference services in the cloud, using GPU virtualization to split one GPU into multiple instances, each running one model. Since one card has about 22 GB of available memory in total, I divided it into 10 instances, each allocated 2 GB. From the trtexec conversion logs I saw that TensorRT inference for each model consumes roughly 1.6 GB, so I assumed 2 GB per instance would be enough. However, during concurrent testing there appears to have been a memory overflow, meaning the memory used by each model's inference exceeded 2 GB. When converting the ONNX model to a TensorRT engine I set workspace=2048, so it shouldn't exceed 2 GB, right? How can I ensure that the maximum memory used during inference does not exceed the workspace?

Additionally, I'm not sure whether the growing number of inference calls caused memory usage to increase gradually and eventually trigger an OOM error, so I considered clearing the cache after each inference. How do I clear the GPU cache after a model finishes inference? Is it done with torch.cuda.empty_cache(), or does TensorRT have its own API for this? I have posted an issue here with some log information.
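For reference, this is roughly how I check the actual device memory use while the concurrent test is running (a minimal sketch using pynvml; the device index and MiB conversion are illustrative, not from my service code):

```python
import pynvml

# Query the real device-memory footprint while inference requests are in flight,
# to compare against the 2 GB budget per virtualized instance.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # index of the GPU instance under test
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used: {info.used / 1024**2:.0f} MiB / total: {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
```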

EmmaThompson123 avatar Oct 19 '24 17:10 EmmaThompson123

Q: How do I choose the optimal workspace size?
A: Some TensorRT algorithms require additional workspace on the GPU. The method IBuilderConfig::setMemoryPoolLimit() controls the maximum amount of workspace that can be allocated and prevents algorithms that require more workspace from being considered by the builder. At runtime, the space is allocated automatically when creating an IExecutionContext. The amount allocated is no more than is required, even if the amount set in IBuilderConfig::setMemoryPoolLimit() is much higher. Applications should, therefore, allow the TensorRT builder as much workspace as they can afford; at runtime, TensorRT allocates no more than this and typically less. The workspace size may need to be limited to less than the full device memory size if device memory is needed for other purposes during the engine build.
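For example, the workspace pool limit can be set when building an engine from ONNX with the Python API (a minimal sketch assuming TensorRT 8.x and a model file named model.onnx; not the exact build script used here):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Cap the workspace pool at 2 GiB (the Python equivalent of
# IBuilderConfig::setMemoryPoolLimit with MemoryPoolType::kWORKSPACE).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```

Note that this limit only bounds the scratch space the builder's tactics may use; it is not a cap on the total device memory the engine needs at runtime (weights and activation memory are accounted for separately).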

lix19937 avatar Oct 22 '24 09:10 lix19937

Please also see the memory section of the developer guide for more info: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#memory
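In particular, the per-context activation memory can be queried and, if desired, managed manually (a minimal sketch assuming a serialized engine file named model.engine; the workspace limit set at build time does not bound this value):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Scratch (activation) memory each execution context will allocate; the total
# runtime footprint is this plus the engine's own weight memory and CUDA overhead.
print("per-context device memory:", engine.device_memory_size, "bytes")

# Optionally create the context without its own allocation and supply a buffer
# yourself, e.g. one shared buffer for contexts that never run concurrently.
context = engine.create_execution_context_without_device_memory()
# context.device_memory = <device pointer to at least engine.device_memory_size bytes>
```

As far as I know there is no TensorRT equivalent of torch.cuda.empty_cache(); device memory held by an engine or execution context is released when those objects are destroyed.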

yuanyao-nv avatar Oct 28 '24 03:10 yuanyao-nv

> Please also see the memory section of the developer guide for more info: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#memory

@yuanyao-nv Thank you, it is helpful!

EmmaThompson123 avatar Oct 29 '24 02:10 EmmaThompson123