✨[Feature] Delayed Initialization for `TRTModule` Classes
Context
For models requiring fallback to Torch due to converter capabilities, custom operators, or other needs, each of the `TRTEngine` objects is compiled, initialized, inserted into the Torch `nn.Module`, and runtime-ready at compile time. This takes up an unnecessary amount of GPU memory during compilation.
Proposal
Use the GPU as a build space for `TRTEngine` objects, but do not deserialize or initialize the engines until the first forward pass, similar to what is done here:
https://github.com/pytorch/TensorRT/blob/ad74a735056667726692c49a175a790647ef889e/py/torch_tensorrt/fx/trt_module.py#L25-L27
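For reference, a minimal sketch of that deferred-initialization pattern, assuming the serialized engine is held as bytes; the class name `LazyTRTModule` and the execution details are illustrative only, not the existing `TRTModule` API:

```python
import tensorrt as trt
import torch


class LazyTRTModule(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, serialized_engine: bytes):
        super().__init__()
        # Only the serialized engine (host memory) is kept at compile time
        self.serialized_engine = serialized_engine
        self.initialized = False
        self.engine = None
        self.context = None

    def _initialize(self):
        # Deserialization is what allocates device memory for the engine,
        # so it is deferred until the module is first used
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        self.context = self.engine.create_execution_context()
        self.initialized = True

    def forward(self, *inputs):
        if not self.initialized:
            self._initialize()
        # ... tensor binding and context execution omitted in this sketch
```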
API Details
The `TRTModule` objects will take a parameter, `construct_live=True`, which can be set to `False` if it is desired to initialize the engines at the first `forward` pass, thereby avoiding unnecessary GPU memory usage during compilation. After an engine is built at compile time, the serialized object is moved to host memory until runtime, at which point it is initialized. `check_initialized()` is called on every `forward` pass, and has a measurable effect only on the first inference pass, at which point the engines are moved from host to device memory for use.
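A hedged sketch of how the proposed pieces could fit together; the names `construct_live` and `check_initialized()` come from the proposal text above, while the class name and everything else are assumptions for illustration:

```python
import tensorrt as trt
import torch


class TRTModuleSketch(torch.nn.Module):  # hypothetical stand-in for TRTModule
    def __init__(self, serialized_engine: bytes, construct_live: bool = True):
        super().__init__()
        self.serialized_engine = serialized_engine  # serialized form stays on the host
        self.engine = None
        self.context = None
        if construct_live:
            # Current behavior: deserialize immediately, so the engine occupies
            # device memory from compile time onward
            self._initialize()

    def _initialize(self):
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        self.context = self.engine.create_execution_context()

    def check_initialized(self):
        # Cheap guard run on every forward pass; it only does real work once,
        # on the first inference pass when construct_live=False was used
        if self.engine is None:
            self._initialize()

    def forward(self, *inputs):
        self.check_initialized()
        # ... input/output binding and engine execution omitted in this sketch
```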
Additional Ideas + Notes
- Shard compilable subgraphs based on estimated memory cost during compilation
- Better estimate the necessary workspace size (see the sketch below)
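For context on the second bullet, the workspace is the scratch-memory budget handed to the TensorRT builder; a minimal snippet of the knob a tighter estimate would feed into (the 1 GiB value is just a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Cap the scratch ("workspace") memory TensorRT may use for this engine;
# a tighter per-subgraph estimate is what the note above suggests
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB placeholder
```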
@gs-olive would you ever want to set `construct_live=False` in the compile path?
It sounds like this feature reduces device memory pressure between compilation and execution at the cost of added latency on first execution -- is that right? Is there any change in the time to compile? How much latency do you expect this feature to add to first execution?
> would you ever want to set `construct_live=False` in the compile path
`construct_live=False` would specify offloading the Engines to CPU, whereas `construct_live=True` would be the current behavior. If one doesn't want to pay the overhead of initialization at runtime, or knows beforehand that the model will have only one engine, then it could be a useful toggle to have.
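To make the trade-off concrete, hypothetical usage from the compile path, assuming the flag were surfaced through `torch_tensorrt.compile` (it is not an existing option; the keyword is shown purely for illustration):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
).eval().cuda()
example_inputs = [torch.randn(1, 3, 224, 224).cuda()]

# Current behavior (construct_live=True in the proposal): every TRTEngine is
# deserialized and resident in device memory as soon as compilation finishes
trt_model = torch_tensorrt.compile(model, inputs=example_inputs)

# Proposed behavior: serialized engines are parked in host memory after the
# build and only moved to the GPU + initialized on the first forward pass
# (hypothetical keyword, per this feature request):
# trt_model = torch_tensorrt.compile(model, inputs=example_inputs, construct_live=False)
```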
> It sounds like this feature reduces device memory pressure between compilation and execution at the cost of added latency on first execution
Yes, that is correct, with the key note that it would effectively give each engine being built the full workspace (because the other engines would be stored on the host), and then move all built engines to the GPU and initialize them at the first run.
> Is there any change in the time to compile?
Compile time could increase slightly here, as data will be moved from GPU to CPU memory; however, the expected time for this operation is small relative to overall compilation.
> How much latency do you expect this feature to add to first execution?
This is difficult to gauge without a prototype, and would depend on the time taken to move the engine(s) over to the GPU and initialize them. A rough estimate is that it scales with model size (approximately how long it would take to load the model from disk to the GPU in the first place).