✨[Feature] Delayed Initialization for `TRTModule` Classes
Context
For models requiring fallback to Torch due to converter capabilities, custom operators, or other needs, each of the `TRTEngine` objects is compiled, initialized, inserted into the Torch `nn.Module`, and runtime-ready at compile time. This takes up an unnecessary amount of GPU memory during compilation.
Proposal
Use the GPU as a build space for `TRTEngine` objects, but do not deserialize or initialize the engines until the first forward pass, similar to what is done here:
https://github.com/pytorch/TensorRT/blob/ad74a735056667726692c49a175a790647ef889e/py/torch_tensorrt/fx/trt_module.py#L25-L27
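For reference, a minimal sketch of that deferred-initialization pattern, assuming the serialized engine is held as bytes; the class name `LazyTRTModule` and the execution details are illustrative only, not the existing `TRTModule` API:

```python
import tensorrt as trt
import torch


class LazyTRTModule(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, serialized_engine: bytes):
        super().__init__()
        # Only the serialized engine (host memory) is kept at compile time
        self.serialized_engine = serialized_engine
        self.initialized = False
        self.engine = None
        self.context = None

    def _initialize(self):
        # Deserialization is what allocates device memory for the engine,
        # so it is deferred until the module is first used
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        self.context = self.engine.create_execution_context()
        self.initialized = True

    def forward(self, *inputs):
        if not self.initialized:
            self._initialize()
        # ... tensor binding and context execution omitted in this sketch
```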
API Details
The `TRTModule` objects will take a parameter, `construct_live=True`, which can be set to `False` if it is desired to initialize the engines at the first `forward` pass, thereby avoiding unnecessary GPU memory usage during compilation. After an engine is built at compile time, the serialized object is moved to host memory until runtime, at which point it is initialized. `check_initialized()` is called on every `forward` pass, and has a measurable effect only on the first inference pass, at which point the engines are moved from host to device memory for use.
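A hedged sketch of how the proposed pieces could fit together; the names `construct_live` and `check_initialized()` come from the proposal text above, while the class name and everything else are assumptions for illustration:

```python
import tensorrt as trt
import torch


class TRTModuleSketch(torch.nn.Module):  # hypothetical stand-in for TRTModule
    def __init__(self, serialized_engine: bytes, construct_live: bool = True):
        super().__init__()
        self.serialized_engine = serialized_engine  # serialized form stays on the host
        self.engine = None
        self.context = None
        if construct_live:
            # Current behavior: deserialize immediately, so the engine occupies
            # device memory from compile time onward
            self._initialize()

    def _initialize(self):
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        self.context = self.engine.create_execution_context()

    def check_initialized(self):
        # Cheap guard run on every forward pass; it only does real work once,
        # on the first inference pass when construct_live=False was used
        if self.engine is None:
            self._initialize()

    def forward(self, *inputs):
        self.check_initialized()
        # ... input/output binding and engine execution omitted in this sketch
```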
Additional Ideas + Notes
- Shard compilable subgraphs based on estimated memory cost during compilation
- Better estimate the necessary workspace size (see the sketch below)
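For context on the second bullet, the workspace is the scratch-memory budget handed to the TensorRT builder; a minimal snippet of the knob a tighter estimate would feed into (the 1 GiB value is just a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Cap the scratch ("workspace") memory TensorRT may use for this engine;
# a tighter per-subgraph estimate is what the note above suggests
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB placeholder
```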
@gs-olive would you ever want to set `construct_live=False` in the compile path?
It sounds like this feature reduces device memory pressure between compilation and execution at the cost of added latency on first execution -- is that right? Is there any change in the time to compile? How much latency do you expect this feature to add to first execution?
> would you ever want to set `construct_live=False` in the compile path
`construct_live=False` would specify offloading the Engines to CPU, whereas `construct_live=True` would be the current behavior. If one doesn't want to pay the overhead of initialization at runtime, or knows beforehand that the model will have only one engine, then it could be a useful toggle to have.
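To make the trade-off concrete, hypothetical usage from the compile path, assuming the flag were surfaced through `torch_tensorrt.compile` (it is not an existing option; the keyword is shown purely for illustration):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
).eval().cuda()
example_inputs = [torch.randn(1, 3, 224, 224).cuda()]

# Current behavior (construct_live=True in the proposal): every TRTEngine is
# deserialized and resident in device memory as soon as compilation finishes
trt_model = torch_tensorrt.compile(model, inputs=example_inputs)

# Proposed behavior: serialized engines are parked in host memory after the
# build and only moved to the GPU + initialized on the first forward pass
# (hypothetical keyword, per this feature request):
# trt_model = torch_tensorrt.compile(model, inputs=example_inputs, construct_live=False)
```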
> It sounds like this feature reduces device memory pressure between compilation and execution at the cost of added latency on first execution
Yes, that is correct, with the key note that it would effectively give each engine being built the full workspace (because the other engines would be stored on the host), and then move all built engines to the GPU and initialize them at the first run.
> Is there any change in the time to compile?
Compile time could increase slightly here, as data will be moved from GPU to CPU memory; however, the expected time for this operation is small relative to overall compilation.
> How much latency do you expect this feature to add to first execution?
This is difficult to gauge without a prototype, and would depend on the time taken to move the engine(s) over to the GPU and initialize them. A rough estimate is that it scales with model size (approximately how long it would take to load the model from disk to the GPU in the first place).