
✨[Feature] Basic GPU and CPU memory control workflow


Problem Description

Torch-TensorRT compilation for large models (such as LLMs and diffusion models) can consume excessive CPU and GPU memory. This often leads to freezes, CUDA OOM errors, TensorRT compilation failures, or the operating system killing the process. The default behavior may use up to 5× the model size in CPU memory and 2× the model size in GPU memory, which is too high for many environments.

Solution

Provide compilation options that reduce redundant model copies in CPU and GPU memory. Specifically:

- Enable a memory-trimming mechanism (`export TRIM_CPU_MEMORY=1`) to reduce redundant CPU copies.
- Provide CPU offloading (`offload_module_to_cpu=True`) to move the original copy of the model to CPU and save GPU memory.
- Provide lazy engine initialization (`lazy_engine_init=True`) to save GPU memory for subsequent compilations when there are graph breaks.
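
A minimal sketch of how these options might be combined, assuming the dynamo path of `torch_tensorrt.compile` and that `TRIM_CPU_MEMORY` is read at compile time (the model below is a placeholder standing in for a large LLM or diffusion model):

```python
import os

# Per this issue: enable the CPU-memory trimming path before compiling.
os.environ["TRIM_CPU_MEMORY"] = "1"

import torch
import torch_tensorrt

# Placeholder model; in practice this would be a large LLM or diffusion model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval().cuda()

inputs = [torch.randn(8, 1024, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    offload_module_to_cpu=True,  # move the original module to CPU to cut GPU usage
    lazy_engine_init=True,       # defer engine setup to save GPU memory across graph breaks
)
```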

| Setting | Effect | Approx. memory ratio |
| --- | --- | --- |
| Default | Baseline behavior | CPU: 5×, GPU: 2× |
| `export TRIM_CPU_MEMORY=1` | Reduces redundant CPU copies | CPU: ~3× |
| `offload_module_to_cpu=False` | Further reduces CPU copies | CPU: ~2× |
| `offload_module_to_cpu=True` | Reduces GPU usage, increases CPU usage | GPU: ~1×, CPU: +1× |
| `lazy_engine_init=True` | Reduces GPU usage when there are multiple subgraphs | Lower GPU memory |
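
As a back-of-the-envelope illustration of what these ratios mean in absolute terms (the helper below is hypothetical, not part of Torch-TensorRT; it only multiplies a model size by the ratios from the table):

```python
# Hypothetical helper: translate the ratios above into peak-memory estimates
# for a given model size (in GiB).
def estimate_peak_memory(model_size_gib: float, cpu_ratio: float, gpu_ratio: float) -> dict:
    return {
        "cpu_gib": model_size_gib * cpu_ratio,
        "gpu_gib": model_size_gib * gpu_ratio,
    }

# A 14 GiB model under the default path (CPU 5x, GPU 2x)
# versus with TRIM_CPU_MEMORY=1 (CPU ~3x, GPU unchanged):
print(estimate_peak_memory(14.0, cpu_ratio=5.0, gpu_ratio=2.0))  # {'cpu_gib': 70.0, 'gpu_gib': 28.0}
print(estimate_peak_memory(14.0, cpu_ratio=3.0, gpu_ratio=2.0))  # {'cpu_gib': 42.0, 'gpu_gib': 28.0}
```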

Proper configuration ensures efficient resource use, stable compilation, and predictable performance for large-scale models.
