
✨[Feature] Basic GPU and CPU memory control workflow


Problem Description

Torch-TensorRT compilation for large models (such as LLMs and diffusion models) can consume excessive CPU and GPU memory. This often leads to freezes, CUDA OOM errors, TensorRT compilation failures, or the operating system killing the process. The default behavior may use up to 5× the model size in CPU memory and 2× the model size in GPU memory, which is too high for many environments.

Solution

Provide compilation options that reduce redundant model copies in CPU and GPU memory. Specifically:

- Enable a memory-trimming mechanism (`export TRIM_CPU_MEMORY=1`) to reduce redundant CPU copies.
- Provide CPU offloading (`offload_module_to_cpu=True`) to move the original copy of the model to CPU and save GPU memory.
- Provide lazy engine initialization (`lazy_engine_init=True`) to save GPU memory for subsequent compilations when there are graph breaks.
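
A minimal sketch of how these options might be combined, assuming the dynamo path of `torch_tensorrt.compile` and that `TRIM_CPU_MEMORY` is read at compile time (the model below is a placeholder standing in for a large LLM or diffusion model):

```python
import os

# Per this issue: enable the CPU-memory trimming path before compiling.
os.environ["TRIM_CPU_MEMORY"] = "1"

import torch
import torch_tensorrt

# Placeholder model; in practice this would be a large LLM or diffusion model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval().cuda()

inputs = [torch.randn(8, 1024, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    offload_module_to_cpu=True,  # move the original module to CPU to cut GPU usage
    lazy_engine_init=True,       # defer engine setup to save GPU memory across graph breaks
)
```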

| Setting | Effect | Approx. memory ratio |
| --- | --- | --- |
| Default | Baseline behavior | CPU: 5×, GPU: 2× |
| `export TRIM_CPU_MEMORY=1` | Reduces redundant CPU copies | CPU: ~3× |
| `offload_module_to_cpu=False` | Further reduces CPU copies | CPU: ~2× |
| `offload_module_to_cpu=True` | Reduces GPU usage, increases CPU usage | GPU: ~1×, CPU: +1× |
| `lazy_engine_init=True` | Reduces GPU usage when there are multiple subgraphs | Lower GPU memory |
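
As a back-of-the-envelope illustration of what these ratios mean in absolute terms (the helper below is hypothetical, not part of Torch-TensorRT; it only multiplies a model size by the ratios from the table):

```python
# Hypothetical helper: translate the ratios above into peak-memory estimates
# for a given model size (in GiB).
def estimate_peak_memory(model_size_gib: float, cpu_ratio: float, gpu_ratio: float) -> dict:
    return {
        "cpu_gib": model_size_gib * cpu_ratio,
        "gpu_gib": model_size_gib * gpu_ratio,
    }

# A 14 GiB model under the default path (CPU 5x, GPU 2x)
# versus with TRIM_CPU_MEMORY=1 (CPU ~3x, GPU unchanged):
print(estimate_peak_memory(14.0, cpu_ratio=5.0, gpu_ratio=2.0))  # {'cpu_gib': 70.0, 'gpu_gib': 28.0}
print(estimate_peak_memory(14.0, cpu_ratio=3.0, gpu_ratio=2.0))  # {'cpu_gib': 42.0, 'gpu_gib': 28.0}
```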

Proper configuration ensures efficient resource use, stable compilation, and predictable performance for large-scale models.
