TensorRT-LLM [branch v0.12.0-jetson] [trtllm-build killed due to insufficient memory] [Phi-3-medium-128k-instruct]
Greetings, everyone. 0. I am trying to use TensorRT-LLM (branch v0.12.0-jetson) to deploy the microsoft/Phi-3-medium-128k-instruct LLM. The guide can be found here: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson/examples/phi
- The commands I ran are listed as follows:
1.1 This command completed successfully:
python ./convert_checkpoint.py \
    --model_dir /home/nvidia/.cache/huggingface/hub/models--microsoft--Phi-3-medium-128k-instruct/snapshots/fa7d2aa4f5ea69b2e36b20d050cdae79c9bfbb3f \
    --output_dir ./phi-checkpoint-float16 \
    --dtype float16
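As a quick sanity check (my own note, not part of the official guide), the converted float16 checkpoint of a roughly 14B-parameter model should already take on the order of 26-28 GB on disk (2 bytes per parameter), which hints at how much memory the build step will need:

# Optional sanity check of the converted checkpoint size (same path as above)
du -sh ./phi-checkpoint-float16
ls -lh ./phi-checkpoint-float16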
1.2 This command FAILED:
trtllm-build \
    --checkpoint_dir ./phi-checkpoint-float16 \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 1024 \
    --max_seq_len 2048
[03/22/2025-00:10:15] [TRT-LLM] [W] Implicitly setting Phi3Config.longrope_scaling_short_factors = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.01, 1.02, 1.02, 1.04, 1.04, 1.07, 1.07, 1.1, 1.3000000000000003, 1.3000000000000003, 1.5000000000000004, 1.5700000000000005, 1.9000000000000008, 2.3100000000000014, 2.759999999999992, 3.3899999999999784, 3.9399999999999666, 4.009999999999965, 4.289999999999959, 4.349999999999958, 5.349999999999937, 6.659999999999909, 7.029999999999901, 7.51999999999989, 8.00999999999988, 8.249999999999876, 8.279999999999875, 9.629999999999846, 9.89999999999984, 10.589999999999826, 11.049999999999816, 11.7899999999998, 12.189999999999792, 12.889999999999777, 13.129999999999772, 13.16999999999977, 13.20999999999977, 13.479999999999764, 13.539999999999763, 13.779999999999758, 13.929999999999755, 14.429999999999744, 14.759999999999737, 15.149999999999729, 15.419999999999723, 15.53999999999972, 15.659999999999718, 15.749999999999716, 15.759999999999716, 15.799999999999715, 16.05999999999971, 16.079999999999714, 16.11999999999972, 16.11999999999972, 16.18999999999973, 16.31999999999975, 16.539999999999786, 16.799999999999827]
[03/22/2025-00:10:15] [TRT-LLM] [W] Implicitly setting Phi3Config.longrope_scaling_long_factors = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 1.25, 1.5, 2.0, 2.75, 5.75, 5.75, 6.5, 9.25, 11.0, 13.25, 19.25, 19.75, 19.75, 21.25, 21.5, 26.5, 30.0, 33.75, 35.25, 38.5, 42.0, 42.25, 46.0, 47.0, 50.0, 50.5, 51.0, 52.0, 52.75, 53.75, 54.75, 57.0, 57.25, 58.5, 59.25, 59.5, 62.0, 62.5, 62.75, 63.25, 63.25, 63.25, 63.75, 64.0, 64.0, 64.25, 64.5, 64.5, 65.0, 65.0]
[03/22/2025-00:10:16] [TRT-LLM] [I] Set dtype to float16.
[03/22/2025-00:10:16] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[03/22/2025-00:10:16] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[03/22/2025-00:10:16] [TRT] [I] [MemUsageChange] Init CUDA: CPU +12, GPU +0, now: CPU 166, GPU 10817 (MiB)
[03/22/2025-00:10:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +927, GPU +755, now: CPU 1136, GPU 11617 (MiB)
[03/22/2025-00:10:18] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[03/22/2025-00:10:18] [TRT-LLM] [I] Set nccl_plugin to None.
[03/22/2025-00:10:20] [TRT-LLM] [I] Total optimization profiles added: 1
[03/22/2025-00:10:20] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[03/22/2025-00:10:20] [TRT] [W] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[03/22/2025-00:10:20] [TRT] [W] Unused Input: position_ids
[03/22/2025-00:10:20] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[03/22/2025-00:10:20] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/0/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/1/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/2/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/3/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/4/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/5/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/6/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/7/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/8/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/9/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/10/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/11/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/12/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/13/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/14/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/15/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/16/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/17/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/18/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/19/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/20/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/21/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/22/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/23/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/24/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/25/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/26/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/27/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/28/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/29/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/30/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/31/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/32/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/33/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/34/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/35/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/36/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/37/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/38/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:20] [TRT] [W] Was not able to infer a kOPT value for tensor Phi3ForCausalLM/transformer/layers/39/attention/max_L3090/reduce_L3036/REDUCE_MAX_0_output_0. Using one(s).
[03/22/2025-00:10:25] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[03/22/2025-00:10:25] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[03/22/2025-00:11:12] [TRT] [I] Total Host Persistent Memory: 111680
[03/22/2025-00:11:12] [TRT] [I] Total Device Persistent Memory: 0
[03/22/2025-00:11:12] [TRT] [I] Total Scratch Memory: 167804928
[03/22/2025-00:11:12] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 856 steps to complete.
[03/22/2025-00:11:13] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 816.187ms to assign 136 blocks to 856 nodes requiring 1214297600 bytes.
[03/22/2025-00:11:13] [TRT] [I] Total Activation Memory: 1214290432
[03/22/2025-00:11:13] [TRT] [I] Total Weights Memory: 28054694912
[03/22/2025-00:11:13] [TRT] [I] Engine generation completed in 53.2362 seconds.
[03/22/2025-00:11:13] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 64 MiB, GPU 49152 MiB
[1] 7334 killed trtllm-build --checkpoint_dir ./phi-checkpoint-float16 --output_dir float16
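For what it is worth, here is my reading of the numbers above (please correct me if I am wrong): the builder reports about 26 GiB of float16 weights (Total Weights Memory 28,054,694,912 bytes) and a peak GPU allocator usage of 49,152 MiB (48 GiB), and the process is killed right after "Engine generation completed", presumably while the roughly 26 GiB engine is being written out. On a UMA device the GPU allocations, the host-side copies of the weights, and the serialized engine all come out of the same ~60 GB pool, so the kernel's OOM killer terminates trtllm-build. A rough back-of-envelope check, using only values copied from the log:

# Back-of-envelope memory check based on the log values above
echo "fp16 weights:       $((28054694912 / 1024 / 1024)) MiB"   # ~26 GiB
echo "activation memory:  $((1214290432 / 1024 / 1024)) MiB"    # ~1.1 GiB
echo "peak GPU allocator: 49152 MiB"                            # 48 GiB reported by TensorRT
# 48 GiB of builder peak plus host-side buffers during engine serialization can
# easily exceed the ~60 GB of unified memory available on the AGX Orin 64GB model.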
I guess someone may suggest that I find a GPU with huge VRAM to bypass this issue. Here is my point:
- The Jetson AGX Orin uses a unified memory architecture (UMA), so roughly 60 GB of its 64 GB of system memory is usable as VRAM.
- A GPU with more than 60 GB of VRAM (e.g., 80 GB) is a very high-end card, which is quite unlikely to be available to me.
- Here comes my question: if the Jetson AGX Orin (64GB model) is the only equipment available to me at the moment, is there any approach to overcome this issue while still using Phi-3-medium-128k-instruct (not another model) and keeping float16 precision? A rough sketch of what I am considering is included right after this list.
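For concreteness, here is the rough, unverified sketch mentioned above; the swap path, swap size, and reduced build shapes are just examples for discussion, not something I have confirmed to work:

# Unverified idea: add a large swap file on NVMe so host-side build allocations can spill
sudo fallocate -l 32G /mnt/nvme/swapfile   # example path and size only
sudo chmod 600 /mnt/nvme/swapfile
sudo mkswap /mnt/nvme/swapfile
sudo swapon /mnt/nvme/swapfile
# Retry the build with the smallest shapes I can accept; --max_input_len is dropped
# because the log above says it is ignored when padding removal and fMHA are enabled
trtllm-build \
    --checkpoint_dir ./phi-checkpoint-float16 \
    --output_dir ./phi-engine \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_seq_len 2048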
Discussion is very welcome here. Thank you very much for any hints or information.
Sounds like mission impossible ..