
How to build TensorRT-LLM engine on host and deploy to Jetson Orin Nano Super?

Sesameisgod opened this issue 8 months ago

Hi, I’m currently working with TensorRT-LLM and trying to deploy a model (e.g., Qwen2-VL-2B-Instruct) on a Jetson Orin Nano Super. However, due to limited memory on the Nano, I’m unable to build the TensorRT engine directly on the device.

Is there any official or recommended approach to build the TensorRT-LLM engine on a more powerful host machine (with sufficient memory and GPU), and then transfer the generated engine file to the Jetson Orin Nano Super for inference?

If so, are there any considerations or compatibility issues I should be aware of when cross-building the engine on x86 and deploying it on Jetson (aarch64)?

Thanks in advance!

Sesameisgod avatar Mar 29 '25 12:03 Sesameisgod

@Sesameisgod Hi, TensorRT-LLM now has two backends: one based on TensorRT (the first workflow supported in TensorRT-LLM) and the other based on PyTorch (newly supported since the 0.17 release).

The TensorRT workflow requires an AoT (ahead-of-time) tuning phase to select the best sequence of kernels, so although it is technically possible to build the TensorRT engine on another GPU with a similar hardware architecture, it is not the recommended approach.

June

juney-nvidia avatar Mar 29 '25 13:03 juney-nvidia

  • @sunnyqgg for visibility, in case she has more input on this question.

June

juney-nvidia avatar Mar 29 '25 13:03 juney-nvidia

Thank you for your response!

I’d like to follow up and ask — is there any recommended approach for building a TensorRT engine for Qwen2-VL-2B-Instruct directly on a Jetson Orin Nano Super (8GB RAM)?

I’ve tested running the model via Hugging Face Transformers on the Nano and it works successfully, which suggests that the model can run on the device.

However, an OOM error occurs during the TensorRT engine building phase. Are there any strategies (e.g., using swap) to make engine building feasible directly on the Nano?

Sesameisgod avatar Mar 29 '25 14:03 Sesameisgod

@Sesameisgod I am not aware that the TensorRT engine building process can use swap memory during offline engine building.

An alternative is to try running the Qwen2-VL model with the newly introduced PyTorch workflow (see the sketch after these links):

  • https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/pytorch
  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_multimodal.py#L107
  • https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_qwen2vl.py
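For reference, a rough invocation of the multimodal quickstart script might look like the following. This is only a sketch: the flag names (--model_dir, --modality, --prompt, --media) and the sample image path are assumptions, so check python3 quickstart_multimodal.py --help in your checkout for the actual CLI.

# Hedged sketch: run Qwen2-VL through the PyTorch workflow quickstart
cd TensorRT-LLM/examples/pytorch
python3 quickstart_multimodal.py \
    --model_dir Qwen/Qwen2-VL-2B-Instruct \
    --modality image \
    --prompt "Describe this image." \
    --media ./sample.jpg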

The TensorRT-LLM PyTorch workflow was introduced in the 0.17 release. Based on our internal performance evaluation of popular models like LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow. The customized kernels are reused in both workflows (as plugins in the TensorRT workflow and as torch custom ops in the PyTorch workflow), the existing C++ runtime building blocks (BatchManager/KV CacheManager/etc.) are also shared, and more optimizations are being added to the PyTorch workflow.

We are also currently shifting more attention to enhancing the PyTorch workflow; for example, the recently announced DeepSeek R1 performance numbers are all based on the PyTorch workflow.

What we cannot commit to right now, due to bandwidth limitations, is official support for the Jetson platform. So you will need to try running TensorRT-LLM on Jetson yourself and observe the behavior.

Thanks, June

juney-nvidia avatar Mar 30 '25 01:03 juney-nvidia

Got it, I’ll try running Qwen2-VL with the PyTorch workflow on the Jetson Orin Nano Super and see how it performs. Really appreciate your help!

Sesameisgod avatar Mar 30 '25 10:03 Sesameisgod

@Sesameisgod Just to make sure you are aware of this Qwen2.5-VL effort from @yechank-nvidia:

https://github.com/NVIDIA/TensorRT-LLM/pull/3156/files

Thanks, June

juney-nvidia avatar Mar 31 '25 03:03 juney-nvidia

Hi @Sesameisgod,

  1. You can use swap memory during the engine building process, but in my experience, if you allocate over 8GB the system locks up. TRT engine generation requires about 4x the memory of the model size (see the sketch after this list for a typical way to add swap).
  2. You can try to build a W4A16 (INT4) model engine.
  3. As mentioned above, you can use the same TRT and TRT-LLM versions on a Jetson AGX Orin (64GB) to build the engine and then run it on the Jetson Orin Nano Super.
  4. To save memory in the inference phase, you can use -mmap; please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md#3-reference-memory-usage
  5. We actually have a branch for Jetson devices: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson, but unfortunately it doesn't support Qwen2-VL.
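For item 1, swap is typically added with the standard Linux tools; a minimal sketch (the 16G size and the /ssd/swapfile path are only examples, and the file should live on the SSD rather than the SD card):

# Example only: create and enable a 16 GB swap file before building
sudo fallocate -l 16G /ssd/swapfile
sudo chmod 600 /ssd/swapfile
sudo mkswap /ssd/swapfile
sudo swapon /ssd/swapfile
free -h   # confirm the extra swap is active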

Thanks, Sunny.

sunnyqgg avatar Mar 31 '25 07:03 sunnyqgg

@juney-nvidia That's great to hear! It's really nice to see support for the latest models coming so quickly. Qwen is truly an impressive VLM series.

@sunnyqgg Understood, thank you. For now, we're not considering quantization yet. We're planning to explore building the engine on a Jetson device with 64GB of memory — so far, it seems like Jetson AGX Orin is the only one that meets this requirement, and we’re planning to purchase one for testing.

Also, as you mentioned, TensorRT-LLM v0.12 doesn't support Qwen2-VL, so I'm currently using the aarch64 Docker image from this repository. After running the container and executing the import tensorrt_llm command, the version appears correctly, so it seems to be running, though further testing is needed to confirm whether everything works properly.

If there’s any progress later on, I’ll be happy to share it here. Thanks again for your help!

Sesameisgod avatar Mar 31 '25 16:03 Sesameisgod

Hi @Sesameisgod, I am doing the same thing as you, but I've run into a big problem. My board is a Jetson NX (8GB), and I ran VILA-3B (MLC), which is quite fast, but its understanding ability is a bit poor. Now I am trying to run Qwen2.5-VL and Qwen2-VL. TensorRT-LLM only provides the v0.12 branch for Jetson, and setting up the environment is very troublesome for me. Unfortunately, the TensorRT-LLM Docker image you mentioned cannot be pulled from within China.

xiaohuihuige avatar Apr 01 '25 02:04 xiaohuihuige

Hi @xiaohuihuige, have you tried using a VPN? Or maybe I could share the Docker image with you, though I'm not quite sure what the best way to transfer it would be.

Sesameisgod avatar Apr 01 '25 14:04 Sesameisgod

Hi @Sesameisgod, if I pull it on my Jetson Orin Nano 8GB, can I convert Qwen2-VL-2B to TensorRT and run inference with it?

garvitpathak avatar Apr 07 '25 19:04 garvitpathak

Hi @Sesameisgod, can you tell me how you installed the latest version of TensorRT-LLM on your Jetson Orin Nano?

garvitpathak avatar Apr 08 '25 17:04 garvitpathak

Hi @garvitpathak,

I didn’t install the latest version of TensorRT-LLM directly on the Jetson Orin Nano. Instead, I used the ARM64 image provided by Trystan on Docker Hub (https://hub.docker.com/r/trystan/tensorrt_llm/tags). You can start by pulling the image with:

docker pull trystan/tensorrt_llm:aarch64-0.17.0.post1_90

Then, I used jetson-containers to run the container, which saves the trouble of manually setting a lot of parameters. You can do it like this:

# Navigate to the jetson-containers repo
cd jetson-containers
# Use run.sh to start the container
./run.sh trystan/tensorrt_llm:aarch64-0.17.0.post1_90
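
If jetson-containers isn't set up on the device yet, it can be installed first; a minimal sketch based on the dusty-nv/jetson-containers README:

# Clone dusty-nv's jetson-containers and run its installer
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh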

Once you're inside the container, try running:

python3 -c "import tensorrt_llm"

If everything is set up correctly, it should print the version number of TensorRT-LLM (which is 0.17.0 in this container).

The official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) also includes instructions on how to build the image from source, but I haven’t tried that on the Jetson Orin Nano yet.

Unfortunately, as I mentioned earlier, I ran into an OOM issue and couldn’t successfully convert the original precision Qwen2 VL model to an .engine file. I also haven’t tested whether inference actually works — so far, I’ve only confirmed that importing tensorrt_llm succeeds. Maybe you can try building an INT4 model as suggested by sunnyqgg.

That’s my current progress — hope it helps!

Sesameisgod avatar Apr 09 '25 04:04 Sesameisgod

Hi @Sesameisgod, can you help me with this on the Jetson Orin Nano 8GB? INT4 doesn't seem to be available for Qwen2-VL-2B; if it is, could you provide a link for the conversion?

garvitpathak avatar Apr 09 '25 10:04 garvitpathak

Hi @xiaohuihuige, I am trying to run VILA-3B, but I'm hitting a LLaVA/LLaMA KeyError while converting to TensorRT for quantization. Can you help me with that?

garvitpathak avatar Apr 11 '25 08:04 garvitpathak

Hi @garvitpathak, The official team has released an INT4 version of the model using GPTQ on Hugging Face (https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4), but I haven’t personally tried this model yet.
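
In case it helps, a rough outline of a weight-only INT4 engine build for the language-model part is sketched below. The converter path and flag names are assumptions, and the vision encoder needs its own engine, so please check examples/qwen and examples/multimodal in your TensorRT-LLM release for the exact, supported invocation.

# Hedged sketch: weight-only INT4 checkpoint conversion + engine build for the LLM part
python3 examples/qwen/convert_checkpoint.py \
    --model_dir Qwen/Qwen2-VL-2B-Instruct \
    --output_dir ./qwen2vl_ckpt_int4 \
    --dtype float16 \
    --use_weight_only --weight_only_precision int4
trtllm-build \
    --checkpoint_dir ./qwen2vl_ckpt_int4 \
    --output_dir ./qwen2vl_engine_int4 \
    --gemm_plugin float16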

BTW, I’ve successfully deployed a Docker image with the latest version of TensorRT-LLM on the Jetson Orin Nano using the official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html). For reference, I’m using a 1TB SSD as the system drive along with 64GB of SWAP.
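
For anyone trying to reproduce this, the guide builds the release image roughly as follows. This is only a sketch; 87-real is the SM version of Orin, and the exact make targets and variables should be taken from the guide itself.

# Rough sketch following the build-from-source guide
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
# Restrict the build to the Orin architecture (SM 8.7) to reduce build time and memory
make -C docker release_build CUDA_ARCHS="87-real"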

Sesameisgod avatar Apr 11 '25 12:04 Sesameisgod

Hi @Sesameisgod, can you tell me how you created the swap memory? Step-by-step instructions would be appreciated.

garvitpathak avatar Apr 11 '25 15:04 garvitpathak

Hello, I'm currently trying to run TRT-LLM on a Nano Super 8G as well. I'm attempting to use the officially recommended Meta-Llama-3-8B-Instruct INT4-GPTQ, but it's clearly failing due to insufficient memory. I've tried adding 20GB of SWAP, but it doesn't seem to be utilized. I'd like to ask what kind of model I should switch to so that I can successfully build the engine. I don't have any other more powerful Jetson devices.

Shattered217 avatar Jul 03 '25 07:07 Shattered217

Hi, can you try to use the PyTorch backend? For the TRT backend, engine building will consume about 4x the memory of your model size.

Thanks.

sunnyqgg avatar Jul 03 '25 07:07 sunnyqgg

Hi, can you try to use the PyTorch backend? For the TRT backend, engine building will consume about 4x the memory of your model size.

Thanks.

How do I do that? I'm running into the same problem.

make7s avatar Aug 07 '25 07:08 make7s

On ThorU, DriveOS: 7.0.2.0, TensorRT: 10.10, CUDA: 12.8

[screenshot omitted]

lix19937 avatar Oct 15 '25 13:10 lix19937

awesome https://nvidia.github.io/TensorRT-LLM/0.20.0/reference/support-matrix.html

lix19937 avatar Oct 15 '25 13:10 lix19937

awesome https://nvidia.github.io/TensorRT-LLM/0.20.0/reference/support-matrix.html

Hi,

Great! It seems like you've already made Qwen2.5-VL work on Thor! Is that with the PyTorch backend? Because I see there is no TensorRT backend support for Qwen2.5-VL.

Thanks!


fwzdev1 avatar Oct 25 '25 10:10 fwzdev1

@Sesameisgod Is there anything you can share on how you built the most recent version of TRT-LLM for the Jetson Orin Nano Super?

tjsheth76 avatar Oct 27 '25 18:10 tjsheth76