How to build a TensorRT-LLM engine on a host and deploy it to a Jetson Orin Nano Super?
Hi, I’m currently working with TensorRT-LLM and trying to deploy a model (e.g., Qwen2-VL-2B-Instruct) on a Jetson Orin Nano Super. However, due to limited memory on the Nano, I’m unable to build the TensorRT engine directly on the device.
Is there any official or recommended approach to build the TensorRT-LLM engine on a more powerful host machine (with sufficient memory and GPU), and then transfer the generated engine file to the Jetson Orin Nano Super for inference?
If so, are there any considerations or compatibility issues I should be aware of when cross-building the engine on x86 and deploying it on Jetson (aarch64)?
Thanks in advance!
@Sesameisgod Hi, TensorRT-LLM has two backends now: one based on TensorRT (the original workflow supported in TensorRT-LLM) and the other based on PyTorch (newly supported since the 0.17 release).
For the TensorRT workflow, an AoT (ahead-of-time) tuning phase is required to select the best combination of kernels, so although it is technically possible to build the TensorRT engine on another GPU with a similar hardware architecture, it is not the recommended approach.
June
- @sunnyqgg for visibility, in case she has more input on this question.
June
Thank you for your response!
I’d like to follow up and ask — is there any recommended approach for building a TensorRT engine for Qwen2-VL-2B-Instruct directly on a Jetson Orin Nano Super (8GB RAM)?
I’ve tested running the model via Hugging Face Transformers on the Nano and it works successfully, which suggests that the model can run on the device.
However, the OOM issue occurs during the TensorRT engine-building phase. Are there any strategies (e.g., using swap) to make engine building feasible directly on the Nano?
@Sesameisgod I am not aware that the TensorRT engine-building process can use swap memory.
An alternative is to run the Qwen2-VL model in the newly introduced PyTorch workflow (a rough invocation sketch follows the links below):
- https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/pytorch
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_multimodal.py#L107
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_qwen2vl.py
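For a first try, invoking the multimodal quickstart linked above might look roughly like the following; the flag names are an assumption on my side, so please check the script's --help for the exact interface in your release:
# Run from the TensorRT-LLM repo root inside the container
python3 examples/pytorch/quickstart_multimodal.py --help
# Hypothetical invocation pointing at a local Qwen2-VL checkpoint (flag name is an assumption)
python3 examples/pytorch/quickstart_multimodal.py --model_dir /path/to/Qwen2-VL-2B-Instruct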
The TensorRT-LLM PyTorch workflow has been available since the 0.17 release. Based on our internal performance evaluation on popular models such as LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow: the customized kernels are reused in both workflows (as plugins in the TensorRT workflow and as torch custom ops in the PyTorch workflow), the existing C++ runtime building blocks (BatchManager, KV CacheManager, etc.) are also shared, and more optimizations are being added to the PyTorch workflow.
We are also shifting more of our attention to enhancing the PyTorch workflow; for example, the recently announced DeepSeek-R1 performance numbers are all based on it.
One caveat: due to bandwidth limitations we cannot commit to official support for the Jetson platform yet, so you will need to try running TensorRT-LLM on Jetson yourself and observe the behavior.
Thanks June
Got it, I’ll try running Qwen2-VL with the PyTorch workflow on the Jetson Orin Nano Super and see how it performs. Really appreciate your help!
@Sesameisgod, just to make sure you are aware of this Qwen2.5-VL effort from @yechank-nvidia:
https://github.com/NVIDIA/TensorRT-LLM/pull/3156/files
Thanks June
Hi @Sesameisgod,
- You can use swap memory during the engine-building process, but in my experience, if you allocate more than 8GB of swap the system locks up (see the sketch after this list for setting it up). TRT engine generation requires about 4 times the memory of the model size.
- You can try to build a W4A16 (INT4) engine of the model.
- As discussed, you can use the same TRT and TRT-LLM versions on a Jetson AGX Orin (64GB) to build the engine and then run it on the Jetson Orin Nano Super.
- To reduce memory usage in the inference phase, you can use -mmap; please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md#3-reference-memory-usage
- We actually have a branch for Jetson devices: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.12.0-jetson, but unfortunately it doesn't support Qwen2-VL.
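As a rough sketch of the swap suggestion above (standard Linux commands; the size and path are just examples, and per the note above it may be safer to stay at or below 8GB on the Nano):
# Create and enable an 8GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify it is active
swapon --show
free -h
# To make it persistent across reboots, add this line to /etc/fstab:
# /swapfile none swap sw 0 0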
Thanks, Sunny.
@juney-nvidia That's great to hear! It's really nice to see support for the latest models coming so quickly. Qwen is truly an impressive VLM series.
@sunnyqgg Understood, thank you. For now, we're not considering quantization. We're planning to explore building the engine on a Jetson device with 64GB of memory; so far, the Jetson AGX Orin seems to be the only device that meets this requirement, and we plan to purchase one for testing.
Also, as you mentioned, TensorRT-LLM v0.12 doesn't support Qwen2-VL, so I'm currently using the aarch64 Docker image from this repository. After running the container and executing the import tensorrt_llm command, the version appears correctly, so it seems to be running, though further testing is needed to confirm whether everything works properly.
If there’s any progress later on, I’ll be happy to share it here. Thanks again for your help!
Hi @Sesameisgod, I am trying to do the same thing as you, but I've run into a big problem. My board is a Jetson NX (8GB), and I ran VILA-3B (with MLC), which is quite fast, but its understanding ability is a bit poor. Now I am trying to run Qwen2.5-VL and Qwen2-VL. TensorRT-LLM only provides v0.12 for Jetson, and setting up the environment is very troublesome for me. Unfortunately, the TensorRT-LLM Docker image you mentioned cannot be pulled from China.
Hi @xiaohuihuige, have you tried using a VPN? Or maybe I can share the Docker image with you, though I'm not quite sure what the best way to transfer it would be.
Hi @Sesameisgod, if I pull it on my Jetson Orin Nano 8GB, can I convert Qwen2-VL-2B to a TensorRT engine and run inference with it?
Hi @Sesameisgod, can you tell me how you installed the latest version of TensorRT-LLM on your Jetson Orin Nano?
Hi @garvitpathak,
I didn’t install the latest version of TensorRT-LLM directly on the Jetson Orin Nano. Instead, I used the ARM64 image provided by Trystan on Docker Hub (https://hub.docker.com/r/trystan/tensorrt_llm/tags). You can start by pulling the image with:
docker pull trystan/tensorrt_llm:aarch64-0.17.0.post1_90
Then, I used jetson-containers to run the container, which saves the trouble of manually setting a lot of parameters. You can do it like this:
# Navigate to the jetson-containers repo
cd jetson-containers
# Use run.sh to start the container
./run.sh trystan/tensorrt_llm:aarch64-0.17.0.post1_90
Once you're inside the container, try running:
python3 -c "import tensorrt_llm"
If everything is set up correctly, it should print the version number of TensorRT-LLM (which is 0.17.0 in this container).
The official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html) also includes instructions on how to build the image from source, but I haven’t tried that on the Jetson Orin Nano yet.
Unfortunately, as I mentioned earlier, I ran into an OOM issue and couldn't successfully convert the original-precision Qwen2-VL model to an .engine file. I also haven't tested whether inference actually works; so far, I've only confirmed that importing tensorrt_llm succeeds. Maybe you can try building an INT4 engine as suggested by sunnyqgg (a rough sketch of the conversion pattern follows).
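For reference, the general TRT-workflow pattern for an INT4 weight-only build is a checkpoint conversion followed by trtllm-build. The script path and flags below are assumptions based on the Qwen/LLaMA example READMEs, and the Qwen2-VL vision encoder needs separate handling per the multimodal example, so please verify against the docs for your TRT-LLM version:
# Convert the HF checkpoint to a TensorRT-LLM checkpoint (paths and flags are assumptions)
python3 examples/qwen/convert_checkpoint.py --model_dir ./Qwen2-VL-2B-Instruct \
    --output_dir ./tllm_ckpt --dtype float16 \
    --use_weight_only --weight_only_precision int4
# Build the engine from the converted checkpoint
trtllm-build --checkpoint_dir ./tllm_ckpt --output_dir ./engine_dir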
That’s my current progress — hope it helps!
Hi @Sesameisgod, can you help me with this on the Jetson Orin Nano 8GB? INT4 doesn't seem to be available for Qwen2-VL-2B; if it is, can you provide me with a link for the conversion?
Hi @xiaohuihuige, I am trying to run VILA-3B but I'm hitting a llama/llava KeyError while converting to TensorRT for quantization. Can you help me with that?
Hi @garvitpathak, The official team has released an INT4 version of the model using GPTQ on Hugging Face (https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4), but I haven’t personally tried this model yet.
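If you want to try it, the checkpoint can be fetched with the Hugging Face CLI (assuming huggingface_hub is installed; the local path is just an example):
# Requires: pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4 --local-dir ./Qwen2-VL-2B-Instruct-GPTQ-Int4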
BTW, I’ve successfully deployed a Docker image with the latest version of TensorRT-LLM on the Jetson Orin Nano using the official guide (https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html). For reference, I’m using a 1TB SSD as the system drive along with 64GB of SWAP.
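For anyone curious, the build-from-source route I followed was roughly along these lines; treat it as a sketch, since the exact make targets and flags come from the guide linked above and may change between releases (CUDA_ARCHS="87-real" is my assumption for Orin's SM 8.7):
# Clone the repo with submodules and LFS files
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
# Build the release image, restricting to Orin's architecture to save build time and memory
make -C docker release_build CUDA_ARCHS="87-real"
# Run the built image
make -C docker release_run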
Hi @Sesameisgod, can you tell me how you created the swap memory? Step-by-step instructions would help.
Hello, I'm currently trying to run TRT-LLM on a Nano Super 8G as well. I'm attempting to use the officially recommended Meta-Llama-3-8B-Instruct INT4-GPTQ, but it's clearly failing due to insufficient memory. I've tried adding 20GB of SWAP, but it doesn't seem to be utilized. I'd like to ask what kind of model I should switch to so that I can successfully build the engine. I don't have any other more powerful Jetson devices.
Hi, can you try to use the PyTorch backend? Also note that for the TRT backend, engine building will consume about 4x the memory of your model size.
Thanks.
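One more thing worth checking when swap seems unused: confirm it is actually active and watch whether the build touches it (standard Linux/Jetson tools; the interval value is just an example):
# Confirm the swap file/partition is registered
swapon --show
# On Jetson, tegrastats reports RAM and SWAP usage; run it in a second terminal during the build
sudo tegrastats --interval 5000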
How do I do that? I've got the same trouble.
I'm on ThorU, with DriveOS 7.0.2.0, TensorRT 10.10, and CUDA 12.8.
awesome https://nvidia.github.io/TensorRT-LLM/0.20.0/reference/support-matrix.html
Hi,
Great! Seems like you've already made Qwen2.5VL work on Thor! Is that in Pytorch Backend? Cuz I see there is no TensorRT backend support for Qwen2.5VL?
Thanks!
@Sesameisgod is there anything you can share on how you built the most recent version of TensorRT-LLM for the Jetson Orin Nano Super?