Optimize Jetson 6.2.0 Docker image with l4t-cuda base (41.7% size reduction)
Description
This PR optimizes the Jetson 6.2.0 Docker image by replacing the full `l4t-jetpack` base image with the lighter `l4t-cuda:12.6.11-runtime` base. The result is a 41.7% size reduction (14.2 GB → 8.28 GB) while maintaining full functionality and providing a newer CUDA version.
Key Improvements
Image Optimization
- Size reduction: 14.2 GB → 8.28 GB (5.92 GB savings, 41.7% reduction)
- Base image: `l4t-cuda:12.6.11-runtime` instead of `l4t-jetpack:r36.4.0`
- CUDA version: Upgraded from 12.2 to 12.6.11
- Build architecture: two-stage build (JetPack builder + minimal CUDA runtime)
Software Stack
- onnxruntime-gpu: Compiled from source with CUDA 12.6 and TensorRT support
- GDAL: 3.11.5 compiled from source using Ninja build system
- PyTorch: 2.8.0 with CUDA 12.6 support from jetson-ai-lab.io
- cuDNN: 9.3 extracted from JetPack for PyTorch compatibility
- TensorRT: FP16 acceleration enabled by default
Performance Features
- TensorRT execution provider enabled by default
- FP16 precision support for faster inference
- Engine caching to avoid recompilation on subsequent runs
- Python symlink for inference CLI compatibility
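As a quick sanity check, the active execution providers can be listed from inside a running container. This is a hedged example: the container name `inference-server` is an assumption, not something defined by this PR.

```bash
# Assumes a running container named "inference-server" (name not part of this PR).
# Prints the ONNX Runtime providers available in the image; TensorrtExecutionProvider
# should appear ahead of CUDAExecutionProvider and CPUExecutionProvider.
sudo docker exec inference-server python -c \
  "import onnxruntime as ort; print(ort.get_available_providers())"
```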
Benchmark Results
RF-DETR Base model benchmarked on NVIDIA Jetson Orin:
Command used:
`ssh roboflow@ubuntu 'sudo docker exec test-fresh inference benchmark python-package-speed -m rfdetr-base -d coco -bi 1000 -o /tmp/rfdetr_trt_benchmark.json'`
Results:
- Throughput: 27.2 FPS
- Average Latency: 36.8 ms
- Latency Std Dev: ±1.1 ms (very consistent)
- Error Rate: 0.0% (1000/1000 successful inferences)
- Percentiles:
  - P50: 37.0 ms
  - P75: 37.4 ms
  - P90: 38.6 ms
  - P95: 38.7 ms
  - P99: 38.9 ms
Test Configuration:
- Model: rfdetr-base (29M parameters)
- Dataset: COCO (8 validation images)
- Batch size: 1
- Input resolution: 560x560
- Total inferences: 1,000
- Warm-up: 10 inferences
- Execution providers: TensorRT, CUDA, CPU
Technical Details
Multi-stage Build Architecture
- Builder Stage (`l4t-jetpack:r36.4.0`):
  - Compiles GDAL 3.11.5 from source
  - Builds onnxruntime-gpu with CUDA and TensorRT support
  - Installs all Python dependencies with uv
  - Builds the inference packages (core, gpu, cli, sdk)
- Runtime Stage (`l4t-cuda:12.6.11-runtime`):
  - Minimal CUDA runtime with only the necessary libraries
  - Copies the compiled GDAL binaries
  - Copies cuDNN and TensorRT libraries from the builder
  - Copies Python packages and CLI tools
  - No development packages or build tools
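For reference, a hedged sketch of how the image might be built and its final size checked on a Jetson device; the Dockerfile path and image tag below are assumptions, not taken from this PR.

```bash
# Build on the Jetson itself (arm64); Dockerfile path and tag are assumptions.
sudo docker build \
  -f docker/dockerfiles/Dockerfile.onnx.jetson.6.2.0 \
  -t roboflow/roboflow-inference-server-jetson-6.2.0:dev .

# Only the runtime stage ends up in the final image; confirm the reduced size.
sudo docker images roboflow/roboflow-inference-server-jetson-6.2.0:dev
```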
Environment Variables
- `ONNXRUNTIME_EXECUTION_PROVIDERS=TensorrtExecutionProvider`
- `ORT_TENSORRT_FP16_ENABLE=1`
- `ORT_TENSORRT_ENGINE_CACHE_ENABLE=1`
- `ORT_TENSORRT_ENGINE_CACHE_PATH=/tmp/ort_cache`
- `REQUIRED_ONNX_PROVIDERS=TensorrtExecutionProvider`
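These defaults can be overridden when starting the container. A hedged example that disables FP16 while keeping engine caching; the image name, port, and `--runtime nvidia` flag are assumptions rather than values fixed by this PR.

```bash
# Assumed image name and port; --runtime nvidia is the usual way to expose the GPU on Jetson.
sudo docker run -d --runtime nvidia -p 9001:9001 \
  -e ORT_TENSORRT_FP16_ENABLE=0 \
  roboflow/roboflow-inference-server-jetson-6.2.0
```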
Type of change
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance improvement (reduces image size, maintains functionality)
How has this change been tested?
- Build Testing:
  - Successfully built on Jetson Orin in MAXN mode
  - Build time: ~10 minutes with a warm cache
  - All dependencies installed correctly
- Runtime Testing:
  - Container runs successfully on Jetson Orin
  - All imports work correctly
  - GPU detection and acceleration verified
  - CUDA and cuDNN available to PyTorch (see the example checks after this list)
- Benchmark Testing:
  - RF-DETR Base: 27.2 FPS @ 36.8 ms latency
  - 1,000 successful inferences with a 0% error rate
  - TensorRT acceleration confirmed working
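The runtime checks above can be reproduced with commands along these lines; a hedged sketch, assuming a container named `inference-server`.

```bash
# Verify PyTorch sees the GPU and cuDNN (container name is an assumption).
sudo docker exec inference-server python -c \
  "import torch; print(torch.cuda.is_available(), torch.backends.cudnn.version())"

# Verify the GDAL build copied from the builder stage is importable.
sudo docker exec inference-server python -c \
  "from osgeo import gdal; print(gdal.__version__)"
```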
Any specific deployment considerations
- First run will take 15+ minutes while TensorRT compiles and optimizes the models
- Subsequent runs are fast thanks to engine caching in `/tmp/ort_cache`
- Recommend using `--volume ~/.inference/cache:/tmp:rw` to persist the cache (see the example below)
- MAXN mode recommended for the fastest builds and inference
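A hedged example of a run command that persists the engine cache across container restarts; the image name, port, and `--runtime nvidia` flag are assumptions.

```bash
# Mount a host directory over /tmp so the TensorRT engine cache in /tmp/ort_cache survives restarts.
sudo docker run -d --runtime nvidia -p 9001:9001 \
  --volume ~/.inference/cache:/tmp:rw \
  roboflow/roboflow-inference-server-jetson-6.2.0
```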
Docs
N/A