
Optimize Jetson 6.2.0 Docker image with l4t-cuda base (41.7% size reduction)


Description

This PR optimizes the Jetson 6.2.0 Docker image by replacing the full l4t-jetpack base image with a lighter l4t-cuda:12.6.11-runtime base. This results in a 41.7% size reduction (14.2 GB → 8.28 GB) while maintaining full functionality and providing a newer CUDA version.

Key Improvements

Image Optimization

  • Size reduction: 14.2 GB → 8.28 GB (5.92 GB savings, 41.7% reduction)
  • Base image: l4t-cuda:12.6.11-runtime instead of l4t-jetpack:r36.4.0
  • CUDA version: Upgraded from 12.2 to 12.6.11
  • Build architecture: Docker multi-stage build (JetPack builder stage + minimal CUDA runtime stage)

Software Stack

  • onnxruntime-gpu: Compiled from source with CUDA 12.6 and TensorRT support
  • GDAL: 3.11.5 compiled from source using Ninja build system
  • PyTorch: 2.8.0 with CUDA 12.6 support from jetson-ai-lab.io
  • cuDNN: 9.3 extracted from JetPack for PyTorch compatibility
  • TensorRT: FP16 acceleration enabled by default

Performance Features

  • TensorRT execution provider enabled by default
  • FP16 precision support for faster inference
  • Engine caching to avoid recompilation on subsequent runs
  • Python symlink for inference CLI compatibility
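
A quick way to confirm that the TensorRT execution provider is actually available inside a running container (the container name here is illustrative):

```bash
# Container name is illustrative -- substitute your own.
# TensorrtExecutionProvider should appear first in the list.
docker exec inference-server python -c \
  "import onnxruntime; print(onnxruntime.get_available_providers())"
```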

Benchmark Results

RF-DETR Base model benchmarked on NVIDIA Jetson Orin:

Command used:

```bash
ssh roboflow@ubuntu 'sudo docker exec test-fresh inference benchmark python-package-speed -m rfdetr-base -d coco -bi 1000 -o /tmp/rfdetr_trt_benchmark.json'
```

Results:

  • Throughput: 27.2 FPS
  • Average Latency: 36.8 ms
  • Latency Std Dev: ±1.1 ms (very consistent)
  • Error Rate: 0.0% (1000/1000 successful inferences)
  • Percentiles:
    • P50: 37.0 ms
    • P75: 37.4 ms
    • P90: 38.6 ms
    • P95: 38.7 ms
    • P99: 38.9 ms

Test Configuration:

  • Model: rfdetr-base (29M parameters)
  • Dataset: COCO (8 validation images)
  • Batch size: 1
  • Input resolution: 560x560
  • Total inferences: 1,000
  • Warm-up: 10 inferences
  • Execution providers: TensorRT, CUDA, CPU

Technical Details

Multi-stage Build Architecture

  1. Builder Stage (l4t-jetpack:r36.4.0):

    • Compiles GDAL 3.11.5 from source
    • Builds onnxruntime-gpu with CUDA and TensorRT support
    • Installs all Python dependencies with uv
    • Builds inference packages (core, gpu, cli, sdk)
  2. Runtime Stage (l4t-cuda:12.6.11-runtime):

    • Minimal CUDA runtime with only necessary libraries
    • Copies compiled GDAL binaries
    • Copies cuDNN and TensorRT libs from builder
    • Copies Python packages and CLI tools
    • No development packages or build tools
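
For orientation, the two stages fit together roughly like this. This is a minimal sketch, not the exact Dockerfile in this PR; the copy paths in particular are illustrative:

```dockerfile
# Stage 1 -- full JetPack image with compilers and CUDA/TensorRT dev packages
FROM nvcr.io/nvidia/l4t-jetpack:r36.4.0 AS builder
# ... compile GDAL 3.11.5, build onnxruntime-gpu, install Python deps with uv ...

# Stage 2 -- minimal CUDA runtime; no build tools ship in the final image
FROM nvcr.io/nvidia/l4t-cuda:12.6.11-runtime
# Copy only the compiled artifacts the runtime actually needs
COPY --from=builder /usr/local/lib/libgdal* /usr/local/lib/
COPY --from=builder /usr/lib/aarch64-linux-gnu/libcudnn* /usr/lib/aarch64-linux-gnu/
COPY --from=builder /usr/lib/aarch64-linux-gnu/libnvinfer* /usr/lib/aarch64-linux-gnu/
```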

Environment Variables

  • ONNXRUNTIME_EXECUTION_PROVIDERS=TensorrtExecutionProvider
  • ORT_TENSORRT_FP16_ENABLE=1
  • ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
  • ORT_TENSORRT_ENGINE_CACHE_PATH=/tmp/ort_cache
  • REQUIRED_ONNX_PROVIDERS=TensorrtExecutionProvider
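
Because these are plain environment variables, they can be overridden per-container at run time; for example, to disable FP16 while debugging accuracy (the image name below is illustrative):

```bash
# Image name is illustrative -- use the tag you built or pulled.
docker run --rm --runtime nvidia \
  -e ORT_TENSORRT_FP16_ENABLE=0 \
  roboflow/roboflow-inference-server-jetson-6.2.0
```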

Type of change

  • [x] New feature (non-breaking change which adds functionality)
  • [x] Performance improvement (reduces image size, maintains functionality)

How has this change been tested?

  1. Build Testing:

    • Successfully built on Jetson Orin in MAXN mode
    • Build time: ~10 minutes with warm cache
    • All dependencies installed correctly
  2. Runtime Testing (see the verification sketch after this list):

    • Container runs successfully on Jetson Orin
    • All imports working correctly
    • GPU detection and acceleration verified
    • CUDA and cuDNN available to PyTorch
  3. Benchmark Testing:

    • RF-DETR Base: 27.2 FPS @ 36.8ms latency
    • 1000 successful inferences with 0% error rate
    • TensorRT acceleration confirmed working
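
The runtime checks in item 2 were along these lines; the commands are a representative sketch rather than a verbatim test log:

```bash
# Inside the container: confirm PyTorch sees the GPU and cuDNN,
# and that GDAL imports at the expected version.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch; print(torch.backends.cudnn.version())"
python -c "from osgeo import gdal; print(gdal.__version__)"
```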

Any specific deployment considerations

  • The first run takes 15+ minutes while TensorRT compiles and optimizes model engines
  • Subsequent runs are fast thanks to engine caching in /tmp/ort_cache
  • Recommended: mount --volume ~/.inference/cache:/tmp:rw to persist the cache across container restarts (see the example after this list)
  • MAXN mode recommended for fastest builds and inference
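
Putting the cache recommendation together with the runtime flags, a typical invocation looks like this (the image name is illustrative):

```bash
# Persist the TensorRT engine cache across restarts so the 15+ minute
# first-run compilation only happens once.
docker run --rm --runtime nvidia \
  --volume ~/.inference/cache:/tmp:rw \
  roboflow/roboflow-inference-server-jetson-6.2.0
```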

Docs

N/A
