Optimize Jetson 6.2.0 Docker image with l4t-cuda base (41.7% size reduction)
Description
This PR optimizes the Jetson 6.2.0 Docker image by replacing the full `l4t-jetpack` base image with the lighter `l4t-cuda:12.6.11-runtime` base. The result is a 41.7% size reduction (14.2 GB → 8.28 GB) while maintaining full functionality and providing a newer CUDA version.
Key Improvements
Image Optimization
- Size reduction: 14.2 GB → 8.28 GB (5.92 GB savings, 41.7% reduction)
- Base image: `l4t-cuda:12.6.11-runtime` instead of `l4t-jetpack:r36.4.0`
- CUDA version: Upgraded from 12.2 to 12.6.11
- Build architecture: two-stage build (JetPack builder + minimal CUDA runtime)
Software Stack
- onnxruntime-gpu: Compiled from source with CUDA 12.6 and TensorRT support
- GDAL: 3.11.5 compiled from source using Ninja build system
- PyTorch: 2.8.0 with CUDA 12.6 support from jetson-ai-lab.io
- cuDNN: 9.3 extracted from JetPack for PyTorch compatibility
- TensorRT: FP16 acceleration enabled by default
Performance Features
- TensorRT execution provider enabled by default
- FP16 precision support for faster inference
- Engine caching to avoid recompilation on subsequent runs
- Python symlink for inference CLI compatibility
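As a quick sanity check, the active execution providers can be listed from inside a running container. This is a hedged example: the container name `inference-server` is an assumption, not something defined by this PR.

```bash
# Assumes a running container named "inference-server" (name not part of this PR).
# Prints the ONNX Runtime providers available in the image; TensorrtExecutionProvider
# should appear ahead of CUDAExecutionProvider and CPUExecutionProvider.
sudo docker exec inference-server python -c \
  "import onnxruntime as ort; print(ort.get_available_providers())"
```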
Benchmark Results
RF-DETR Base model benchmarked on NVIDIA Jetson Orin:
Command used:
`ssh roboflow@ubuntu 'sudo docker exec test-fresh inference benchmark python-package-speed -m rfdetr-base -d coco -bi 1000 -o /tmp/rfdetr_trt_benchmark.json'`
Results:
- Throughput: 27.2 FPS
- Average Latency: 36.8 ms
- Latency Std Dev: ±1.1 ms (very consistent)
- Error Rate: 0.0% (1000/1000 successful inferences)
- Percentiles:
  - P50: 37.0 ms
  - P75: 37.4 ms
  - P90: 38.6 ms
  - P95: 38.7 ms
  - P99: 38.9 ms
Test Configuration:
- Model: rfdetr-base (29M parameters)
- Dataset: COCO (8 validation images)
- Batch size: 1
- Input resolution: 560x560
- Total inferences: 1,000
- Warm-up: 10 inferences
- Execution providers: TensorRT, CUDA, CPU
Technical Details
Multi-stage Build Architecture
- Builder Stage (`l4t-jetpack:r36.4.0`):
  - Compiles GDAL 3.11.5 from source
  - Builds onnxruntime-gpu with CUDA and TensorRT support
  - Installs all Python dependencies with uv
  - Builds the inference packages (core, gpu, cli, sdk)
- Runtime Stage (`l4t-cuda:12.6.11-runtime`):
  - Minimal CUDA runtime with only the necessary libraries
  - Copies the compiled GDAL binaries
  - Copies cuDNN and TensorRT libraries from the builder
  - Copies Python packages and CLI tools
  - No development packages or build tools
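For reference, a hedged sketch of how the image might be built and its final size checked on a Jetson device; the Dockerfile path and image tag below are assumptions, not taken from this PR.

```bash
# Build on the Jetson itself (arm64); Dockerfile path and tag are assumptions.
sudo docker build \
  -f docker/dockerfiles/Dockerfile.onnx.jetson.6.2.0 \
  -t roboflow/roboflow-inference-server-jetson-6.2.0:dev .

# Only the runtime stage ends up in the final image; confirm the reduced size.
sudo docker images roboflow/roboflow-inference-server-jetson-6.2.0:dev
```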
Environment Variables
- `ONNXRUNTIME_EXECUTION_PROVIDERS=TensorrtExecutionProvider`
- `ORT_TENSORRT_FP16_ENABLE=1`
- `ORT_TENSORRT_ENGINE_CACHE_ENABLE=1`
- `ORT_TENSORRT_ENGINE_CACHE_PATH=/tmp/ort_cache`
- `REQUIRED_ONNX_PROVIDERS=TensorrtExecutionProvider`
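These defaults can be overridden when starting the container. A hedged example that disables FP16 while keeping engine caching; the image name, port, and `--runtime nvidia` flag are assumptions rather than values fixed by this PR.

```bash
# Assumed image name and port; --runtime nvidia is the usual way to expose the GPU on Jetson.
sudo docker run -d --runtime nvidia -p 9001:9001 \
  -e ORT_TENSORRT_FP16_ENABLE=0 \
  roboflow/roboflow-inference-server-jetson-6.2.0
```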
Type of change
- [x] New feature (non-breaking change which adds functionality)
- [x] Performance improvement (reduces image size, maintains functionality)
How has this change been tested?
- Build Testing:
  - Successfully built on Jetson Orin in MAXN mode
  - Build time: ~10 minutes with a warm cache
  - All dependencies installed correctly
- Runtime Testing:
  - Container runs successfully on Jetson Orin
  - All imports work correctly
  - GPU detection and acceleration verified
  - CUDA and cuDNN available to PyTorch (see the example checks after this list)
- Benchmark Testing:
  - RF-DETR Base: 27.2 FPS @ 36.8 ms latency
  - 1,000 successful inferences with a 0% error rate
  - TensorRT acceleration confirmed working
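The runtime checks above can be reproduced with commands along these lines; a hedged sketch, assuming a container named `inference-server`.

```bash
# Verify PyTorch sees the GPU and cuDNN (container name is an assumption).
sudo docker exec inference-server python -c \
  "import torch; print(torch.cuda.is_available(), torch.backends.cudnn.version())"

# Verify the GDAL build copied from the builder stage is importable.
sudo docker exec inference-server python -c \
  "from osgeo import gdal; print(gdal.__version__)"
```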
Any specific deployment considerations
- First run will take 15+ minutes while TensorRT compiles and optimizes the models
- Subsequent runs are fast thanks to engine caching in `/tmp/ort_cache`
- Recommend using `--volume ~/.inference/cache:/tmp:rw` to persist the cache (see the example below)
- MAXN mode recommended for the fastest builds and inference
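A hedged example of a run command that persists the engine cache across container restarts; the image name, port, and `--runtime nvidia` flag are assumptions.

```bash
# Mount a host directory over /tmp so the TensorRT engine cache in /tmp/ort_cache survives restarts.
sudo docker run -d --runtime nvidia -p 9001:9001 \
  --volume ~/.inference/cache:/tmp:rw \
  roboflow/roboflow-inference-server-jetson-6.2.0
```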
Docs
N/A