🐛 [Bug] Exporting an engine with `hardware_compatible` does not create a hardware-compatible engine
Bug Description
from tensorrt import Logger, Runtime
from torch import randn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torch_tensorrt import convert_method_to_trt_engine
# Create model
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()
example_input = randn(1, 3, 224, 224)
# Create TRT engine
engine_bytes = convert_method_to_trt_engine(
    model,
    ir="dynamo",
    inputs=[example_input],
    version_compatible=True,
    hardware_compatible=True,
    require_full_compilation=True
)
# Check hardware compat
logger = Logger(Logger.WARNING)
runtime = Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)
print("Hardware compat level:", engine.hardware_compatibility_level)
# prints: Hardware compat level: HardwareCompatibilityLevel.NONE
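To narrow down whether the flag is being dropped inside the Torch-TensorRT conversion path or by TensorRT itself, here is a minimal sketch that sets the level directly through the TensorRT builder and reads it back (assumptions: TensorRT 10.x, and a trivial identity network standing in for the real model). If this prints AMPERE_PLUS, the drop most likely happens on the torch_tensorrt side:

import tensorrt as trt

# Build a trivial identity network with hardware compatibility requested
# directly on the TensorRT builder config (no Torch-TensorRT involved).
trt_logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(trt_logger)
network = builder.create_network()
inp = network.add_input("x", trt.float32, (1, 3, 224, 224))
identity = network.add_identity(inp)
network.mark_output(identity.get_output(0))

config = builder.create_builder_config()
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

plan = builder.build_serialized_network(network, config)
direct_engine = trt.Runtime(trt_logger).deserialize_cuda_engine(plan)
print("Direct TRT build:", direct_engine.hardware_compatibility_level)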
I am running on an A100 (sm_80; a quick device check is sketched after the settings dump below). I also see the correct hardware_compatible flag being passed down to the C++ side, per the CompilationSettings dump from the torch-tensorrt logger:
CompilationSettings(
enabled_precisions={<dtype.f32: 7>},
workspace_size=1073741824,
min_block_size=5,
torch_executed_ops=set(),
pass_through_build_failures=False,
max_aux_streams=None,
version_compatible=True,
optimization_level=3,
use_python_runtime=False,
truncate_double=False,
use_fast_partitioner=True,
enable_experimental_decompositions=False,
device=Device(type=DeviceType.GPU, gpu_id=0),
require_full_compilation=True,
disable_tf32=False,
assume_dynamic_shape_support=False,
sparse_weights=False,
engine_capability=<EngineCapability.STANDARD: 1>,
num_avg_timing_iters=1,
dla_sram_size=1048576,
dla_local_dram_size=1073741824,
dla_global_dram_size=536870912,
dryrun=False,
hardware_compatible=True,
timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin',
lazy_engine_init=False,
cache_built_engines=False,
reuse_cached_engines=False,
use_explicit_typing=False,
use_fp32_acc=False,
refit_identical_engine_weights=False,
strip_engine_weights=False,
immutable_weights=True,
enable_weight_streaming=False,
enable_cross_compile_for_windows=False,
tiling_optimization_level='none',
l2_limit_for_tiling=-1,
use_distributed_mode_trace=False,
offload_module_to_cpu=False
)
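For reference, a quick check of the device (assumes a CUDA-enabled PyTorch build; an A100 reports compute capability (8, 0), i.e. sm_80, so AMPERE_PLUS should be supported):

import torch

# Confirm the GPU and its compute capability (A100 -> (8, 0), i.e. sm_80 / Ampere).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))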
Any ideas?
To Reproduce
Steps to reproduce the behavior:
- Run the Python script above
Expected behavior
The deserialized engine should report a hardware compatibility level of AMPERE_PLUS.
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- Torch-TensorRT Version (e.g. 1.0.0): 2.9.0
- PyTorch Version (e.g. 1.0): 2.9.0
- CPU Architecture: x86_64
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source):
- Are you using local sources or building from archives: No
- Python version: 3.12
- CUDA version: 12.8
- GPU models and configuration: Nvidia A100
- Any other relevant information: None
I tried on an A40 machine with
torch_tensorrt 2.10.0.dev0+29002ed
tensorrt 10.13.3.9
and it worked. I will also try release 2.9 on an A100 and let you know.
@apbose thanks for the help.
Any updates on this?
I could repro this on an A100 with Torch-TensorRT 2.9. I need to look into this.
That's good. Since the flag worked on torch_tensorrt==2.10.0.dev, do you have any idea when 2.10 might be released? We need a fix ASAP. Thanks for your help on this.
Yes, this should work in the upcoming 2.10 release, which is planned for Jan 2026. Meanwhile, you can use the nightly container ghcr.io/pytorch/tensorrt/torch_tensorrt:nightly.
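A quick way to confirm the nightly build is active inside that container before re-running the repro script above:

# Confirm the nightly Torch-TensorRT build is the one being imported.
import torch_tensorrt
print(torch_tensorrt.__version__)  # expect a 2.10.0.dev* version string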