❓ [Question] Error when exporting TorchScript model with dynamic input using Torch-TensorRT (aten::clamp issue)

Open Mmmyyym opened this issue 4 months ago • 13 comments

Environment

Libtorch 2.5.0.dev (latest nightly, built with CUDA 12.4)
CUDA 12.4
TensorRT 10.1.0.27
PyTorch 2.4.0+cu124
Torch-TensorRT 2.4.0
Python 3.12.8
Windows 10

Code

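import torch
import torch_tensorrt

# DeepLabV3Plus stands in for the reporter's DeepLabV3+ model class; its import is not shown in the original post.
model = DeepLabV3Plus(num_classes=8, backbone="mobilenet", output_stride=16)
model.eval().cuda()

# Dynamic input: height and width may vary between the min and max shapes.
dynamic_input = torch_tensorrt.Input(
    min_shape=[1, 3, 20, 20],
    opt_shape=[1, 3, 512, 512],
    max_shape=[1, 3, 2448, 2048],
    dtype=torch.float32,
)
# Static sample input used when saving as TorchScript.
static_input = [torch.randn((1, 3, 512, 512)).cuda()]

trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=[dynamic_input])
torch_tensorrt.save(trt_gm, "trt_dynamic.ts", output_format="torchscript", inputs=static_input)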

ERROR

torch_tensorrt.compile succeeds, but torch_tensorrt.save fails.

torch_tensorrt.save(trt_gm, "trt_dynamic.ts", output_format="torchscript",inputs=static_input)

    def __call__(self_, *args, **kwargs):  # noqa: B902
        # use `self_` to avoid naming collide with aten ops arguments that
        # are named "self". This way, all the aten ops can be called by kwargs.
        return self_._op(*args, **kwargs)
RuntimeError: aten::clamp() Expected a value of type 'Optional[number]' for argument 'max' but instead found type 'Tensor'.
Position: 2
Value: tensor([31], device='cuda:0')
Declaration: aten::clamp(Tensor self, Scalar? min=None, Scalar? max=None) -> Tensor
Cast error details: Cannot cast tensor([31], device='cuda:0') to number

Mmmyyym avatar Aug 19 '25 02:08 Mmmyyym

@apbose Please take a look at this. When I run this implementation of DeepLabV3+ (https://github.com/VainF/DeepLabV3Plus-Pytorch) with dynamic shapes, I get the following converter error:

  File "/home/naren/pytorch_org/tensorrt/.venv/lib/python3.13/site-packages/torch_tensorrt/dynamo/conversion/impl/upsample.py", line 38, in upsample
    layer.shape = shape
    ^^^^^^^^^^^
TypeError: (): incompatible function arguments. The following argument types are supported:
    1. (arg0: tensorrt_bindings.tensorrt.IResizeLayer, arg1: tensorrt_bindings.tensorrt.Dims) -> None

Invoked with: <tensorrt_bindings.tensorrt.IResizeLayer object at 0x7295eab9f130>, [1, 256, <tensorrt_bindings.tensorrt.ITensor object at 0x7295eab95f70>, <tensorrt_bindings.tensorrt.ITensor object at 0x7295eab8bcf0>]

The dynamic shape detection here (if has_dynamic_shape(shape):) does not seem to handle this case, where the shape list contains ITensor entries.

narendasan avatar Aug 19 '25 20:08 narendasan

    def __call__(self_, *args, **kwargs):
        try:
            return self_._op(*args, **kwargs)
        except RuntimeError as e:
            if "aten::clamp" in str(e):
                new_args = list(args)
                if len(new_args) > 2:
                    max_arg = new_args[2]
                    if (isinstance(max_arg, torch.Tensor) and max_arg.numel() == 1 and float(max_arg.item()) == 31.0):
                        new_args[2] = 31
                    if (isinstance(max_arg, torch.Tensor) and max_arg.numel() == 1 and float(max_arg.item()) == 127.0):
                        new_args[2] = 127
                return self_._op(*tuple(new_args), **kwargs)
            else:
                raise

Forcing the problematic max values to Python ints allows the .ts export to succeed. However, in C++ inference, if the input shape isn't [1,3,512,512] (e.g., [1,3,1948,1083]), the output tensors can differ from the Python results by up to 700, causing inconsistent results. With [1,3,512,512], the difference is only around 0.2.
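One possible reason for the shape dependence, offered as an assumption and not a confirmed diagnosis: 31 and 127 look like in_size - 1 index bounds for bilinear upsampling of feature maps at stride 16 and stride 4 of a 512 input (512/16 - 1 = 31, 512/4 - 1 = 127). If that is right, hard-coding them bakes the 512x512 trace shape into the exported graph. The sketch below uses a hypothetical helper, clamp_bound, and assumes each feature map side is ceil(size / stride).

```python
def clamp_bound(size: int, stride: int) -> int:
    """Largest valid index along one side of a feature map.

    Assumption (not confirmed in this thread): the side length of the
    feature map is ceil(size / stride).
    """
    feature_side = -(-size // stride)  # ceiling division
    return feature_side - 1


# Bounds for the traced 512x512 input match the hard-coded constants.
print(clamp_bound(512, 16), clamp_bound(512, 4))     # 31 127

# Bounds for the 1948x1083 input at stride 16 differ from 31.
print(clamp_bound(1948, 16), clamp_bound(1083, 16))  # 121 67
```

Under that assumption, a constant of 31 would clamp indices far too early for a 1948x1083 input, which would be consistent with the large mismatch seen for non-512 shapes.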

Mmmyyym avatar Aug 20 '25 06:08 Mmmyyym

@narendasan @apbose Hi, any updates on this? Thanks!

Mmmyyym avatar Aug 25 '25 09:08 Mmmyyym

Thanks for raising the issue! I should be able to take a look at this sometime this week.

apbose avatar Aug 25 '25 16:08 apbose

Hi @apbose , just wondering if there’s any solution or workaround for this. Really appreciate it!

Mmmyyym avatar Sep 05 '25 08:09 Mmmyyym

Apologies, I could not take a look last week. I am currently on leave, but I will take a look the week after, when I am back.

apbose avatar Sep 05 '25 10:09 apbose

Started on the repro.

apbose avatar Sep 17 '25 00:09 apbose

Update: seeing the same error as

(arg0: tensorrt_bindings.tensorrt.IResizeLayer, arg1: tensorrt_bindings.tensorrt.Dims) -> None

Invoked with: <tensorrt_bindings.tensorrt.IResizeLayer object at 0x732c47bf62f0>, [1, 256, <tensorrt_bindings.tensorrt.ITensor object at 0x732c47b3d030>, <tensorrt_bindings.tensorrt.ITensor object at 0x732c47cdf230>]

While executing %upsample_bilinear2d : [num_users=1] = call_function[target=torch.ops.aten.upsample_bilinear2d.vec](args = (%relu_5, [%add_1151, %add_1153], False, None), kwargs = {})

Looking into it

apbose avatar Sep 18 '25 23:09 apbose

Thank you for confirming this issue. I’ll keep track of the progress here and test again once a fix or workaround is available.

Mmmyyym avatar Sep 19 '25 01:09 Mmmyyym

Currently we have code that detects dynamic shape on the basis of -1 in the shape, but in this case the shape is [1, 256, <tensorrt_bindings.tensorrt.ITensor object at 0x74a251c54c70>, <tensorrt_bindings.tensorrt.ITensor object at 0x74a251c88db0>]. We need to support these cases.
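A minimal sketch of the kind of check this implies, treating any non-int entry as dynamic in addition to -1. The names has_dynamic_entries and FakeITensor are hypothetical and exist only so the sketch runs without TensorRT; this is not the library's has_dynamic_shape implementation.

```python
from typing import Any, Sequence


class FakeITensor:
    """Hypothetical stand-in for tensorrt.ITensor (assumption for this sketch)."""


def has_dynamic_entries(shape: Sequence[Any]) -> bool:
    """Return True if any dimension is only known at runtime."""
    for dim in shape:
        if isinstance(dim, int):
            # A plain int is static unless it is the -1 placeholder.
            if dim == -1:
                return True
        else:
            # Anything that is not a plain int (for example an ITensor)
            # is a dimension computed at runtime.
            return True
    return False


print(has_dynamic_entries([1, 256, 32, 32]))                        # False
print(has_dynamic_entries([1, 256, -1, -1]))                        # True
print(has_dynamic_entries([1, 256, FakeITensor(), FakeITensor()]))  # True
```

A check based only on -1 would report the third shape as static, which matches the failure in the traceback above where a list containing ITensor objects is assigned to layer.shape.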

apbose avatar Sep 23 '25 07:09 apbose

Thanks for following up on this. Looks like there’s no workaround for this right now. Looking forward to support for ITensor dynamic shape in a future release.

Mmmyyym avatar Sep 23 '25 09:09 Mmmyyym

The code updates work correctly.
Only this part in upsample.py needs adjustment:

layer.shape = to_trt_shape_tensor(ctx, target, name, shape)
layer.set_input(1, layer.shape)

should be changed to:

trt_shape = to_trt_shape_tensor(ctx, target, name, shape)
if isinstance(trt_shape, list):
    layer.shape = trt_shape
else:
    layer.set_input(1, trt_shape)

After this fix, the model can be exported and runs correctly. The results are consistent overall, though some tensor values still show minor differences, which do not affect the final output. Further precision alignment tests have not yet been performed.
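For the precision alignment still to be done, one rough way to compare the Python and exported outputs is sketched below. The helper names (max_abs_diff, label_agreement) are hypothetical and the per-pixel scores are illustrative values, not measurements from this model.

```python
from typing import List, Sequence

Scores = List[Sequence[float]]  # one sequence of class scores per pixel


def max_abs_diff(a: Scores, b: Scores) -> float:
    """Largest element-wise absolute difference between two outputs."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (the predicted class)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def label_agreement(a: Scores, b: Scores) -> float:
    """Fraction of pixels whose predicted class is the same in both outputs."""
    matches = sum(argmax(ra) == argmax(rb) for ra, rb in zip(a, b))
    return matches / len(a)


# Illustrative data only: two pixels, three classes each.
reference = [[0.10, 2.00, -1.00], [3.00, 0.50, 0.20]]
exported = [[0.12, 1.97, -1.01], [2.98, 0.52, 0.21]]

print(max_abs_diff(reference, exported) < 0.05)  # True
print(label_agreement(reference, exported))      # 1.0
```

Reporting both numbers separates small numeric drift in the scores from changes in the predicted class map, which is the distinction drawn above between minor tensor differences and the final output.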

Mmmyyym avatar Oct 09 '25 06:10 Mmmyyym

Thanks for the input! I will make this change, clean up the code, and add a test case.

apbose avatar Oct 14 '25 21:10 apbose

The above is completed. Closing! Please reopen in case you find any issues.

apbose avatar Dec 16 '25 17:12 apbose