🐛 [Bug] torch.ops.aten.remainder.Scalar seems not working with big int
Bug Description
torch.ops.aten.remainder.Scalar seems to return an fmod-like (incorrect) result when the input number is large.
To Reproduce
Save the script below and run it:
```python
import torch
import torch.nn as nn

a = torch.tensor([[5950571286963681280]]).cuda()
example_args = (a,)


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()

    def forward(self, x):
        return torch.remainder(x, 196613)


model = ToyModel().eval().cuda()

with torch.no_grad():
    ep = torch.export.export(model, args=example_args)

from torch_tensorrt.dynamo._compiler import compile as dynamo_compile
from torch_tensorrt import logging as ts_logging

with ts_logging.debug():
    compiled = dynamo_compile(
        exported_program=ep,
        disable_tf32=True,
        inputs=example_args,
        min_block_size=1,
        debug=True,
    )

with torch.no_grad():
    print(compiled(*example_args))
```
Expected behavior
Expected to return a result like
tensor([[75722]], device='cuda:0')
However, the printed result is
tensor([[-120891]], device='cuda:0')
My full execution log is attached as remainder_error.log
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- Torch-TensorRT Version (e.g. 1.0.0): 10.1.0
- PyTorch Version (e.g. 1.0): 2.4.1+cu124
- CPU Architecture: x86_64
- OS (e.g., Linux): linux
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source):
- Are you using local sources or building from archives:
- Python version: 3.11.9
- CUDA version: 12.6
- GPU models and configuration: nvidia L4
- Any other relevant information:
Additional context
BTW,
- the converted version of torch.ops.aten.remainder.Scalar seems not even as fast as the original op.
- it seems torch.ops.aten.remainder.Scalar works with ints that are not that big. Not sure if this is caused by int64.
Thanks for pointing this out.
I looked into this a bit.
TRT does not support the fmod operation directly, so in Torch-TensorRT we implement remainder as
fmod(fmod(dividend, divisor) + divisor, divisor)
and fmod in turn is sub(dividend, prod(trunc_div(dividend, divisor), divisor)).
Generally dividend > prod(trunc_div(dividend, divisor), divisor).
But with large integers, trunc_div(dividend, divisor) in this case evaluates to 30265401409536 (it should be 30265401000766), which makes prod(trunc_div(dividend, divisor), divisor) > dividend and yields the negative number.
As you said, 5950571286963681280 falls in the signed int64 range, so I am not sure why TRT is returning reduced precision. I can get it clarified further with the TRT team. It must be a loss of accuracy in the computation. Please note that float32 would also lead to accuracy loss.
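For reference, a minimal sketch of the precision issue (this only illustrates why float32 intermediates cannot represent this value; it does not claim to reproduce TRT's exact kernel behavior):

```python
import torch

dividend = torch.tensor([[5950571286963681280]], dtype=torch.int64)
divisor = 196613

# Exact int64 arithmetic in eager PyTorch gives the expected remainder.
print(torch.remainder(dividend, divisor))    # tensor([[75722]])

# float32 has a 24-bit mantissa, so merely casting the dividend changes it
# by an error on the order of 1e11 at this magnitude.
as_f32 = dividend.to(torch.float32)
print(as_f32.to(torch.int64) - dividend)     # nonzero rounding error

# Decomposing remainder via trunc_div/prod/sub on float32 intermediates
# (as in the fmod-based lowering described above) therefore cannot
# recover the exact int64 result.
q = torch.trunc(as_f32 / divisor)            # precision-limited quotient
fmod = as_f32 - q * divisor                  # sub(dividend, prod(trunc_div, divisor))
rem = (fmod + divisor) - torch.trunc((fmod + divisor) / divisor) * divisor
print(rem)                                    # far from the expected 75722
```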
Thanks @apbose for your help. I tried exporting this graph to ONNX and compiling it with trtexec, and it shows the same issue. The result I get this way is -80369420288.
I have attached my exported ONNX model in scalar.zip
What is the suggested way to deal with these big numbers? Do you have any suggestions?
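One possible interim workaround, just a sketch and only if falling back to eager for this op is acceptable (it assumes the torch_executed_ops setting is available in your torch_tensorrt version), is to exclude the remainder op from TRT conversion so it runs in PyTorch with exact int64 arithmetic:

```python
import torch
from torch_tensorrt.dynamo._compiler import compile as dynamo_compile

# Reusing `ep` and `example_args` from the repro script above.
compiled = dynamo_compile(
    exported_program=ep,
    inputs=example_args,
    min_block_size=1,
    # Assumption: keeping aten.remainder.Scalar out of the TRT engine
    # makes it execute in eager PyTorch with exact int64 semantics.
    torch_executed_ops={torch.ops.aten.remainder.Scalar},
)

with torch.no_grad():
    print(compiled(*example_args))  # expected to fall back to eager for the remainder op
```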