
🐛 [Bug] Encountered bug when using Torch-TensorRT (We don't have an op for aten::floor_divide but it isn't a special case)

Open ghazalehtrb opened this issue 3 years ago • 7 comments

Bug Description

Hi, I'm trying to compile a model with torch_tensorrt. I was able to successfully create the scripted model but when compiling it I'm getting the following error:

INFO: [Torch-TensorRT] - ir was set to default, using TorchScript as ir
DEBUG: [Torch-TensorRT] - Settings requested for Lowering:
    torch_executed_modules: [
    ]
Traceback (most recent call last):
  File "test.py", line 103, in <module>
    trt_model = torch_tensorrt.compile(scripted_model,
  File "/media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/test_env/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 115, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/test_env/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 113, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":607, please report a bug to PyTorch. We don't have an op for aten::floor_divide but it isn't a special case.  Argument types: int, int, 

Candidates:
	aten::floor_divide(Tensor self, Tensor other) -> (Tensor)
	aten::floor_divide.Scalar(Tensor self, Scalar other) -> (Tensor)
	aten::floor_divide.out(Tensor self, Tensor other, *, Tensor(a!) out) -> (Tensor(a!))

I don't know yet exactly which part of the model is causing this error; I'll post a simplified version once I figure it out. However, I believe the torch_tensorrt 1.1.0 release is supposed to support floor_divide.

This is what I'm doing to compile the model:

model.eval().cuda()
scripted_model = torch.jit.script(model)

with torch_tensorrt.logging.debug():
    trt_model = torch_tensorrt.compile(
        scripted_model,
        inputs=[torch_tensorrt.Input((1, 3, 16, 344, 344))],
        enabled_precisions={torch.half},
        workspace_size=1 << 20,
        truncate_long_and_double=True,
        require_full_compilation=False,  # True
    )

Expected behavior

I was expecting floor_divide to be supported in the 1.1.0 release based on the information given here: https://github.com/pytorch/TensorRT/releases

Environment

  • Torch-TensorRT Version: 1.1.0
  • PyTorch Version: 1.11.0+cu113
  • CPU Architecture: x86_64
  • OS: Ubuntu 20.04
  • How you installed PyTorch: pip
  • Python version: 3.8
  • CUDA version: 11.3
  • GPU models and configuration: NVIDIA GeForce RTX 3070

ghazalehtrb avatar Aug 24 '22 13:08 ghazalehtrb

From what I can tell the issue is that there is some operation in your model of the form:

aten::floor_divide(int self, int other) -> ...

This does not seem to be a valid TorchScript operator, which is why PyTorch (not necessarily Torch-TensorRT) is reporting this issue.

I took a look at PyTorch and saw there is an aten::floordiv.int operator, which would make sense for the input types you have. The question is why this model has floor_divide and not floordiv, if that is indeed what the operation is supposed to be. Perhaps we are inserting it erroneously in some lowering pass (this is just a theory based on limited info).
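As a quick illustration (my own sketch, not taken from the reporter's model), scripting a plain integer floor division emits the registered aten::floordiv.int operator rather than the int-typed aten::floor_divide from the error above:

```python
import torch

@torch.jit.script
def int_floordiv(a: int, b: int) -> int:
    # // on ints should lower to the registered aten::floordiv.int
    # operator, not the Tensor-only aten::floor_divide from the error
    return a // b

print(int_floordiv.graph)  # the printed graph should contain aten::floordiv
```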

Some quick debugging steps you can take: grep for instances of aten::floor_divide in the debug logs, specifically in the lowered graph. TorchScript includes source locations, so that may help you narrow down which part of your code is emitting aten::floor_divide.

narendasan avatar Aug 24 '22 21:08 narendasan

This is the only lowering pass that uses floor_divide, but it seems to use the Tensor variant, so it is probably not the root cause: https://cs.github.com/pytorch/TensorRT/blob/679ea2179aaaf28fd16203d610315ddf9ea8dfe8/core/lowering/passes/reduce_remainder.cpp?q=repo%3Apytorch%2Ftensorrt+aten%3A%3Afloor_divide+language%3AC%2B%2B

narendasan avatar Aug 24 '22 21:08 narendasan

@narendasan Thank you for your response. So far it seems like the modulo operator (%) is what's causing the issue.

The following network gives the same error:

import torch
from torch import nn, Tensor

class SomeNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x: Tensor) -> Tensor:
        input_shape = x.shape
        if input_shape[2] % 2 == 0:
            return x
        else:
            return torch.tensor(0)
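For reference, the lowered TorchScript graph for this repro can be printed directly (a standalone sketch; the module is condensed here so the snippet runs on its own):

```python
import torch
from torch import nn, Tensor

class SomeNet(nn.Module):
    """Same repro as above, redefined so this snippet runs standalone."""

    def forward(self, x: Tensor) -> Tensor:
        # % on an int taken from the shape is what triggers the assert
        if x.shape[2] % 2 == 0:
            return x
        return torch.tensor(0)

scripted = torch.jit.script(SomeNet())
# The int % shows up as aten::remainder in the printed graph
print(scripted.graph)
```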

ghazalehtrb avatar Aug 25 '22 00:08 ghazalehtrb

This is the torchscript graph I am getting back (PyTorch 1.12.1)

torchtrt39 ❯ python /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py
graph(%self : __torch__.SomeNet,
      %x.1 : Tensor):
  %16 : bool = prim::Constant[value=0]()
  %14 : NoneType = prim::Constant()
  %5 : int = prim::Constant[value=2]() # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:23
  %10 : int = prim::Constant[value=0]() # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:33
  %input_shape.1 : int[] = aten::size(%x.1) # <string>:13:9
  %8 : int = aten::__getitem__(%input_shape.1, %5) # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:11
  %9 : int = aten::remainder(%8, %5) # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:11
  %11 : bool = aten::eq(%9, %10) # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:11
  %26 : Tensor = prim::If(%11) # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:12:8
    block0():
      -> (%x.1)
    block1():
      %17 : Tensor = aten::tensor(%10, %14, %14, %16) # /Users/naren/Developer/py/pytorch_org/tensorrt/experiments/1305.py:15:18
      -> (%17)
  return (%26)

I don't see the floor_divide in there. Maybe this changed in 1.12?

narendasan avatar Aug 25 '22 19:08 narendasan

Seems like the same graph from PyTorch 1.11

narendasan avatar Aug 25 '22 20:08 narendasan

@narendasan Sorry for the late reply!

Yes, I'm getting the same graph, but Torch-TensorRT still gives the same error.

graph(%self : __torch__.MoViNet_pytorch.movinets.models.SomeNet,
      %x.1 : Tensor):
  %16 : bool = prim::Constant[value=0]()
  %14 : NoneType = prim::Constant()
  %5 : int = prim::Constant[value=2]() # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:23
  %10 : int = prim::Constant[value=0]() # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:33
  %input_shape.1 : int[] = aten::size(%x.1) # <string>:13:9
  %8 : int = aten::__getitem__(%input_shape.1, %5) # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:11
  %9 : int = aten::remainder(%8, %5) # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:11
  %11 : bool = aten::eq(%9, %10) # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:11
  %26 : Tensor = prim::If(%11) # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:11:8
    block0():
      -> (%x.1)
    block1():
      %17 : Tensor = aten::tensor(%10, %14, %14, %16) # /media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/MoViNet_pytorch/movinets/models.py:14:18
      -> (%17)
  return (%26)

Traceback (most recent call last):
  File "test.py", line 112, in <module>
    trt_model = torch_tensorrt.compile(model,
  File "/media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/test_env/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 115, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/media/andrea/Disk_21/Desktop/ARES/leav-action-recognition-pipeline/test_env/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 113, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":607, please report a bug to PyTorch. We don't have an op for aten::floor_divide but it isn't a special case.  Argument types: int, int, 

Candidates:
	aten::floor_divide(Tensor self, Tensor other) -> (Tensor)
	aten::floor_divide.Scalar(Tensor self, Scalar other) -> (Tensor)
	aten::floor_divide.out(Tensor self, Tensor other, *, Tensor(a!) out) -> (Tensor(a!))

I replaced % with the following function and I can run the code without error now!

def modulo(a: int, b: int) -> int:
    return int(a - b * torch.floor(torch.div(a, b)))
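As a sanity check (my addition, not part of the original comment), the workaround agrees with Python's % for both positive and negative operands:

```python
import torch

def modulo(a: int, b: int) -> int:
    # Floor-mod via Tensor ops, sidestepping the int % operator that
    # lowered into the unsupported int-typed aten::floor_divide
    return int(a - b * torch.floor(torch.div(a, b)))

# compare against Python's built-in modulo
for a, b in [(5, 2), (4, 2), (16, 3), (-3, 2)]:
    assert modulo(a, b) == a % b
```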

ghazalehtrb avatar Aug 29 '22 14:08 ghazalehtrb

@narendasan seems like a lowering pass could be a good WAR (workaround) here.

ncomly-nvidia avatar Aug 29 '22 16:08 ncomly-nvidia

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] avatar Dec 02 '22 00:12 github-actions[bot]