
🐛 [Bug] RuntimeError when saving a compiled nn.Module

Open LucaBonfiglioli opened this issue 3 years ago • 6 comments

Bug Description

I get a RuntimeError when I try to save a compiled Torch-TensorRT module. The error message says to report a bug, and I found no other issues of this kind.

To Reproduce

Unfortunately, I am not allowed to publish the full code that raises this error, but I can describe what I am doing in general terms to give you some hints.

I have a very complex nn.Module with a mixture of neural and algorithmic parts. This module can be exported only via torch.jit.script, because many of the submodules have control statements that depend on the input data (i.e. they are not fully traceable). So what I am doing is selecting the traceable submodules and compiling them individually with TensorRT, then replacing the original submodules with the compiled ones and passing the whole thing to torch.jit.script. This way I get a single, compiled mixture of TorchScript and TensorRT.

This is what I do to compile the models with TRT:

    def _trt_compile_core(self, core: CoreModel, dummy_input: Tensor) -> CoreModel:
        trace_core = torch.jit.trace(core, dummy_input, strict=False).eval()  # type: ignore
        trt_core = torch_tensorrt.ts.compile(
            trace_core,
            inputs=[torch_tensorrt.Input(dummy_input.shape, dtype=torch.float32)],
            device={
                "device_type": torch_tensorrt.DeviceType.GPU,
                "gpu_id": 0,
                "dla_core": 0,
                "allow_gpu_fallback": True,
            },
            truncate_long_and_double=True,
        )

        return trt_core
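For context, a minimal, self-contained sketch of this workflow (with hypothetical toy modules, not the actual private code) would look roughly like this:

    import torch
    import torch.nn as nn
    import torch_tensorrt

    class Block(nn.Module):
        """A traceable, purely neural submodule (hypothetical)."""
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 8, 3, padding=1)

        def forward(self, x):
            return torch.relu(self.conv(x))

    class Model(nn.Module):
        """Mixture of neural parts and data-dependent control flow (hypothetical)."""
        def __init__(self):
            super().__init__()
            self.block = Block()

        def forward(self, x):
            y = self.block(x)
            if float(y.mean()) > 0.0:  # data-dependent branch -> needs scripting
                y = y * 2
            return y

    model = Model().cuda().eval()
    dummy = torch.randn(1, 3, 64, 64).cuda()

    # Compile only the traceable submodule with Torch-TensorRT, then swap it in.
    traced = torch.jit.trace(model.block, dummy, strict=False).eval()
    model.block = torch_tensorrt.ts.compile(
        traced,
        inputs=[torch_tensorrt.Input(dummy.shape, dtype=torch.float32)],
        truncate_long_and_double=True,
    )

    # Script the mixed module (plain nn.Modules + TRT-compiled TorchScript) and save.
    # The save() call is where the INTERNAL ASSERT reported below is raised.
    scripted = torch.jit.script(model)
    scripted.save("/tmp/trt.jit")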

The resulting module is tested before saving, and it works perfectly (1e-5 relative error) when compared to the original PyTorch module (also with a massive speedup!), so we can exclude weird compilation errors. I also checked that the compiled module retains the control logic of the original module, and yes, everything works fine.

The problem arises whenever I try to save it by calling module.save(); then I get this error:

Traceback (most recent call last):
  File "tensorrt/tensorrt_export.py", line 104, in <module>
    inference_trt.save("/tmp/trt.jit")
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 714, in save
    return self._c.save(str(f), **kwargs)
RuntimeError: method.qualname() == QualifiedName(selfClass->name()->qualifiedName(), methodName) INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/torch/csrc/jit/serialization/python_print.cpp":1137, please report a bug to PyTorch.

Expected behavior

Saving the model without raising a RuntimeError.

Environment

I am using the nvcr.io/nvidia/pytorch:22.06-py3 Docker image to run these tests. I applied no changes to the Docker container besides installing my own Python package.

  • Torch-TensorRT Version (e.g. 1.0.0): 8.2.5.1
  • PyTorch Version (e.g. 1.0): 1.13.0a0+340c412
  • CPU Architecture: amd64
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): pre-installed with the docker container.
  • Python version: 3.8.13
  • CUDA version: 11.4 - pre-installed with the docker container.
  • GPU models and configuration: Nvidia RTX 2080Ti
  • Any other relevant information: I had to manually uninstall and reinstall the opencv-python-headless package because it came with a circular import error on module init 💩

LucaBonfiglioli avatar Aug 01 '22 15:08 LucaBonfiglioli

Is there any way to provide a minimal repro for this? This should be a supported use case. The problem might not be with the Torch-TRT serialization but with trying to use torch.jit.save on a submodule that is still a plain nn.Module, but that is just a guess.

narendasan avatar Aug 01 '22 16:08 narendasan

Why would there be an nn.Module among the submodules after calling torch.jit.script?

Anyway, I am currently trying to simplify the code while still reproducing the bug to provide you with a minimal example.

LucaBonfiglioli avatar Aug 02 '22 07:08 LucaBonfiglioli

Correct me if I am not understanding your workflow properly, but it seems like you have an nn.Module with some submodules; you pick some submodules, trace them, and compile them with Torch-TRT; then you script the whole thing (nn.Modules + Torch-TRT TorchScript modules) and try to save the resulting TorchScript? Is there a reason scripting the full model and then passing that full model to Torch-TRT doesn't work?
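A sketch of that alternative (assuming the same model and dummy input as in the snippet above; require_full_compilation=False lets unsupported ops fall back to TorchScript) would be roughly:

    scripted = torch.jit.script(model)  # script the full model first
    trt_model = torch_tensorrt.ts.compile(
        scripted,
        inputs=[torch_tensorrt.Input(dummy.shape, dtype=torch.float32)],
        truncate_long_and_double=True,
        require_full_compilation=False,  # unsupported ops stay in TorchScript
    )
    trt_model.save("/tmp/trt_full.jit")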

narendasan avatar Aug 02 '22 16:08 narendasan

@narendasan Yes I think you understood correctly.

The main reason I don't script the whole thing before compiling with TRT is that making it compile with TRT would require very invasive changes across the whole codebase, which I am not going to do unless truly necessary, and not just for a performance test. Example: I need to work with truncated longs and doubles, so I cannot use torch.gather since it requires int64 tensors; as a result, I would have to re-implement a torch.gather that works with int32. There are also other cases in which I cannot compile with TRT, but only with torch.jit.script.
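For illustration, a rough sketch of the kind of reimplementation I mean (a gather along the last dimension that avoids int64 indices by building a one-hot mask from a comparison; hypothetical, not the code I actually use) would be:

    import torch

    def gather_last_dim_int32(x: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
        """Like torch.gather(x, -1, index) for 2D x, but accepts an int32 index
        by building a one-hot mask via comparison instead of integer indexing."""
        n = x.shape[-1]
        positions = torch.arange(n, dtype=torch.int32, device=x.device)          # (N,)
        one_hot = (positions.view(1, 1, n) == index.unsqueeze(-1)).to(x.dtype)   # (B, K, N)
        return torch.bmm(one_hot, x.unsqueeze(-1)).squeeze(-1)                   # (B, K)

    x = torch.randn(4, 10)
    idx = torch.randint(0, 10, (4, 3), dtype=torch.int32)
    assert torch.allclose(gather_last_dim_int32(x, idx), torch.gather(x, -1, idx.long()))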

In its current state, my code can be exported with torch.jit.script without any issue; torch.jit.trace also works, but that renders the export useless, since the control statements become constants.

Also, combining jit.script and TRT works just fine: the model behaves correctly and only raises an error when saving, which is very strange. Why would an attempt to save a perfectly working model trigger an internal assert?

LucaBonfiglioli avatar Aug 03 '22 09:08 LucaBonfiglioli

I'm getting the same error; unfortunately my code is private too. Another big problem is that the error doesn't give any hint about the cause, so I don't even know where to search among the thousands of lines of my code.

domef avatar Aug 03 '22 16:08 domef

Well, it at least seems like the issue comes from how TorchScript manages classes and their associated methods; in normal PyTorch this would be the relationship between MyModule and forward in a module like this:

class MyModule(nn.Module):
    def __init__(self):
        ...

    def forward(self, ...):
        ...

It might be that we are not satisfying this relationship properly, but I'm not sure why this only appears when you try to script and serialize a hybrid nn.Module/Torch-TRT module.

We are wrapping up a release right now, and then I can take a deeper look. What would be helpful is if you could put together a minimal script which exhibits this behavior.

Also, are you able to use torch_executed_modules to help separate out TorchScript and TRT submodules, or is that too much work?
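For reference, that would look something like this (the module path passed to torch_executed_modules is hypothetical):

    trt_model = torch_tensorrt.ts.compile(
        torch.jit.script(model),
        inputs=[torch_tensorrt.Input(dummy.shape, dtype=torch.float32)],
        truncate_long_and_double=True,
        # keep instances of the listed module types running in TorchScript
        torch_executed_modules=["mypackage.modules.ControlFlowBlock"],
    )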

narendasan avatar Aug 12 '22 01:08 narendasan

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] avatar Dec 02 '22 00:12 github-actions[bot]