
🐛 [Bug] Advanced Indexing/GatherND compilation causes error: `isObject() INTERNAL ASSERT FAILED at "libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None`

Open chaoz-dev opened this issue 3 years ago • 4 comments

Bug Description

PyTorch uses advanced indexing to implement GatherND. For example:

x[[0,1], :, None, torch.tensor((0,1))]

is considered a valid indexing operation. However, when applying TensorRT compilation, we see the following error:

isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None

To Reproduce

Run with nvcr.io/nvidia/pytorch:22.07-py3 container (gathernd.py):

  import tensorrt
  import torch
  import torch_tensorrt

  class GatherNDListModel(torch.nn.Module):
      def __init__(self):
          super().__init__()

      def forward(self, x):
          return x[[1,0], :, [1,0]]


  class GatherNDTensorModel(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.index = torch.tensor((1,0))

      def forward(self, x):
          return x[self.index, :, self.index]

  if __name__ == '__main__':
      torch.manual_seed(0)
      t = torch.randn(3, 4, 5).cuda()

      try:
          m = GatherNDListModel()
          o = m(t)
          print(f"PyTorch: {o.shape}")

          mt = torch_tensorrt.compile(m, inputs=[t], truncate_long_and_double=True)
          o = mt(t)
          print(f"TRT: {o.shape}")
      except Exception as err:
          print(f"GatherND list indexing failed: {err}")

      try:
          m = GatherNDTensorModel()
          o = m(t)
          print(f"PyTorch: {o.shape}")

          mt = torch_tensorrt.compile(m, inputs=[t], truncate_long_and_double=True)
          o = mt(t)
          print(f"TRT: {o.shape}")
      except Exception as err:
          print(f"GatherND tensor indexing failed: {err}")

Outputs the following:

PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
GatherND list indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None
PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
GatherND tensor indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None

Note that the PyTorch (i.e. non-TRT) model outputs the expected shapes, but the TRT conversion fails.
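For reference, the expected `[2, 4]` shape follows from how the two index lists are paired while the sliced middle axis is kept whole. Below is a minimal pure-Python sketch of what `x[[1,0], :, [1,0]]` computes on a nested-list tensor. The helper name `gather_first_and_last` is hypothetical (it is not a PyTorch or Torch-TensorRT function), and the sketch assumes both index lists have the same length, so it does not cover broadcasting.

```python
def gather_first_and_last(x, idx0, idx2):
    """Pure-Python reference for x[idx0, :, idx2] on a 3-D nested list.

    Hypothetical helper for illustration only; assumes len(idx0) == len(idx2)
    (no broadcasting between the index lists).
    """
    if len(idx0) != len(idx2):
        raise ValueError("index lists must have the same length")
    # The index lists are paired element-wise; the sliced middle axis is kept.
    return [[row[k] for row in x[i]] for i, k in zip(idx0, idx2)]


# A 3 x 4 x 5 tensor whose entries encode their own position as 100*a + 10*b + c.
x = [[[100 * a + 10 * b + c for c in range(5)] for b in range(4)] for a in range(3)]

out = gather_first_and_last(x, [1, 0], [1, 0])
shape = (len(out), len(out[0]))
print(shape)  # (2, 4)
print(out)    # [[101, 111, 121, 131], [0, 10, 20, 30]]
```

Row 0 of the result is `x[1, :, 1]` and row 1 is `x[0, :, 0]`, which is why the output has one row per index pair and one column per element of the sliced axis.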

Expected behavior

The TRT compilation should succeed and produce the same output as the non-TRT version.

Environment

Container: nvcr.io/nvidia/pytorch:22.07-py3. Build information about Torch-TensorRT can be found by turning on debug messages.

Additional context

https://partners.nvidia.com/Bug/ViewBug/3735309

chaoz-dev avatar Aug 16 '22 18:08 chaoz-dev

I have a fix for this error, as well as for the general None case, but we hit the current limitation where the converter does not support more than one tensor used to index: https://github.com/pytorch/TensorRT/blob/c63a5a5717e28d1a195740356fffbb9afcff36a8/core/conversion/converters/impl/select.cpp#L287

@ruoqianguo Do you have more information on this limitation? Is this something we can address but have not had time for, or is it a fundamental TensorRT limitation?

narendasan avatar Aug 17 '22 05:08 narendasan

@narendasan The complete function is shown in this discussion (https://github.com/pytorch/TensorRT/pull/921#discussion_r823279468). I think we can address this problem, and I will try to support multiple index inputs.

ruoqianguo avatar Aug 17 '22 09:08 ruoqianguo

Just tested this on master d1768aa3d2c7d7d91d9f061e3e5dc5f976124dfe built in NGC pytorch:22.08-py3, but I'm still seeing the same errors running the above script:

root@2e1a9b9e2880:/opt/TensorRT# python /scripts/gathernd.py
PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
GatherND list indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None
PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
GatherND tensor indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None

chaoz-dev avatar Sep 06 '22 19:09 chaoz-dev

In the container pytorch/pytorch:1.12.0-cuda11.3-cudnn8-devel, I just tested this script on master d1768aa3d2c7d7d91d9f061e3e5dc5f976124dfe, and the results look correct. Could the difference be related to the container?

PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - For input x.1, found user specified input dtype as Float32. The compiler is going to use the user setting Float32
WARNING: [Torch-TensorRT] - If indices include negative values, the exported graph will produce incorrect results.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
TRT: torch.Size([2, 4])
PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - For input x.1, found user specified input dtype as Float32. The compiler is going to use the user setting Float32
WARNING: [Torch-TensorRT] - If indices include negative values, the exported graph will produce incorrect results.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.3.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
TRT: torch.Size([2, 4])

ruoqianguo avatar Sep 07 '22 01:09 ruoqianguo

Verified on v1.2 again. Closing. @chaoz-dev please comment if you're still facing this issue.

ncomly-nvidia avatar Sep 28 '22 18:09 ncomly-nvidia

Keeping this open since we are seeing some intermittent issues.

narendasan avatar Oct 12 '22 20:10 narendasan

Tested the latest release 1.2.0 using NGC nvcr.io/nvidia/pytorch:22.09-py3. Running the above script, I'm still seeing the same errors:

PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - For input x.1, found user specified input dtype as Float32. The compiler is going to use the user setting Float32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
**GatherND list indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None**
PyTorch: torch.Size([2, 4])
WARNING: [Torch-TensorRT] - For input x.1, found user specified input dtype as Float32. The compiler is going to use the user setting Float32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
**GatherND tensor indexing failed: isObject() INTERNAL ASSERT FAILED at "bazel-out/k8-opt/bin/external/libtorch/_virtual_includes/ATen/ATen/core/ivalue_inl.h":123, please report a bug to PyTorch. Expected Object but got None**

@ruoqianguo for viz. Your output above looks correct though, so it looks like this should work... checking the 1.2 release, looks like it should contain d1768aa3d2c7d7d91d9f061e3e5dc5f976124dfe, so we should be ahead of the working commit here...

chaoz-dev avatar Oct 13 '22 02:10 chaoz-dev

@chaoz-dev is this ok to close?

narendasan avatar Dec 15 '22 17:12 narendasan

This is good to close on my side

chaoz-dev avatar Dec 16 '22 01:12 chaoz-dev