
🐛 [Bug] Cannot compile SwinIR model (shape_analysis.cpp: Expected ivalues_maps.count(input) to be true but got false)

Open • arition opened this issue 2 years ago • 17 comments

Bug Description

Cannot compile the SwinIR model.

Error message:

Traceback (most recent call last):
  File "main.py", line 61, in <module>
    compile_tensorrt_model(torch.float)
  File "main.py", line 56, in compile_tensorrt_model
    compiled_model = torch_tensorrt.compile(traced_model, inputs=inputs, enabled_precisions=enabled_precisions,
  File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/_compile.py", line 125, in compile
    return torch_tensorrt.ts.compile(
  File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/ts/_compiler.py", line 136, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:167] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* 71852 produced from %71852 : Tensor = aten::add(%71851, %71850, %71848) in lowering graph for mini graph input.

To Reproduce

The original code is not properly typed, so I modified it a bit. Repo: https://github.com/arition/SwinIR-TensorRT

What I changed compared to the original code:

  • Add proper typing
  • Disable use_checkpoint
  • Disable all variants except real-world SR
  • Replace the modulo operator with a custom function, per https://github.com/pytorch/TensorRT/issues/1305 (see the sketch after this list)
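
For reference, a minimal sketch of the kind of modulo replacement the last item refers to, assuming the operands are Python integers (e.g. padding-size arithmetic); the actual function in the linked repo may differ:

def mod(a: int, b: int) -> int:
    # Equivalent to a % b for Python ints, expressed via floor division so the
    # scripted/traced graph avoids the modulo op that caused the conversion
    # problem described in pytorch/TensorRT#1305.
    return a - (a // b) * b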

To reproduce, download the pretrained weights (link in the code) and run main.py.
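
For context, the failing compile step looks roughly like this (my paraphrase of main.py based on the traceback above; load_swinir() is a hypothetical placeholder for the model construction and weight loading done in the repo, and the input shape is illustrative):

import torch
import torch_tensorrt

model = load_swinir().eval().cuda()          # hypothetical helper; see main.py in the repo
example = torch.randn(1, 3, 64, 64).cuda()   # illustrative shape, not necessarily the one in main.py
traced_model = torch.jit.trace(model, example)

compiled_model = torch_tensorrt.compile(
    traced_model,
    inputs=[torch_tensorrt.Input((1, 3, 64, 64))],
    enabled_precisions={torch.float},
)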

Expected behavior

The model compiles without problems.

Environment

I use the PyTorch 23.01-py3 container from NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

  • CPU Architecture: x64
  • OS (e.g., Linux): Linux
  • GPU models and configuration: RTX 4090

arition commented on Feb 21 '23

Any updates on this issue?

arition commented on Mar 31 '23

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented on Jun 30 '23

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented on Oct 09 '23

Still no updates?

arition commented on Oct 10 '23

Did you have any luck finding the issue @arition? I am using the PyTorch implementation of the Swin Transformer and the same problem is occurring: it is not able to find the aten::add op computed in the shifted window attention and throws the same error. I was able to get the model to compile with TorchScript alone, but it fails when combined with torch_tensorrt.

willianck commented on Jan 24 '24

I would appreciate it if a maintainer of this repo could point me in the right direction @narendasan

willianck commented on Jan 24 '24

Hi @willianck, could you please share the logs? Thanks!

bowang007 commented on Jan 24 '24

Sure, here it is.

Error message:

Traceback (most recent call last):
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 228, in <module>
    main()
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 220, in main
    batch_inference(args.model,
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 152, in batch_inference
    optimized_model = trt.compile(
  File "/home/manifold12/blend/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 185, in compile
    compiled_ts_module: torch.jit.ScriptModule = torchscript_compile(
  File "/home/manifold12/blend/lib/python3.10/site-packages/torch_tensorrt/ts/_compiler.py", line 151, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:183] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* attn.21 produced from %attn.21 : Tensor = aten::add(%attn.9, %36217, %46) # /home/manifold12/blend/lib/python3.10/site-packages/torchvision/models/swin_transformer.py:192:11 in lowering graph for mini graph input.

willianck commented on Jan 25 '24

@willianck, could you please share the full log? It looks like in some blocks of the IR the operations cannot capture the outer variables defined above them. I might have a fix for that. The full log would be very helpful here. Thanks!

bowang007 commented on Jan 25 '24

Got it, here is the full log at torch_tensorrt's debug log level. I put it in a file since it is quite long; let me know if that is okay, and I can alternatively paste the logs here. output.txt

willianck commented on Jan 25 '24

I was able to get it to work by using dynamo or torch_compile instead of torchscript for the ir argument of torch_tensorrt.compile(), as shown below.

import torch_tensorrt as trt

optimized_model = trt.compile(
    model,
    ir='dynamo',
    inputs=inputs,
    enabled_precisions=enabled_precisions,
)

With this new implementation I am now experiencing a separate issue. When testing at different batch sizes, memory consumption is abnormally high and leads to an OOM error at batch sizes I was able to run inference on before (without the torch.compile() call). This is similar to issue #1854, I suppose. Something else I noticed is that for certain batch sizes it would not throw an OOM error but would instead throw the error below, shown in a snippet of the full log:


INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.090343
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:01.634456
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 271888896 bytes of Memory
DEBUG: [Torch-TensorRT] - Serialized Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserializing Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserialized Device Info: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Target Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Setting Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU) as active device
INFO: [Torch-TensorRT] - The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
INFO: [Torch-TensorRT] - Loaded engine size: 20 MiB
DEBUG: [Torch-TensorRT] - Deserialization required 4252 microseconds.
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +20, now: CPU 0, GPU 4910 (MiB)
DEBUG: [Torch-TensorRT] - Total per-runner device persistent memory is 0
DEBUG: [Torch-TensorRT] - Total per-runner host persistent memory is 64
DEBUG: [Torch-TensorRT] - Allocated activation device memory of size 271888896
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +260, now: CPU 0, GPU 5170 (MiB)
DEBUG: [Torch-TensorRT] - CUDA lazy loading is enabled.
DEBUG: [Torch-TensorRT] - Input binding name: roll_13 has TensorRT binding index: 0, Torch binding index: 0
DEBUG: [Torch-TensorRT] - Input binding name: add_57 has TensorRT binding index: 1, Torch binding index: 1
DEBUG: [Torch-TensorRT] - Output binding name: output0 has TensorRT binding index: 3, Torch binding index: 2
DEBUG: [Torch-TensorRT] - Output binding name: output1 has TensorRT binding index: 2, Torch binding index: 3
DEBUG: [Torch-TensorRT] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_28_engine
  Inputs: [
    id: 0
      name: roll_13
      shape: [8, 35, 35, 512]
      dtype: Float
    id: 1
      name: add_57
      shape: [8, 32, 32, 512]
      dtype: Float
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [8, 35, 35, 512]
      dtype: Float
    id: 1
      name: output1
      shape: [8, 32, 32, 512]
      dtype: Float
  }
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
  Hardware Compatibility: Disabled

INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.021885
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.661630
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 151155200 bytes of Memory
DEBUG: [Torch-TensorRT] - Serialized Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserializing Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserialized Device Info: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Target Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Setting Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU) as active device
INFO: [Torch-TensorRT] - The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
INFO: [Torch-TensorRT] - Loaded engine size: 4 MiB
DEBUG: [Torch-TensorRT] - Deserialization required 1199 microseconds.
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4, now: CPU 0, GPU 5161 (MiB)
DEBUG: [Torch-TensorRT] - Total per-runner device persistent memory is 0
DEBUG: [Torch-TensorRT] - Total per-runner host persistent memory is 32
DEBUG: [Torch-TensorRT] - Allocated activation device memory of size 151155200
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +144, now: CPU 0, GPU 5305 (MiB)
DEBUG: [Torch-TensorRT] - CUDA lazy loading is enabled.
DEBUG: [Torch-TensorRT] - Input binding name: roll_14 has TensorRT binding index: 0, Torch binding index: 0
DEBUG: [Torch-TensorRT] - Output binding name: output0 has TensorRT binding index: 1, Torch binding index: 1
DEBUG: [Torch-TensorRT] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_30_engine
  Inputs: [
    id: 0
      name: roll_14
      shape: [8, 35, 35, 512]
      dtype: Float
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [8, 35, 35, 512]
      dtype: Float
  }
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
  Hardware Compatibility: Disabled

INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.087953
[01/26/2024-19:39:56] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[[SLICE]-[unknown_ir_ops.slice.Tensor]-[/features/5/__11/attn/slice_301]...[ELEMENTWISE]-[aten_ops.native_layer_norm.default]-[/features/5/__12/norm1/native_layer_norm_35_add_beta]]}.
[01/26/2024-19:39:56] [TRT] [E] 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[[SLICE]-[unknown_ir_ops.slice.Tensor]-[/features/5/__11/attn/slice_301]...[ELEMENTWISE]-[aten_ops.native_layer_norm.default]-[/features/5/__12/norm1/native_layer_norm_35_add_beta]]}.)
Traceback (most recent call last):
  File "/root/test_inference/real_time_inference.py", line 238, in <module>
    main()
  File "/root/test_inference/real_time_inference.py", line 230, in main
    batch_inference(args.model,
  File "/root/test_inference/real_time_inference.py", line 161, in batch_inference
    optimized_model = trt.compile(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 228, in compile
    trt_graph_module = dynamo_compile(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/_compiler.py", line 245, in compile
    return compile_module(gm, inputs, settings)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/_compiler.py", line 415, in compile_module
    trt_module = convert_module(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 75, in convert_module
    interpreter_result = interpret_module_to_result(module, inputs, settings)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 56, in interpret_module_to_result
    interpreter_result = interpreter.run()
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py", line 256, in run
    assert engine
AssertionError

I would also like to point out that when I ran these tests with the same batch sizes on a GPU with more memory (an RTX A6000), the memory consumption was still very high, but it did not run into the OOM error or the error shown above.
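
Not part of my original measurements, but a minimal sketch of how peak GPU memory could be compared across batch sizes to quantify this (measure_peak_memory is just an illustrative helper; memory managed directly by TensorRT may not be reflected in PyTorch's allocator statistics):

import torch

def measure_peak_memory(model, example_input):
    # Reset allocator statistics, run one forward pass, and report the peak
    # memory PyTorch allocated on the current device, in MiB.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(example_input)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2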

Environment

I am running all these tests in a container built on top of the PyTorch TensorRT Docker image as the base image: https://github.com/pytorch/TensorRT

  • CPU Architecture: x64
  • OS (e.g., Linux): Linux
  • GPU models and configuration: RTX 4090, RTX A6000
  • python version: 3.10
  • pytorch version: 2.3.0.dev
  • torch_tensorrt version: 2.3.0.dev
  • tensorrt version: 8.6.1
  • CUDA version: 12.1
  • CUDNN version: 8.9.2

willianck commented on Jan 26 '24

Looks like an operator conversion issue. @gs-olive @zewenli98 Do we support this aten_ops.native_layer_norm.default operation?

bowang007 commented on Jan 30 '24

@bowang007 - yes, there is support for that operator, as shown here: https://github.com/pytorch/TensorRT/blob/cf3a6887626c648e5747fdbfa5bc62b361a82b02/py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py#L123-L152
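
As a quick sanity check (a sketch, not something run here), the converter could be exercised in isolation by compiling a bare LayerNorm through the dynamo path; the shape below is borrowed from the engine dump earlier in this thread:

import torch
import torch_tensorrt as trt

ln = torch.nn.LayerNorm(512).cuda().eval()
x = torch.randn(8, 35, 35, 512).cuda()  # shape taken from the engine log above

compiled_ln = trt.compile(ln, ir='dynamo', inputs=[x], enabled_precisions={torch.float})
out = compiled_ln(x)
out = out[0] if isinstance(out, (tuple, list)) else out  # unwrap if a tuple is returned
print(torch.allclose(ln(x), out, atol=1e-3))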

gs-olive commented on Jan 31 '24

Any updates on this issue? @bowang007

willianck commented on Feb 07 '24

Any updates? @bowang007

arition commented on Feb 24 '24

After going through the logs:

  1. For the TorchScript path, it looks like there are some bugs in our partitioning workflow. That workflow was developed several years ago using a fairly naive greedy algorithm for graph segmentation, and since the TorchScript path is being deprecated, we don't plan to fix this there.
  2. For the dynamo path, I suspect there is a bug when converting layer_norm to TensorRT layers (a slice layer is introduced). Let me check with our dev team and run this model if possible (a possible interim workaround is sketched after this list).
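
The sketch below is only a guess at a mitigation while this is investigated, assuming torch_executed_ops in the dynamo path accepts the op name in this string form (the accepted format may vary by release, so please check the docs for the installed version); it keeps native_layer_norm in eager PyTorch so the failing fused SLICE + layer_norm TensorRT node is never formed:

import torch
import torch_tensorrt as trt

# Sketch of a possible mitigation: keep aten.native_layer_norm in PyTorch so the
# rest of the graph can still be lowered to TensorRT. `model`, `inputs`, and
# `enabled_precisions` are the same objects used in the earlier compile call.
# NOTE: the element format of torch_executed_ops (strings vs. op objects) may
# differ between torch_tensorrt versions.
optimized_model = trt.compile(
    model,
    ir='dynamo',
    inputs=inputs,
    enabled_precisions=enabled_precisions,
    torch_executed_ops={"torch.ops.aten.native_layer_norm.default"},
)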

Thanks!

bowang007 commented on Feb 26 '24

@bowang007 Thanks for your analysis! Looking forward to finding the actual cause and fixing the bug!

arition commented on Feb 26 '24