buildSerializedNetwork failure of TensorRT 10.1 on GPU A10G/3070 - `Error Code 2: Internal Error (Assertion mConfig.caskKlibMapPtr failed. )`

Open shingjan opened this issue 1 year ago • 2 comments

Description

I am using the PyTorch TensorRT library to compile a simple PyTorch model to TensorRT:

import torch

def func(x):
    return torch.ops.aten.clamp(x, 0, 1)

This worked before tensorrt==10.1.0, but with the latest release the build fails with `[06/18/2024-12:59:43] [TRT] [E] [builder.cpp::buildSerializedNetwork::858] Error Code 2: Internal Error (Assertion mConfig.caskKlibMapPtr failed. ) FAILED`. Because it is an internal error, there is little additional information to go on.

Build log:

[06/18/2024-20:29:19] [TRT] [W] ITensor::setType(Half) was called on non I/O tensor: [ELEMENTWISE]-[clamp]-[x1_clamp_max]_output. This will have no effect unless this tensor is marked as an output.
[06/18/2024-20:29:19] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/18/2024-20:29:21] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[06/18/2024-20:29:21] [TRT] [I] Total Host Persistent Memory: 256
[06/18/2024-20:29:21] [TRT] [I] Total Device Persistent Memory: 0
[06/18/2024-20:29:21] [TRT] [I] Total Scratch Memory: 0
[06/18/2024-20:29:21] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[06/18/2024-20:29:21] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.010141ms to assign 2 blocks to 2 nodes requiring 1024 bytes.
[06/18/2024-20:29:21] [TRT] [I] Total Activation Memory: 1024
[06/18/2024-20:29:21] [TRT] [I] Total Weights Memory: 0
[06/18/2024-20:29:21] [TRT] [I] Engine generation completed in 2.02002 seconds.
[06/18/2024-20:29:21] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1 MiB
[06/18/2024-20:29:21] [TRT] [E] [builder.cpp::buildSerializedNetwork::858] Error Code 2: Internal Error (Assertion mConfig.caskKlibMapPtr failed. )
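When triaging a build like this in a script or CI job, it can help to pull the error code and assertion text out of the log rather than reading it by eye. The following is a minimal sketch, not part of TensorRT: `parse_trt_error` and `TrtError` are hypothetical names, and the regular expression is inferred only from the single `[E]` line quoted in this issue, so other TensorRT versions or message shapes may not match.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one error line in this issue (assumption):
# [timestamp] [TRT] [E] [source location] Error Code N: Category (detail)
_TRT_ERROR_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+\[TRT\]\s+\[E\]\s+"
    r"\[(?P<location>[^\]]+)\]\s+"
    r"Error Code (?P<code>\d+):\s+(?P<category>[^(]+?)\s*"
    r"\((?P<detail>.*)\)\s*$"
)


class TrtError(NamedTuple):
    """Fields extracted from a TensorRT error log line (hypothetical helper type)."""
    timestamp: str
    location: str
    code: int
    category: str
    detail: str


def parse_trt_error(line: str) -> Optional[TrtError]:
    """Return the parsed error, or None if the line is not an [E] line of this shape."""
    match = _TRT_ERROR_RE.search(line)
    if match is None:
        return None
    return TrtError(
        timestamp=match["timestamp"],
        location=match["location"],
        code=int(match["code"]),
        category=match["category"].strip(),
        detail=match["detail"].strip(),
    )
```

Applied to the last line of the log, this would report code 2, category `Internal Error`, and the `mConfig.caskKlibMapPtr` assertion as the detail; informational `[I]` and warning `[W]` lines return `None`.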

Environment

TensorRT Version: 10.1.0.27

NVIDIA GPU: A10G/3070

NVIDIA Driver Version: 535

CUDA Version: 12.1

CUDNN Version: 8.9.6

Operating System:

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.1.0

Baremetal or Container (if so, version): baremetal Ubuntu 22.04

Relevant Files

Model link:

Steps To Reproduce

I have seen this issue with both 10.1.0.20 and 10.1.0.27, so it looks like a regression in the 10.1 release. Could it be caused by missing kernel libraries? Let me know if this has not been reported before and you need a repro.

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

shingjan, Jun 18 '24 20:06

This issue is caused by builder.reset(). I have a better repro:

import tensorrt as trt

def test(use_reset = False):
    builder = trt.Builder(trt.Logger())
    config = builder.create_builder_config()
    if use_reset:
        builder.reset()  # this triggers the engine build failure
    network = builder.create_network(0)
    x = network.add_input('x', trt.float32, (1,))
    y = network.add_activation(x, trt.ActivationType.RELU).get_output(0)
    network.mark_output(y)
    engine = builder.build_serialized_network(network, config)
    assert engine is not None
    print("pass")

test(use_reset=False)
test(use_reset=True)

haijieg, Jun 21 '24 00:06

Calling reset() frees resources that the build process needs. I checked the documentation, and it is unclear on this point: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/Builder.html#tensorrt.Builder.reset

Would updating the documentation help here? Or could you elaborate on why you want to call reset() before build_serialized_network? Thanks!

ttyio, Aug 07 '24 05:08

@shingjan, I will be closing this ticket due to our policy of closing tickets with no activity for more than 21 days after a reply has been posted. Please open a new ticket if you still need help.

moraxu, Sep 07 '24 01:09