
Segmentation fault for TensorRT 8.6 when loading ONNX model via C++ API on GPU V100

Open · lhai37 opened this issue on Jan 24, 2024 · 1 comment

Description

I tried to use the C++ API to load the attached ONNX model, but it fails with a segmentation fault (core dumped). Note: this is possibly related to https://github.com/NVIDIA/TensorRT/issues/3630; it is the same model but with a fixed batch size of 1.
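
The model's fixed input shapes can be confirmed with polygraphy, e.g. (a sketch, assuming the attached file name):

polygraphy inspect model trtcppapi_segfault.onnx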

Environment

TensorRT Version: 8.6.1.6

NVIDIA GPU: V100

NVIDIA Driver Version: 545.23.08

CUDA Version: 12.1

CUDNN Version: 8.9.0.131-1+cuda12.1

Operating System: Ubuntu 20.04

Python Version (if applicable): N/A

Tensorflow Version (if applicable): N/A

PyTorch Version (if applicable): N/A

Baremetal or Container (if so, version): N/A

Relevant Files

Model link: https://drive.google.com/file/d/1uoy0EcJj8BKq1F_Fd8HuYofye9KkPBHu/view?usp=sharing

Output Log: trtsegfault.txt

Steps To Reproduce

Use this C++ code, which follows the sample for loading an ONNX model:

#include "NvInfer.h"
#include "NvInferPlugin.h"
#include "NvOnnxConfig.h"
#include "NvOnnxParser.h"

class TestLogger : public nvinfer1::ILogger {
 public:
  void log(Severity severity,
           nvinfer1::AsciiChar const* msg) noexcept override {
    std::cout << msg << std::endl;
  }
};

TestLogger logger;
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
auto model = "trtcppapi_segfault.onnx";
auto parsed = parser->parseFromFile(model, 0);
cudaStream_t profile_stream = 0;
cudaStreamCreate(&profile_stream);
config->setProfileStream(profile_stream);
nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
nvinfer1::IRuntime* mRuntime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* mEngine = mRuntime->deserializeCudaEngine(plan->data(), plan->size());
nvinfer1::IExecutionContext* context = mEngine->createExecutionContext();
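
With the above saved as repro.cpp, a build command along these lines should work (a sketch; the include and library paths are assumptions, adjust for your install):

g++ -std=c++14 repro.cpp -o repro \
    -I/usr/local/cuda/include \
    -L/usr/local/cuda/lib64 \
    -lnvinfer -lnvinfer_plugin -lnvonnxparser -lcudart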

Have you tried the latest release?: Yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, it runs with polygraphy 0.49.0 in the same environment:

polygraphy run trtcppapi_segfault.onnx --onnxrt
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[I] RUNNING | Command: polygraphy run trtcppapi_segfault.onnx --onnxrt
[I] onnxrt-runner-N0-01/24/24-12:37:23  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-01/24/24-12:37:23 
    ---- Inference Input(s) ----
    {img [dtype=float32, shape=(1, 3, 720, 1280)],
     seg [dtype=float32, shape=(1, 1, 720, 1280)]}
[I] onnxrt-runner-N0-01/24/24-12:37:23 
    ---- Inference Output(s) ----
    {mask [dtype=float32, shape=(1, 1, 720, 1280)]}
[I] onnxrt-runner-N0-01/24/24-12:37:23  | Completed 1 iteration(s) in 5096 ms | Average inference time: 5096 ms.
[I] PASSED | Runtime: 6.203s | Command: polygraphy run trtcppapi_segfault.onnx --onnxrt
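
For completeness, the same tool can drive TensorRT directly (a sketch; the --trt backend builds an engine in-process, so it may hit the same crash):

polygraphy run trtcppapi_segfault.onnx --trt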

lhai37 · Jan 24 '24 20:01

@zerollzeng I encountered this issue as well with the same TensorRT version. When compiling and running the sampleOnnxMNIST sample, the same problem appeared, but with a clearer error message:

&&&& RUNNING TensorRT.sample_onnx_mnist [TensorRT v8601] # ./sample_onnx_mnist
[02/29/2024-11:04:54] [I] Building and running a GPU inference engine for Onnx MNIST
[02/29/2024-11:04:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 19, GPU 252 (MiB)
[02/29/2024-11:04:54] [E] [TRT] 6: [libLoader.cpp::Impl::293] Error Code 6: Internal Error (Unable to load library: libnvinfer_builder_resource.so.8.6.1: libnvinfer_builder_resource.so.8.6.1: cannot open shared object file: No such file or directory)
&&&& FAILED TensorRT.sample_onnx_mnist [TensorRT v8601] # ./sample_onnx_mnist

I believe it's related to this issue: https://github.com/NVIDIA/TensorRT/issues/2218. Patching the libnvinfer rpath as described there solved my issue.
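
For anyone hitting the same error, a sketch of that workaround (the TensorRT install path below is an assumption; substitute your own):

# Option 1: point the dynamic loader at the directory containing
# libnvinfer_builder_resource.so.8.6.1
export LD_LIBRARY_PATH=/path/to/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH

# Option 2: bake that directory into libnvinfer's own rpath
patchelf --set-rpath '/path/to/TensorRT-8.6.1.6/lib' \
    /path/to/TensorRT-8.6.1.6/lib/libnvinfer.so.8.6.1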

glenguo06 · Feb 29 '24 11:02