[TRT] [E] 2: [ltWrapper.cpp::setupHeuristic::349] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed. )
Description
Does anyone know about this issue? I used the two patches provided on the NVIDIA official website for CUDA 10.2, but they only work for converting the model from ONNX to TRT; this issue still occurs when evaluating with TensorRT.
Environment
TensorRT Version: 8.4.1.5
NVIDIA GPU: Tesla V100
NVIDIA Driver Version: 440.33.01
CUDA Version: 10.2
Operating System: Ubuntu 18.04
Python Version (if applicable): 3.7.10
PyTorch Version (if applicable): 1.15
Baremetal or Container (if so, version): Docker 20.10.7
Program
import torch
import torchvision.models as models
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import time

BATCH_SIZE = 32
USE_FP16 = True
resnext50 = models.resnext50_32x4d(num_classes=10)
dummy_input = torch.randn([BATCH_SIZE, 3, 224, 224], dtype=torch.float16)
resnext50.half()
resnext50, dummy_input = resnext50.cuda(), dummy_input.cuda()
torch.onnx.export(resnext50, dummy_input, 'resnext50.onnx', verbose=False)
os.system(r'./trtexec --onnx=resnext50.onnx --saveEngine=resnext50.trt --explicitBatch=32 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16')

target_dtype = np.float16 if USE_FP16 else np.float32
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]

stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])

for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
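A minimal sanity check for I/O mismatches like this, sketched with the TensorRT 8.x binding API and assuming the engine object created by the script above:

# Sketch: list every binding's direction, name, shape, and dtype, so the
# numpy buffers above can be matched against what the engine actually
# expects (e.g. fp16 IO after building with --inputIOFormats=fp16:chw).
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i),
          engine.get_binding_shape(i), engine.get_binding_dtype(i))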
After exporting to ONNX, can you run the model with trtexec? I suspect torch and TRT may be using different CUDA libraries.
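For example, something along these lines (a minimal sketch; the paths assume the files produced by the script above):

import os

# Hypothetical sanity checks: first confirm the ONNX model parses and
# builds on its own, then run the already-serialized engine directly.
os.system(r'./trtexec --onnx=resnext50.onnx --fp16')
os.system(r'./trtexec --loadEngine=resnext50.trt')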
Sure, you can export an ONNX model with PyTorch, plus the two patches for CUDA 10.2; they do work for converting the model from ONNX to TRT with trtexec, but this issue still occurs when you try to run predictions with the TRT file.
Can you share the onnx model here?
Sure, I uploaded it to MEGA Cloud; here is the link: https://mega.nz/file/ztNjESbT#AN6XshkvQQEq7TtCDqVxn1VKGg8_JWLiO5638ecBAv0 One thing I realized: TensorRT works with its sample programs but does not with Python.
Looks like there are similar issues: https://github.com/NVIDIA/TensorRT/issues/1818 and https://github.com/NVIDIA/TensorRT/issues/2123. Can you check your cublasLt version in the log?
also https://github.com/NVIDIA/TensorRT/issues/866
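One way to surface those versions is to raise the logger severity when deserializing; a minimal sketch, assuming the same resnext50.trt file (the linked-versus-loaded cuBLAS/cuDNN messages should then show up in the output):

import tensorrt as trt

# With a more verbose logger, TensorRT prints its library init messages
# and any linked-versus-loaded version mismatch warnings while the
# engine is being deserialized.
logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)
with open("resnext50.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())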
Yep, it's cuBLAS 10.2.3.254. And I noticed a warning message printed before this error; pasting it below:

[07/05/2022-10:00:08] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.0
[07/05/2022-10:00:10] [TRT] [E] 2: [ltWrapper.cpp::setupHeuristic::349] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed. )
[07/05/2022-10:00:10] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

It works with the C++ programs but does not with Python.
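That cuDNN 8.4.1 versus 8.2.0 warning hints that another library in the process (most likely torch) loads its own cuDNN before TensorRT does. A quick check of what torch brings along, as a minimal sketch:

import torch

# Report the CUDA and cuDNN builds bundled with torch; if these are older
# than the ones TensorRT was linked against, importing torch first can
# produce exactly the linked-versus-loaded warning shown above.
print(torch.version.cuda)              # e.g. '10.2'
print(torch.backends.cudnn.version())  # e.g. 8200 for cuDNN 8.2.0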
Can you try removing
import torch
import torchvision.models as models
and all the torch code from your script? Only leave the TRT part, like:
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import time
# build engine with trtexec
BATCH_SIZE = 32  # assumed: must match the batch size the engine was built with
USE_FP16 = True  # assumed: matches the --fp16 build above
target_dtype = np.float16 if USE_FP16 else np.float32
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])
for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
No, it doesn't work. I also tried resolving all the warning messages, but that didn't help either.
I couldn't reproduce this in my environment with CUDA 11.6. Also, it seems you are missing import pycuda.autoinit. Can you try upgrading to CUDA 11?
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
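For background: pycuda.autoinit creates and activates a CUDA context as an import side effect, and without an active context the cuBLAS/cuBLASLt handles TensorRT creates internally may fail, which could plausibly surface as this CUBLAS_STATUS_SUCCESS assertion. If you would rather not rely on the import side effect, a minimal sketch of the equivalent manual setup:

import pycuda.driver as cuda

cuda.init()                          # initialize the CUDA driver API
ctx = cuda.Device(0).make_context()  # create and activate a context on GPU 0
try:
    pass  # ... run the deserialization and inference code here ...
finally:
    ctx.pop()                        # deactivate the context when done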
My code:
import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import time
# build engine with trtexec
BATCH_SIZE=32
target_dtype = np.float16
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])
for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
Oh, it works? OK, I'm going to try upgrading my environment to the matched versions. Also, I now think this issue is caused by the Python libs, because the C++ programs work with my current settings and environment, so TensorRT, CUDA, and cuDNN themselves are all fine.
I've resolved this by reinstalling the environment; there may have been some problems with my Docker settings or something.
Closing old issues that have been inactive for a long time. Thanks, all!