[TRT] [E] 2: [ltWrapper.cpp::setupHeuristic::349] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed. )
Description
Does anyone know about this issue? I used the two patches provided on the NVIDIA official website for CUDA 10.2, but they only work for converting the model from ONNX to TRT; this issue still occurs when evaluating with TensorRT.
Environment
TensorRT Version: 8.4.1.5
NVIDIA GPU: Tesla V100
NVIDIA Driver Version: 440.33.01
CUDA Version: 10.2
Operating System: Ubuntu 18.04
Python Version (if applicable): 3.7.10
PyTorch Version (if applicable): 1.15
Baremetal or Container (if so, version): Docker 20.10.7
Program
import torch
import torchvision.models as models
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import time

BATCH_SIZE = 32
USE_FP16 = True
resnext50 = models.resnext50_32x4d(num_classes=10)
dummy_input = torch.randn([BATCH_SIZE, 3, 224, 224], dtype=torch.float16)
resnext50.half()
resnext50, dummy_input = resnext50.cuda(), dummy_input.cuda()
torch.onnx.export(resnext50, dummy_input, 'resnext50.onnx', verbose=False)
os.system(r'./trtexec --onnx=resnext50.onnx --saveEngine=resnext50.trt --explicitBatch=32 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16')

target_dtype = np.float16 if USE_FP16 else np.float32
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]

stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])

for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
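A minimal sanity check for I/O mismatches like this, sketched with the TensorRT 8.x binding API and assuming the engine object created by the script above:

# Sketch: list every binding's direction, name, shape, and dtype, so the
# numpy buffers above can be matched against what the engine actually
# expects (e.g. fp16 IO after building with --inputIOFormats=fp16:chw).
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i),
          engine.get_binding_shape(i), engine.get_binding_dtype(i))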
After exporting to ONNX, can you run the model with trtexec? I suspect torch and TRT may be using different CUDA libraries.
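For example, something along these lines (a minimal sketch; the paths assume the files produced by the script above):

import os

# Hypothetical sanity checks: first confirm the ONNX model parses and
# builds on its own, then run the already-serialized engine directly.
os.system(r'./trtexec --onnx=resnext50.onnx --fp16')
os.system(r'./trtexec --loadEngine=resnext50.trt')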
Sure, you can export an ONNX model with PyTorch, plus the two patches for CUDA 10.2; they do work for converting the model from ONNX to TRT with trtexec, but this issue still occurs when you try to run predictions with the TRT file.
Can you share the onnx model here?
Sure, I uploaded it to MEGA Cloud; here is the link: https://mega.nz/file/ztNjESbT#AN6XshkvQQEq7TtCDqVxn1VKGg8_JWLiO5638ecBAv0 One thing I realized: TensorRT works with its sample programs but does not with Python.
Looks like there are similar issues: https://github.com/NVIDIA/TensorRT/issues/1818 and https://github.com/NVIDIA/TensorRT/issues/2123. Can you check your cublasLt version in the log?
also https://github.com/NVIDIA/TensorRT/issues/866
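One way to surface those versions is to raise the logger severity when deserializing; a minimal sketch, assuming the same resnext50.trt file (the linked-versus-loaded cuBLAS/cuDNN messages should then show up in the output):

import tensorrt as trt

# With a more verbose logger, TensorRT prints its library init messages
# and any linked-versus-loaded version mismatch warnings while the
# engine is being deserialized.
logger = trt.Logger(trt.Logger.VERBOSE)
runtime = trt.Runtime(logger)
with open("resnext50.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())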
Yep, it's cuBLAS 10.2.3.254. And I noticed a warning message printed before this error; pasting it below:

[07/05/2022-10:00:08] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.0
[07/05/2022-10:00:10] [TRT] [E] 2: [ltWrapper.cpp::setupHeuristic::349] Error Code 2: Internal Error (Assertion cublasStatus == CUBLAS_STATUS_SUCCESS failed. )
[07/05/2022-10:00:10] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

It works with the C++ programs but does not with Python.
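That cuDNN 8.4.1 versus 8.2.0 warning hints that another library in the process (most likely torch) loads its own cuDNN before TensorRT does. A quick check of what torch brings along, as a minimal sketch:

import torch

# Report the CUDA and cuDNN builds bundled with torch; if these are older
# than the ones TensorRT was linked against, importing torch first can
# produce exactly the linked-versus-loaded warning shown above.
print(torch.version.cuda)              # e.g. '10.2'
print(torch.backends.cudnn.version())  # e.g. 8200 for cuDNN 8.2.0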
Can you try removing
import torch
import torchvision.models as models
and all the torch code from your script? Only leave the TRT part, like:
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import time
# build engine with trtexec
BATCH_SIZE = 32  # assumed: must match the batch size the engine was built with
USE_FP16 = True  # assumed: matches the --fp16 build above
target_dtype = np.float16 if USE_FP16 else np.float32
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])
for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
No, it doesn't work. I also tried resolving all the warning messages, but that didn't help either.
I couldn't reproduce this in my environment with CUDA 11.6. Also, it seems you are missing import pycuda.autoinit. Can you try upgrading to CUDA 11?
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
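For background: pycuda.autoinit creates and activates a CUDA context as an import side effect, and without an active context the cuBLAS/cuBLASLt handles TensorRT creates internally may fail, which could plausibly surface as this CUBLAS_STATUS_SUCCESS assertion. If you would rather not rely on the import side effect, a minimal sketch of the equivalent manual setup:

import pycuda.driver as cuda

cuda.init()                          # initialize the CUDA driver API
ctx = cuda.Device(0).make_context()  # create and activate a context on GPU 0
try:
    pass  # ... run the deserialization and inference code here ...
finally:
    ctx.pop()                        # deactivate the context when done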
My code:
import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import time
# build engine with trtexec
BATCH_SIZE=32
target_dtype = np.float16
f = open("resnext50.trt", "rb")
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_batch = np.random.randn(BATCH_SIZE, 224, 224, 3).astype(target_dtype)
output = np.empty([BATCH_SIZE, 10], dtype = target_dtype)
d_input = cuda.mem_alloc(1 * input_batch.nbytes)
d_output = cuda.mem_alloc(1 * output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()
preprocessed_inputs = np.array([input.transpose([2, 0, 1]) for input in input_batch])
for i in range(1000):
    t0 = time.time()
    cuda.memcpy_htod_async(d_input, preprocessed_inputs, stream)
    # context.execute_async_v2(bindings, stream.handle, None)
    # context.execute_async(BATCH_SIZE, bindings, stream.handle)
    context.execute_v2(bindings)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    t = time.time() - t0
    print("\rPrediction cost {:.4f}s".format(t), end='')
print(output[0])
Oh, it works? OK, I'm going to try upgrading my environment to the matched versions. Also, I now think this issue is caused by the Python libs, because the C++ programs work with my current settings and environment, so TensorRT, CUDA, and cuDNN themselves are all fine.
I've resolved this by reinstalling the environment; there may have been some problems with my Docker settings or something.
Closing old issues that have been inactive for a long time. Thanks, all!