After TensorRT init finishes, my own CUDA stream becomes invalid
Environment: TensorRT 8.2.3.0, nvJPEG 11.6.2.8, Tesla T4 GPU, driver 510.47.03, Ubuntu x86_64. Here's my situation: I built a dynamic library that encapsulates the nvJPEG decoder operations. It contains mainly two functions: init(), which creates the stream, buffers, and so on, and decode(), which uses the handles set up by init() to do the decode work. I also have a main program, written in Python, that uses TensorRT to do the predict work and dlopens the dynamic library with the DEEPBIND option.
Here is the problem: if I call the dynamic library's init() first (which contains cudaStreamCreateWithFlags) and then initialize TensorRT, cudaEventRecord returns error 400, "invalid resource handle", which means the stream is invalid.
If I instead call init() after the TensorRT init, everything works fine.
So what is the real problem? Does TensorRT invalidate all of the process's streams?
Can you provide a minimal reproducer here?
Here is the code. This is my nvJPEG init() function; it is C++ code compiled as a dynamic library (named libaipnvjpg.so), and I use a version script / retain-symbols-file to export only my own code's symbols (a sketch of that script follows the init() code below):
CHECK_NVJPEG(nvjpegCreateEx(way, &dev_allocator, &pinned_allocator, NVJPEG_FLAGS_DEFAULT, &_ctx.nvjpeg_handle));
CHECK_NVJPEG(nvjpegJpegStateCreate(_ctx.nvjpeg_handle, &_ctx.nvjpeg_state));
CHECK_NVJPEG(nvjpegDecoderCreate(_ctx.nvjpeg_handle, NVJPEG_BACKEND_DEFAULT, &_ctx.nvjpeg_decoder));
CHECK_NVJPEG(nvjpegDecoderStateCreate(_ctx.nvjpeg_handle, _ctx.nvjpeg_decoder, &_ctx.nvjpeg_decoupled_state));
CHECK_NVJPEG(nvjpegBufferPinnedCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.pinned_buffers[0]));
CHECK_NVJPEG(nvjpegBufferPinnedCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.pinned_buffers[1]));
CHECK_NVJPEG(nvjpegBufferDeviceCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.device_buffer));
CHECK_NVJPEG(nvjpegJpegStreamCreate(_ctx.nvjpeg_handle, &_ctx.jpeg_streams[0]));
CHECK_NVJPEG(nvjpegJpegStreamCreate(_ctx.nvjpeg_handle, &_ctx.jpeg_streams[1]));
CHECK_NVJPEG(nvjpegDecodeParamsCreate(_ctx.nvjpeg_handle, &_ctx.nvjpeg_decode_params));
...
CHECK_CUDA(cudaStreamCreateWithFlags(&_ctx.stream, cudaStreamNonBlocking)); // this _ctx.stream is invalid after trt init
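For reference, the version script mentioned above looks roughly like this (a sketch of the --version-script variant; the aipnvjpg_* prefix is a placeholder for my real exported names):

/* aipnvjpg.map, passed to the linker as -Wl,--version-script=aipnvjpg.map.
   Only our own symbols stay global; everything else, including the
   statically linked nvJPEG code, becomes local to the library. */
{
  global:
    aipnvjpg_*;
  local:
    *;
};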
And then I use Python to do the TRT init work, like this:
import ctypes

import pycuda.driver as cuda
import tensorrt as trt
...
class Buffer:
    ...
    def _allocate_mem(self, engine):
        for name in engine:
            size = trt.volume(self._name2shape[name])
            # shape = engine.get_binding_shape(name)
            dtype = trt.nptype(engine.get_binding_dtype(name))
            hdm = HostDeviceMem(dtype, size)
            self._bindings.append(hdm.binding)
            if engine.binding_is_input(name=name):
                self._inputs[name] = hdm
            else:
                self._outputs[name] = hdm
    ...
class TRTPredictor:
    def __init__(self, config=None):
        # load plugin
        lib_plugin = config.get("lib_plugin", "")
        if lib_plugin:
            ctypes.CDLL(lib_plugin)
        cuda.init()
        self._cuda_device = cuda.Device(0)
        self._cuda_ctx = self._cuda_device.make_context()
        model_path = config["model_path"]
        print(model_path)
        self._engine = TRTPredictor.load_engine(model_path)
        self._context = self._engine.create_execution_context()
        self._buffer = Buffer(self._engine, config["buffer"])
        self._stream = cuda.Stream()
        self._curr_batch_size = 1
        ...

    @staticmethod
    def load_engine(model_path):
        TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
        plan = open(model_path, "rb").read()
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(plan)
    ...
Then we do the predict job and call the decode() function in the nvJPEG dynamic lib first. (I use pybind11 to bind my C++ functions to Python, and the wrapper code uses dlopen("libaipnvjpg.so", RTLD_LAZY | RTLD_LOCAL | RTLD_DEEPBIND) to open my dynamic lib.) Here is the decode() function:
......
nvjpegGetImageInfo(_ctx.nvjpeg_handle, p_img_data, img_size, &r_ctx.channel_num, &subsampling, r_ctx.widths, r_ctx.heights);
......
CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(&_ibuf[buf_idx].channel[c]), sz)); // gpu memory pool
CHECK_CUDA(cudaStreamSynchronize(_ctx.stream)); // fails here with code 400, invalid resource handle
.......
In addition, I build my dynamic library against the nvJPEG static library.
This is because in TRTPredictor's __init__, make_context() creates a new CUDA context and makes it current, replacing the primary (default) context in which your stream was created. A CUDA stream is only valid inside the context it was created in. So to use that stream, you first need to pop the new context you created, so that the currently active context goes back to the primary context.
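Here is a minimal sketch of that fix using the snippets above (decoder and jpeg_bytes are placeholder names for your pybind11 module and input, not an actual API):

decoder.init()                    # creates _ctx.stream in the primary context
predictor = TRTPredictor(config)  # make_context() pushes a new context and leaves it current

# Pop the TensorRT context so the primary context (where _ctx.stream
# was created) is current again before touching the stream.
predictor._cuda_ctx.pop()
decoder.decode(jpeg_bytes)        # cudaStreamSynchronize(_ctx.stream) now succeeds

# Push the TensorRT context back before running inference.
predictor._cuda_ctx.push()
# ... TensorRT predict work ...

pop() and push() are pycuda's wrappers around cuCtxPopCurrent/cuCtxPushCurrent. Alternatively, creating the stream while the TensorRT context is current avoids the mismatch entirely, which is why moving init() after the TensorRT init also works.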
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!