After TensorRT init finishes, my own CUDA stream becomes invalid
Environment: TensorRT 8.2.3.0, nvJPEG 11.6.2.8, Tesla T4 GPU, driver 510.47.03, Ubuntu x86_64. Here's my situation: I built a dynamic library that encapsulates the nvJPEG decoder operations. It contains mainly two functions: init(), which creates the stream, buffers, and so on, and decode(), which uses the handles set up by init() to do the decode work. I also have a main program, written in Python, that uses TensorRT to do the predict work and dlopens the dynamic library with the DEEPBIND option.
Here is the problem: if I call the dynamic library's init() first (which contains cudaStreamCreateWithFlags) and then initialize TensorRT, cudaEventRecord returns error 400, "invalid resource handle", which means the stream is invalid.
If I instead call init() after the TensorRT init, everything works fine.
So what is the real problem? Does TensorRT invalidate all of the process's streams?
Can you provide a minimal reproducer here?
Here is the code. This is my nvJPEG init() function; it is C++ code compiled as a dynamic library (named libaipnvjpg.so), and I use a version script / retain-symbols-file to export only my own code's symbols (a sketch of that script follows the init() code below):
CHECK_NVJPEG(nvjpegCreateEx(way, &dev_allocator, &pinned_allocator, NVJPEG_FLAGS_DEFAULT, &_ctx.nvjpeg_handle));
CHECK_NVJPEG(nvjpegJpegStateCreate(_ctx.nvjpeg_handle, &_ctx.nvjpeg_state));
CHECK_NVJPEG(nvjpegDecoderCreate(_ctx.nvjpeg_handle, NVJPEG_BACKEND_DEFAULT, &_ctx.nvjpeg_decoder));
CHECK_NVJPEG(nvjpegDecoderStateCreate(_ctx.nvjpeg_handle, _ctx.nvjpeg_decoder, &_ctx.nvjpeg_decoupled_state));
CHECK_NVJPEG(nvjpegBufferPinnedCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.pinned_buffers[0]));
CHECK_NVJPEG(nvjpegBufferPinnedCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.pinned_buffers[1]));
CHECK_NVJPEG(nvjpegBufferDeviceCreate(_ctx.nvjpeg_handle, nullptr, &_ctx.device_buffer));
CHECK_NVJPEG(nvjpegJpegStreamCreate(_ctx.nvjpeg_handle, &_ctx.jpeg_streams[0]));
CHECK_NVJPEG(nvjpegJpegStreamCreate(_ctx.nvjpeg_handle, &_ctx.jpeg_streams[1]));
CHECK_NVJPEG(nvjpegDecodeParamsCreate(_ctx.nvjpeg_handle, &_ctx.nvjpeg_decode_params));
...
CHECK_CUDA(cudaStreamCreateWithFlags(&_ctx.stream, cudaStreamNonBlocking)); // this _ctx.stream is invalid after trt init
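For reference, the version script mentioned above looks roughly like this (a sketch of the --version-script variant; the aipnvjpg_* prefix is a placeholder for my real exported names):

/* aipnvjpg.map, passed to the linker as -Wl,--version-script=aipnvjpg.map.
   Only our own symbols stay global; everything else, including the
   statically linked nvJPEG code, becomes local to the library. */
{
  global:
    aipnvjpg_*;
  local:
    *;
};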
And then I use Python to do the TRT init work, like this:
import ctypes

import pycuda.driver as cuda
import tensorrt as trt
...
class Buffer:
    ...
    def _allocate_mem(self, engine):
        for name in engine:
            size = trt.volume(self._name2shape[name])
            # shape = engine.get_binding_shape(name)
            dtype = trt.nptype(engine.get_binding_dtype(name))
            hdm = HostDeviceMem(dtype, size)
            self._bindings.append(hdm.binding)
            if engine.binding_is_input(name=name):
                self._inputs[name] = hdm
            else:
                self._outputs[name] = hdm
    ...
class TRTPredictor:
    def __init__(self, config=None):
        # load plugin
        lib_plugin = config.get("lib_plugin", "")
        if lib_plugin:
            ctypes.CDLL(lib_plugin)
        cuda.init()
        self._cuda_device = cuda.Device(0)
        self._cuda_ctx = self._cuda_device.make_context()
        model_path = config["model_path"]
        print(model_path)
        self._engine = TRTPredictor.load_engine(model_path)
        self._context = self._engine.create_execution_context()
        self._buffer = Buffer(self._engine, config["buffer"])
        self._stream = cuda.Stream()
        self._curr_batch_size = 1
        ...

    @staticmethod
    def load_engine(model_path):
        TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
        plan = open(model_path, "rb").read()
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(plan)
    ...
Then we do the predict job and call the decode() function in the nvJPEG dynamic lib first. (I use pybind11 to bind my C++ functions to Python, and the wrapper code uses dlopen("libaipnvjpg.so", RTLD_LAZY | RTLD_LOCAL | RTLD_DEEPBIND) to open my dynamic lib.) Here is the decode() function:
......
nvjpegGetImageInfo(_ctx.nvjpeg_handle, p_img_data, img_size, &r_ctx.channel_num, &subsampling, r_ctx.widths, r_ctx.heights);
......
CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(&_ibuf[buf_idx].channel[c]), sz)); // gpu memory pool
CHECK_CUDA(cudaStreamSynchronize(_ctx.stream)); // fails here with code 400, invalid resource handle
.......
In addition, I build my dynamic library against the nvJPEG static library.
This is because in TRTPredictor's __init__, make_context() creates a new CUDA context and makes it current, replacing the primary (default) context in which your stream was created. A CUDA stream is only valid inside the context it was created in. So to use that stream, you first need to pop the new context you created, so that the currently active context goes back to the primary context.
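Here is a minimal sketch of that fix using the snippets above (decoder and jpeg_bytes are placeholder names for your pybind11 module and input, not an actual API):

decoder.init()                    # creates _ctx.stream in the primary context
predictor = TRTPredictor(config)  # make_context() pushes a new context and leaves it current

# Pop the TensorRT context so the primary context (where _ctx.stream
# was created) is current again before touching the stream.
predictor._cuda_ctx.pop()
decoder.decode(jpeg_bytes)        # cudaStreamSynchronize(_ctx.stream) now succeeds

# Push the TensorRT context back before running inference.
predictor._cuda_ctx.push()
# ... TensorRT predict work ...

pop() and push() are pycuda's wrappers around cuCtxPopCurrent/cuCtxPushCurrent. Alternatively, creating the stream while the TensorRT context is current avoids the mismatch entirely, which is why moving init() after the TensorRT init also works.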
Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!