
[Bug]: ocr.py "if False and ort.get_device() == "GPU":" should be or

Open EakAip opened this issue 9 months ago • 10 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch name

deepdoc

Commit ID

vwvwva

Other environment information

4090

Expected behavior

The condition "if False and ort.get_device() == "GPU":" in ocr.py always evaluates to False, so the GPU branch is never taken; the check should be fixed (e.g. by removing the "False and") so that the GPU is used when ort.get_device() returns "GPU".

EakAip avatar May 16 '24 02:05 EakAip

I discovered that onnx-runtime didn't support GPU well. So, if you want to try it, just remove the 'False' and let it use GPU.
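For reference, here is a minimal sketch of what that change amounts to; the variable model_file_path and the surrounding structure are paraphrased, not the literal ocr.py source:

import onnxruntime as ort

# Paraphrased sketch: without the hard-coded "False and", the GPU branch
# becomes reachable when a CUDA-enabled build of onnxruntime is installed.
options = ort.SessionOptions()
if ort.get_device() == "GPU":
    sess = ort.InferenceSession(model_file_path, options,
                                providers=["CUDAExecutionProvider"])
else:
    sess = ort.InferenceSession(model_file_path, options,
                                providers=["CPUExecutionProvider"])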

KevinHuSh avatar May 16 '24 03:05 KevinHuSh

I tried converting the ONNX model to a TRT model and running it on the TensorRT framework; it is faster than both ONNX Runtime and CPU inference. @KevinHuSh
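For anyone who wants to reproduce the conversion, a rough sketch using the TensorRT Python API follows; the file paths, the input tensor name "x", and the shape ranges are assumptions for the deepdoc detection model, not verified values:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace

    # Dynamic input shapes need an optimization profile (min / opt / max).
    profile = builder.create_optimization_profile()
    profile.set_shape("x", (1, 3, 32, 32), (1, 3, 960, 672), (1, 3, 1280, 1280))
    config.add_optimization_profile(profile)

    engine_bytes = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine_bytes)

build_engine("res/deepdoc/det.onnx", "res/deepdoc/det.trt")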

awesomeboy2 avatar Jun 05 '24 08:06 awesomeboy2

Also, it's worth mentioning that when running inference you should use a larger batch size to take advantage of GPU parallelism, rather than running inference one sample at a time.
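A minimal sketch of what batching looks like, with illustrative shapes and the input name "x" (not taken from deepdoc):

import numpy as np

# 16 text-line crops, already preprocessed to the same height and width.
crops = [np.random.random((3, 48, 320)).astype(np.float32) for _ in range(16)]
batch = np.stack(crops, axis=0)  # shape (16, 3, 48, 320)

# One forward pass for the whole batch instead of 16 separate calls,
# e.g. with an onnxruntime session:
# outputs = session.run(None, {"x": batch})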

awesomeboy2 avatar Jun 05 '24 08:06 awesomeboy2

I tried converting the ONNX model to a TRT model and running it on the TensorRT framework; it is faster than both ONNX Runtime and CPU inference. @KevinHuSh

Would you kindly share your converted TRT models here?

StanleyOf427 avatar Jun 06 '24 03:06 StanleyOf427

I tried converting the ONNX model to a TRT model and running it on the TensorRT framework; it is faster than both ONNX Runtime and CPU inference. @KevinHuSh

Would you kindly share your converted TRT models here?

Of course, here is the model file, and below is a dummy demo.

import time
from collections import OrderedDict
from typing import Dict, List, Union

import numpy as np
import pycuda.autoinit  # noqa: F401  (importing this initializes the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def load_engine(path):
    print("Start loading engine")
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    print('Completed loading engine')
    return engine


class OutputAllocator(trt.IOutputAllocator):
    def __init__(self):
        print("[MyOutputAllocator::__init__]")
        super().__init__()
        self.buffers = {}
        self.shapes = {}

    def reallocate_output(self, tensor_name: str, memory: int, size: int, alignment: int) -> int:
        print("[MyOutputAllocator::reallocate_output] TensorName=%s, Memory=%s, Size=%d, Alignment=%d" % (tensor_name, memory, size, alignment))
        if tensor_name in self.buffers:
            del self.buffers[tensor_name]

        address = cuda.mem_alloc(size)
        self.buffers[tensor_name] = address
        return int(address)

    def notify_shape(self, tensor_name: str, shape: trt.Dims):
        print("[MyOutputAllocator::notify_shape] TensorName=%s, Shape=%s" % (tensor_name, shape))
        self.shapes[tensor_name] = tuple(shape)


def get_input_tensor_names(engine: trt.ICudaEngine) -> list[str]:
    input_tensor_names = []
    for binding in engine:
        if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
            input_tensor_names.append(binding)
    return input_tensor_names


def get_output_tensor_names(engine: trt.ICudaEngine) -> list[str]:
    output_tensor_names = []
    for binding in engine:
        if engine.get_tensor_mode(binding) == trt.TensorIOMode.OUTPUT:
            output_tensor_names.append(binding)
    return output_tensor_names


class ProcessorV3:
    def __init__(self, engine: trt.ICudaEngine):
        self.engine = engine
        self.output_allocator = OutputAllocator()
        # create execution context
        self.context = engine.create_execution_context()
        # get input and output tensor names
        self.input_tensor_names = get_input_tensor_names(engine)
        self.output_tensor_names = get_output_tensor_names(engine)

        # create stream
        self.stream = cuda.Stream()
        # Create a CUDA events
        self.start_event = cuda.Event()
        self.end_event = cuda.Event()

    # def __del__(self):
    #     self.cuda_context.pop()

    def get_last_inference_time(self):
        return self.start_event.time_till(self.end_event)

    def infer(self, inputs: Union[Dict[str, np.ndarray], List[np.ndarray], np.ndarray]) -> OrderedDict[str, np.ndarray]:
        """
        inference process:
        1. create execution context
        2. set input shapes
        3. allocate memory
        4. copy input data to device
        5. run inference on device
        6. copy output data to host and reshape
        """
        # set input shapes, the output shapes are inferred automatically

        if isinstance(inputs, np.ndarray):
            inputs = [inputs]
        if isinstance(inputs, dict):
            # order the arrays to match the engine's declared input tensors
            inputs = [inputs[name] for name in self.input_tensor_names]
        if isinstance(inputs, list):
            for name, arr in zip(self.input_tensor_names, inputs):
                self.context.set_input_shape(name, arr.shape)
        buffers_host = []
        buffers_device = []
        # copy input data to device
        for name, arr in zip(self.input_tensor_names, inputs):
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host = cuda.pagelocked_empty(arr.shape, dtype=dtype)
            device = cuda.mem_alloc(host.nbytes)  # size for the engine dtype, not the source array

            host[:] = arr  # casts the input (e.g. float64) to the engine's dtype
            cuda.memcpy_htod_async(device, host, self.stream)
            buffers_host.append(host)
            buffers_device.append(device)
        # set input tensor address
        for name, buffer in zip(self.input_tensor_names, buffers_device):
            self.context.set_tensor_address(name, int(buffer))
        # set output tensor allocator
        for name in self.output_tensor_names:
            self.context.set_tensor_address(name, 0)  # set nullptr
            self.context.set_output_allocator(name, self.output_allocator)
        # The do_inference function will return a list of outputs

        # Record the start event
        self.start_event.record(self.stream)
        # Run inference.
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        # Record the end event
        self.end_event.record(self.stream)

        # self.memory.copy_to_host()

        output_buffers = OrderedDict()
        for name in self.output_tensor_names:
            arr = cuda.pagelocked_empty(self.output_allocator.shapes[name],
                                        dtype=trt.nptype(self.engine.get_tensor_dtype(name)))
            cuda.memcpy_dtoh_async(arr, self.output_allocator.buffers[name], stream=self.stream)
            output_buffers[name] = arr

        # Synchronize the stream
        self.stream.synchronize()

        return output_buffers


if __name__ == "__main__":
    engine = load_engine("res/deepdoc/det.trt")
    processor = ProcessorV3(engine)
    for i in range(100):
        inputs = dict(x=np.random.random([1, 3, 960, 672]).astype(np.float32))  # match the engine's input dtype
        start = time.time()
        outputs = processor.infer(inputs)
        print(outputs)
        print(f"cost time: {time.time() - start}")

awesomeboy2 avatar Jun 07 '24 03:06 awesomeboy2

Emmm, I can't upload my file here; if you need it, you can leave your email.

Here is the TRT model file; you may need to unzip it first.

awesomeboy2 avatar Jun 07 '24 03:06 awesomeboy2

Emmm, I can't upload my file here; if you need it, you can leave your email.

Here is the TRT model file; you may need to unzip it first.

My email is [email protected]. Maybe you can upload it to Hugging Face; that would help a lot more people.

StanleyOf427 avatar Jun 07 '24 04:06 StanleyOf427

I discovered that onnx-runtime didn't support GPU well. So, if you want to try it, just remove the 'False' and let it use GPU.

I discovered that it displays some warning info like:

2024-06-19 17:08:12.808610059 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-06-19 17:08:12.808639007 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-06-19 17:08:13.267821377 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-06-19 17:08:13.267852498 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

after I changed the condition to if ort.get_device() == "GPU": in ocr.py and recognizer.py.

I am not sure if this is the right way to run the project with the GPU. I waited a long time, but it didn't output anything.
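Note that the warning itself says ORT intentionally assigns shape-related ops to the CPU, so the message alone does not mean the GPU is unused. A quick way to check whether the CUDA provider is actually active (the model path below is a placeholder):

import onnxruntime as ort

print(ort.get_device())               # "GPU" if onnxruntime-gpu is installed
print(ort.get_available_providers())  # should list "CUDAExecutionProvider"

sess = ort.InferenceSession("path/to/model.onnx",  # replace with your model
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(sess.get_providers())           # providers actually used by this session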

PureWaterCatt avatar Jun 19 '24 09:06 PureWaterCatt

Well, I used FastAPI to wrap DeepDoc and used the TensorRT framework for inference, which resulted in more than a threefold increase in speed: https://github.com/peakhell/OCRIntegrator
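For context, a minimal sketch of what such a wrapper can look like; the endpoint and the ocr_engine placeholder are illustrative, not the actual OCRIntegrator code:

import cv2
import numpy as np
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def ocr_engine(image: np.ndarray):
    # Placeholder: call a DeepDoc / TensorRT-backed OCR pipeline here and
    # return e.g. a list of (box, text, score) tuples.
    raise NotImplementedError

@app.post("/ocr")
async def run_ocr(file: UploadFile = File(...)):
    data = np.frombuffer(await file.read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    return {"results": ocr_engine(image)}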

peakhell avatar Jun 19 '24 09:06 peakhell

Well, I used FastAPI to wrap DeepDoc and used the TensorRT framework for inference, which resulted in more than a threefold increase in speed: https://github.com/peakhell/OCRIntegrator

Can it be directly integrated into ragflow?

oom2018 avatar Jun 26 '24 14:06 oom2018