
Implement multi-threading to fully utilize computing resources

k4yt3x opened this issue 11 months ago • 8 comments

This ticket tracks the implementation of multi-threading.

Right now only the decoder and encoder are multi-threaded. The processors (Real-ESRGAN, RIFE, etc.) can also be multi-threaded to better utilize the available computing power and VRAM. This requires a major redesign of the processing pipeline. The structure will look something like:

```mermaid
flowchart LR
    A(Decoder Thread) -->|Decoded AVFrames| Q1(Queue)
    Q1 -->|Work stealing| T1(Processor Thread 1)
    Q1 -->|Work stealing| T2(Processor Thread 2)
    Q1 -->|Work stealing| T3(Processor Thread 3)
    T1 -->|Processed AVFrames| Q2(Queue)
    T2 -->|Processed AVFrames| Q2
    T3 -->|Processed AVFrames| Q2
    Q2 --> E(Encoder Thread)
```
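For reference, a minimal C++ sketch of this design under stated assumptions: `Frame` and `process_frame` are hypothetical placeholders for `AVFrame` and the Real-ESRGAN/RIFE inference call, and a real implementation would also need a bounded queue for backpressure and a reordering step before encoding, since workers finish frames out of order.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Placeholder for AVFrame; the index is needed to restore order later.
struct Frame { int64_t index; /* pixel data omitted */ };

// Thread-safe queue shared by the decoder and the processor workers.
class FrameQueue {
public:
    void push(Frame f) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(f)); }
        cv_.notify_one();
    }
    // Blocks until a frame is available; returns std::nullopt once the
    // queue is closed and fully drained.
    std::optional<Frame> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        Frame f = std::move(q_.front());
        q_.pop();
        return f;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<Frame> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Hypothetical stand-in for one inference call (identity transform here).
Frame process_frame(Frame f) { return f; }

int main() {
    FrameQueue decoded, processed;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) {
        workers.emplace_back([&] {
            // Workers "steal" whatever frame is next in the shared queue.
            while (auto f = decoded.pop()) processed.push(process_frame(*f));
        });
    }
    // A decoder thread would push AVFrames into `decoded` here, then:
    decoded.close();
    for (auto& w : workers) w.join();
    processed.close();
    // An encoder thread drains `processed`, reordering by Frame::index.
}
```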

k4yt3x avatar Jan 24 '25 20:01 k4yt3x

That would be totally great. My processors are all cold and don't do anything; only my GPU is working. By the way: thanks for the upload, I will test it.

Pete4K avatar Jan 25 '25 09:01 Pete4K

Would it be an idea to combine TensorRT and NCNN for efficient inference across multiple GPUs for even better speed? I don't know whether TensorRT works with this.

Pete4K avatar Jan 25 '25 12:01 Pete4K

It seems that TensorRT could possibly make Real-ESRGAN x4plus faster: https://github.com/yuvraj108c/ComfyUI-Upscaler-Tensorrt

Pete4K avatar Jan 25 '25 12:01 Pete4K

> My processors are all cold and don't do anything; only my GPU is working.

I don't think I'll do multi-GPU support just yet. The workload will still be on one GPU.

> Would it be an idea to combine TensorRT and NCNN for efficient inference across multiple GPUs for even better speed?

TensorRT only works on NVIDIA GPUs. If we need to support it, then we'll need to support multiple backends simultaneously and dynamically select which one to use at runtime. We'll also need to include multiple versions of the models. I don't think that's ideal. This better belongs under #1231.
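For illustration only, a rough sketch of what that runtime backend selection might look like. All names here are hypothetical (video2x does not implement this), and as noted above, each backend would also need its own set of converted model files:

```cpp
#include <memory>

// Hypothetical abstraction over the inference libraries.
struct InferenceBackend {
    virtual ~InferenceBackend() = default;
    virtual void process() = 0;  // stand-in for one AVFrame inference call
};

struct NcnnBackend : InferenceBackend {
    void process() override { /* ncnn/Vulkan path: runs on most GPUs */ }
};

struct TensorRtBackend : InferenceBackend {
    void process() override { /* TensorRT path: NVIDIA GPUs only */ }
};

// Chosen once at startup based on the detected GPU vendor.
std::unique_ptr<InferenceBackend> select_backend(bool nvidia_gpu_present) {
    if (nvidia_gpu_present) {
        return std::make_unique<TensorRtBackend>();
    }
    return std::make_unique<NcnnBackend>();
}
```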

k4yt3x avatar Jan 26 '25 23:01 k4yt3x

Sorry, I didn't mean multi-GPU support. I only meant that implementing multi-threading would be a great idea.

Pete4K avatar Jan 27 '25 06:01 Pete4K

OK, when the models are supported, that's the best thing.

Pete4K avatar Jan 27 '25 06:01 Pete4K

Hey @k4yt3x, I was trying out the project using the Real-ESRGAN model at a 2x scale and noticed that upscaling lower resolutions on an RTX 4070 SUPER was fairly slow, using only 20% to 26% of the GPU at 3 to 6 frames per second.

Could this be related to the current inference process not being multi-threaded yet?

kitsumed avatar Apr 02 '25 17:04 kitsumed

I implemented a rough mod similar to what your diagram shows, but without the queue for asynchronous hand-off, and I'm starting to think that Real-ESRGAN is just very compute-bottlenecked. My XTX showed a bump from ~4 FPS to ~5 FPS with batch size 2.

VRAM usage is barely a concern, which I found surprising at first. But the weights, etc. aren't really all that big.

NinjaPerson24119 avatar Jun 18 '25 07:06 NinjaPerson24119