
Implement multi-threading to fully utilize computing resources

k4yt3x opened this issue 11 months ago • 8 comments

This ticket tracks the implementation of multi-threading.

Right now only the decoder and encoder are multi-threaded. The processors (Real-ESRGAN, RIFE, etc.) can also be multi-threaded to better utilize the available computing power and VRAM. This requires a major redesign of the processing pipeline. The structure will look something like:

```mermaid
flowchart LR
    A(Decoder Thread) -->|Decoded AVFrames| Q1(Queue)
    Q1 -->|Work stealing| T1(Processor Thread 1)
    Q1 -->|Work stealing| T2(Processor Thread 2)
    Q1 -->|Work stealing| T3(Processor Thread 3)
    T1 -->|Processed AVFrames| Q2(Queue)
    T2 -->|Processed AVFrames| Q2
    T3 -->|Processed AVFrames| Q2
    Q2 --> E(Encoder Thread)
```
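For reference, a minimal C++ sketch of this design under stated assumptions: `Frame` and `process_frame` are hypothetical placeholders for `AVFrame` and the Real-ESRGAN/RIFE inference call, and a real implementation would also need a bounded queue for backpressure and a reordering step before encoding, since workers finish frames out of order.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Placeholder for AVFrame; the index is needed to restore order later.
struct Frame { int64_t index; /* pixel data omitted */ };

// Thread-safe queue shared by the decoder and the processor workers.
class FrameQueue {
public:
    void push(Frame f) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(f)); }
        cv_.notify_one();
    }
    // Blocks until a frame is available; returns std::nullopt once the
    // queue is closed and fully drained.
    std::optional<Frame> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        Frame f = std::move(q_.front());
        q_.pop();
        return f;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<Frame> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Hypothetical stand-in for one inference call (identity transform here).
Frame process_frame(Frame f) { return f; }

int main() {
    FrameQueue decoded, processed;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; i++) {
        workers.emplace_back([&] {
            // Workers "steal" whatever frame is next in the shared queue.
            while (auto f = decoded.pop()) processed.push(process_frame(*f));
        });
    }
    // A decoder thread would push AVFrames into `decoded` here, then:
    decoded.close();
    for (auto& w : workers) w.join();
    processed.close();
    // An encoder thread drains `processed`, reordering by Frame::index.
}
```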

k4yt3x avatar Jan 24 '25 20:01 k4yt3x

That would be totally great. My processors are all cold and don't do anything; only my GPU is working. By the way: thanks for the upload, I will test it.

Pete4K avatar Jan 25 '25 09:01 Pete4K

Would it be an idea to combine TensorRT and NCNN for efficient inference across multiple GPUs for even better speed? I don't know whether TensorRT works with this.

Pete4K avatar Jan 25 '25 12:01 Pete4K

It seems that TensorRT could possibly make Real-ESRGAN x4plus faster: https://github.com/yuvraj108c/ComfyUI-Upscaler-Tensorrt

Pete4K avatar Jan 25 '25 12:01 Pete4K

> My processors are all cold and don't do anything; only my GPU is working.

I don't think I'll do multi-GPU support just yet. The workload will still be on one GPU.

> Would it be an idea to combine TensorRT and NCNN for efficient inference across multiple GPUs for even better speed?

TensorRT only works on NVIDIA GPUs. If we need to support it, then we'll need to support multiple backends simultaneously and dynamically select which one to use at runtime. We'll also need to include multiple versions of the models. I don't think that's ideal. This better belongs under #1231.
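For illustration only, a rough sketch of what that runtime backend selection might look like. All names here are hypothetical (video2x does not implement this), and as noted above, each backend would also need its own set of converted model files:

```cpp
#include <memory>

// Hypothetical abstraction over the inference libraries.
struct InferenceBackend {
    virtual ~InferenceBackend() = default;
    virtual void process() = 0;  // stand-in for one AVFrame inference call
};

struct NcnnBackend : InferenceBackend {
    void process() override { /* ncnn/Vulkan path: runs on most GPUs */ }
};

struct TensorRtBackend : InferenceBackend {
    void process() override { /* TensorRT path: NVIDIA GPUs only */ }
};

// Chosen once at startup based on the detected GPU vendor.
std::unique_ptr<InferenceBackend> select_backend(bool nvidia_gpu_present) {
    if (nvidia_gpu_present) {
        return std::make_unique<TensorRtBackend>();
    }
    return std::make_unique<NcnnBackend>();
}
```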

k4yt3x avatar Jan 26 '25 23:01 k4yt3x

Sorry, I didn't mean multi-GPU support. I only meant that implementing multi-threading would be a great idea.

Pete4K avatar Jan 27 '25 06:01 Pete4K

OK, when the models are supported, that's the best thing.

Pete4K avatar Jan 27 '25 06:01 Pete4K

Hey @k4yt3x, I was trying out the project using the Real-ESRGAN model at a 2x scale and noticed that upscaling lower resolutions on an RTX 4070 SUPER was fairly slow, using only 20% to 26% of the GPU at 3 to 6 frames per second.

Could this be related to the current inference process not being multi-threaded yet?

kitsumed avatar Apr 02 '25 17:04 kitsumed

I implemented a rough mod similar to what your diagram shows, but without the queue for asynchronous hand-off, and I'm starting to think that Real-ESRGAN is just very compute-bottlenecked. My XTX showed a bump from ~4 FPS to ~5 FPS with batch size 2.

VRAM usage is barely a concern, which I found surprising at first. But the weights, etc. aren't really all that big.

NinjaPerson24119 avatar Jun 18 '25 07:06 NinjaPerson24119