Real-ESRGAN
Faster video inference script.
Changes:
- Moved the final scaling and uint8 quantization to the GPU, reducing CPU and main memory bandwidth consumption (Lines 225-227). 2.5x speed-up.
- Instructed FFMPEG to use RGB frames instead of BGR, so there is no need to swap channels (Lines 70 and 148).
- Batched inference (controlled by the `--batch` parameter, default is 4). Pushed CUDA GPU utilization to 100%.
- Instructed torch to make tensors contiguous after the BCHW -> BHWC transform on the GPU (Line 227), so there is no need to copy the buffer before writing to FFMPEG (Line 167). Reduced output IO time by 10x. (See the sketch below.)
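A minimal sketch of the batched, GPU-side post-processing described above (the function and variable names are illustrative, not the script's actual code; the model is assumed to be FP16 on the GPU and to output values in [0, 1]):

```python
import numpy as np
import torch

def enhance_batch(model, frames, device="cuda"):
    """Upscale a list of HWC uint8 RGB frames as one batch.

    Scaling, uint8 quantization and the BCHW -> BHWC reorder all happen on the GPU;
    the returned array can be written to FFMPEG's stdin without further copies.
    """
    # Stack frames into a BHWC uint8 batch and move it to the GPU
    batch = torch.from_numpy(np.stack(frames)).to(device)        # (B, H, W, C) uint8
    batch = batch.permute(0, 3, 1, 2).half().div_(255.0)         # (B, C, H, W) in [0, 1]

    with torch.no_grad():
        out = model(batch)                                        # (B, C, sH, sW)

    # Quantize and reorder on the GPU; .contiguous() lets the CPU-side buffer
    # be handed straight to the FFMPEG pipe without an extra copy.
    out = out.clamp_(0, 1).mul_(255.0).round_().to(torch.uint8)
    out = out.permute(0, 2, 3, 1).contiguous()
    return out.cpu().numpy()
```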
The metrics above are measured on a 1920x1080 30 fps anime video. On an AMD R9-5900HX CPU (8 cores, 16 threads) and an RTX 3080 Laptop GPU (16 GB), with FP16, the processing rate goes from 0.8 fps to 4.6 fps with the optimizations (a 5.75x speed-up!), using about 7.6 GB of VRAM. You can also get 4.4 fps (a 5.5x speed-up) at batch size 2, which requires about 4.4 GB of VRAM.
The script is not yet extensively tested (I'm not sure how best to test it and would appreciate advice), and it does not support extracting frames first, face enhance, alpha, or grayscale images. Frame extraction and face enhance go through very different workflows, so the optimizations may not be applicable there. Alpha and grayscale should not be an issue for almost all videos that would be processed.
See #619, #634, #531.
Tested. This really works! Thanks!!
Test results (480p, upscale parameter 2): from 5-7 frames/s (original code) to about 30 frames/s on average, with the same output quality. GPU 3D usage went from 30% (mostly CPU-bound) to 95-100%.
Hmm, I have no idea why my results stay the same. My run was ESRNet_4xplus, nproc=1, [480x270] --> [1920x1080], noFace, noExtractFirst, on 4x A40 (48 GB VRAM each) with 52 CPU cores and 920 GB RAM. Both the old and the new inference script give ~3 frames/s.
Yes, I noticed that the batch approach boosts GPU utilization significantly (around 23 GB and 100% on each GPU, compared to just 4 GB and ~60%). I didn't measure CPU in detail, but htop shows it's about the same. I also tried different models and configs with different batch sizes, but again the difference is only ~0.5 frames/s. Would love to hear some thoughts if possible.
Could you attach the video for some analysis here?
25% faster for me!
May I ask a question? Do you know why, without --fp32, the output is just white noise? (Also on the main branch; AMD ROCm.)
I am running the animevideov3 model without --fp32 and the outputs are correct. Could you please provide more details about your setup? I don't have ROCm available, and there may be flaws in some of its APIs with FP16, as it is relatively new and not as mature as CUDA. As a suggestion for debugging it yourself: record the output of each layer in the network on the same input in both modes, FP16 and FP32, and compare them. If all of them are very different, there is probably a problem with ROCm on your hardware; if they only start to diverge after a specific layer, you may be running into precision issues and you can't do much without changing the model.
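For reference, a rough way to do that layer-by-layer comparison with forward hooks (a generic sketch, not part of the script; `model` and `img` are placeholders):

```python
import torch

def record_layer_outputs(model, x):
    """Run `model` on `x` and record every module's output via forward hooks."""
    outputs, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                outputs[name] = output.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return outputs

# Compare FP32 vs FP16 layer by layer on the same input:
# out32 = record_layer_outputs(model.float(), img.float())
# out16 = record_layer_outputs(model.half(), img.half())
# for name, ref in out32.items():
#     if name in out16:
#         print(name, (ref - out16[name]).abs().max().item())
```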
> I am running the animevideov3 model without --fp32 and the outputs are correct.
Sorry, I tried with and without --fp32 and there's no difference; the whole output is white either way.
> Could you please provide more details about your setup?
Just running `python inference_video_fast.py`, with or without --fp32, with the general x4v3 model (the tiny denoise one; the master branch works fine with fp32).
This command is working fine on my machine:
python inference_realesrgan_video_fast.py --model_name=realesr-general-x4v3 -i "videos\2022-12-24 17-53-30.mp4" -s 2
Did I understand your input correctly?
> Did I understand your input correctly?
I think you are right. I did it with -dn 0; I will try to use it again without it.
It also works here with -dn 0.
> It also works here with -dn 0.
So I guess I need to dig in and debug it...
Wait, I did use the no-nb_frames video patch, and my input is a webm file. (The original script still works.)
Okay, testing with the demo.mp4, I found that fp16 keeps some detail while fp32 is just color blocks...
FYI, I observe that `torch.compile` + channels_last provides a 2x speedup (no tiling, no face enhancing, fp16) on an NVIDIA A4000.
self.model = self.model.to(memory_format=torch.channels_last)
self.model = torch.compile(self.model)
Might be worth a shot?
I'm not sure how well `torch.compile` fits with the other features (face enhance, tiling), nor what the end-to-end speedup is in those cases (tiling and face enhance add CPU overheads).
Thanks for your comments! Which model are you using? On my side, using channels_last seems to reduce performance by half. `torch.compile` is generally helpful for performance, as it generates optimized code at the kernel level.
> Thanks for your comments! Which model are you using? On my side, using channels_last seems to reduce performance by half.
"Officlal" Real-ESRGAN x4 I suspect channel_last / channel_first gain will vary by device? Without channel_last, I get about 1.5x speedup on A4000.
https://github.com/pytorch/pytorch/issues/92542 I guess RRDB-based networks and VGG-based networks have different preferences for channel formats.
You could also add an option to change the default libx264 encoder to h264_nvenc for FFMPEG, which would give an additional performance boost. It requires an FFMPEG build compiled with CUDA support, hence keeping it as an option.
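A rough sketch of what such an option could look like when building the writer command (the function name, flag layout, and defaults here are illustrative, not the script's actual code):

```python
import subprocess

def build_ffmpeg_writer_cmd(width, height, fps, out_path, encoder="libx264"):
    """Build an FFMPEG command that reads raw RGB frames from stdin and encodes them.

    Pass encoder="h264_nvenc" to use NVIDIA's hardware encoder
    (requires an FFMPEG build with NVENC/CUDA support).
    """
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",                 # raw frames arrive on stdin
        "-c:v", encoder,
        "-pix_fmt", "yuv420p",
        out_path,
    ]

# writer = subprocess.Popen(
#     build_ffmpeg_writer_cmd(1920, 1080, 30, "out.mp4", encoder="h264_nvenc"),
#     stdin=subprocess.PIPE)
```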
How can I use this on images instead of videos?