Real-ESRGAN
Faster video inference script.
Changes:
- Moved the final scaling and uint8 quantization to the GPU, reducing CPU and main memory bandwidth consumption (Lines 225-227). 2.5x speed-up.
- Instructed FFMPEG to use RGB frames instead of BGR, so there is no need to swap channels (Lines 70 and 148).
- Batched inference (controlled by the `--batch` parameter, default is 4). Pushed CUDA GPU utilization to 100%.
- Instructed torch to make tensors contiguous after the BCHW -> BHWC transform on the GPU (Line 227), so there is no need to copy the buffer before writing to FFMPEG (Line 167). Reduced output IO time by 10x. (See the sketch below.)
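A minimal sketch of the batched, GPU-side post-processing described above (the function and variable names are illustrative, not the script's actual code; the model is assumed to be FP16 on the GPU and to output values in [0, 1]):

```python
import numpy as np
import torch

def enhance_batch(model, frames, device="cuda"):
    """Upscale a list of HWC uint8 RGB frames as one batch.

    Scaling, uint8 quantization and the BCHW -> BHWC reorder all happen on the GPU;
    the returned array can be written to FFMPEG's stdin without further copies.
    """
    # Stack frames into a BHWC uint8 batch and move it to the GPU
    batch = torch.from_numpy(np.stack(frames)).to(device)        # (B, H, W, C) uint8
    batch = batch.permute(0, 3, 1, 2).half().div_(255.0)         # (B, C, H, W) in [0, 1]

    with torch.no_grad():
        out = model(batch)                                        # (B, C, sH, sW)

    # Quantize and reorder on the GPU; .contiguous() lets the CPU-side buffer
    # be handed straight to the FFMPEG pipe without an extra copy.
    out = out.clamp_(0, 1).mul_(255.0).round_().to(torch.uint8)
    out = out.permute(0, 2, 3, 1).contiguous()
    return out.cpu().numpy()
```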
The metrics above are measured on a 1920x1080 30 fps anime video. On an AMD R9-5900HX CPU (8 cores, 16 threads) and an RTX 3080 Laptop GPU (16 GB), with FP16, the processing rate goes from 0.8 fps to 4.6 fps with the optimizations (a 5.75x speed-up!), using about 7.6 GB of VRAM. You can also get 4.4 fps (a 5.5x speed-up) at batch size 2, which requires about 4.4 GB of VRAM.
The script is not yet extensively tested (I'm not sure how best to test it and would appreciate advice), and it does not support extracting frames first, face enhance, alpha, or grayscale images. Frame extraction and face enhance go through very different workflows, so the optimizations may not be applicable there. Alpha and grayscale should not be an issue for almost all videos that would be processed.
See #619, #634, #531.
Tested. This really works! Thanks!!
Test results (480p, upscale parameter 2): from 5-7 frames/s (original code) to about 30 frames/s on average, with the same output quality. GPU 3D usage went from 30% (mostly CPU-bound) to 95-100%.
Hmm, I have no idea why my results stay the same. My run was ESRNet_4xplus, nproc=1, [480x270] --> [1920x1080], noFace, noExtractFirst, on 4x A40 (48 GB VRAM each) with 52 CPU cores and 920 GB RAM. Both the old and the new inference script give ~3 frames/s.
Yes, I noticed that the batch approach boosts GPU utilization significantly (around 23 GB and 100% on each GPU, compared to just 4 GB and ~60%). I didn't measure CPU in detail, but htop shows it's about the same. I also tried different models and configs with different batch sizes, but again the difference is only ~0.5 frames/s. Would love to hear some thoughts if possible.
Could you attach the video for some analysis here?
25% faster for me!
May I ask a question? Do you know why, without --fp32, the output is just white noise? (Also on the main branch; AMD ROCm.)
I am running the animevideov3 model without --fp32 and the outputs are correct. Could you please provide more details about your setup? I don't have ROCm available, and there may be flaws in some of its APIs with FP16, as it is relatively new and not as mature as CUDA. As a suggestion for debugging it yourself: record the output of each layer in the network on the same input in both modes, FP16 and FP32, and compare them. If all of them are very different, there is probably a problem with ROCm on your hardware; if they only start to diverge after a specific layer, you may be running into precision issues and you can't do much without changing the model.
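For reference, a rough way to do that layer-by-layer comparison with forward hooks (a generic sketch, not part of the script; `model` and `img` are placeholders):

```python
import torch

def record_layer_outputs(model, x):
    """Run `model` on `x` and record every module's output via forward hooks."""
    outputs, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                outputs[name] = output.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return outputs

# Compare FP32 vs FP16 layer by layer on the same input:
# out32 = record_layer_outputs(model.float(), img.float())
# out16 = record_layer_outputs(model.half(), img.half())
# for name, ref in out32.items():
#     if name in out16:
#         print(name, (ref - out16[name]).abs().max().item())
```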
> I am running the animevideov3 model without --fp32 and the outputs are correct.
Sorry, I tried with and without --fp32 and there's no difference; the whole output is white either way.
> Could you please provide more details about your setup?
Just running `python inference_video_fast.py`, with or without --fp32, with the general x4v3 model (the tiny denoise one; the master branch works fine with fp32).
This command is working fine on my machine:
python inference_realesrgan_video_fast.py --model_name=realesr-general-x4v3 -i "videos\2022-12-24 17-53-30.mp4" -s 2
Did I understand your input correctly?
> Did I understand your input correctly?
I think you are right. I did it with -dn 0; I will try to use it again without it.
It also works here with -dn 0.
> It also works here with -dn 0.
So I guess I need to dig in and debug it...
Wait, I did use the no-nb_frames video patch, and my input is a webm file. (The original script still works.)
Okay, testing with the demo.mp4, I found that fp16 keeps some detail while fp32 is just color blocks...
FYI, I observe that `torch.compile` + channels_last provides a 2x speedup (no tiling, no face enhancing, fp16) on an NVIDIA A4000.
self.model = self.model.to(memory_format=torch.channels_last)
self.model = torch.compile(self.model)
Might be worth a shot?
I'm not sure how well `torch.compile` fits with the other features (face enhance, tiling), nor what the end-to-end speedup is in those cases (tiling and face enhance add CPU overheads).
Thanks for your comments! Which model are you using? On my side, using channels_last seems to reduce performance by half. `torch.compile` is generally helpful for performance, as it generates optimized code at the kernel level.
> Thanks for your comments! Which model are you using? On my side, using channels_last seems to reduce performance by half.
"Officlal" Real-ESRGAN x4 I suspect channel_last / channel_first gain will vary by device? Without channel_last, I get about 1.5x speedup on A4000.
https://github.com/pytorch/pytorch/issues/92542 I guess RRDB-based networks and VGG-based networks have different preferences for channel formats.
You could also add an option to change the default libx264 encoder to h264_nvenc for FFMPEG, which would give an additional performance boost. It requires an FFMPEG build compiled with CUDA support, hence keeping it as an option.
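A rough sketch of what such an option could look like when building the writer command (the function name, flag layout, and defaults here are illustrative, not the script's actual code):

```python
import subprocess

def build_ffmpeg_writer_cmd(width, height, fps, out_path, encoder="libx264"):
    """Build an FFMPEG command that reads raw RGB frames from stdin and encodes them.

    Pass encoder="h264_nvenc" to use NVIDIA's hardware encoder
    (requires an FFMPEG build with NVENC/CUDA support).
    """
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",                 # raw frames arrive on stdin
        "-c:v", encoder,
        "-pix_fmt", "yuv420p",
        out_path,
    ]

# writer = subprocess.Popen(
#     build_ffmpeg_writer_cmd(1920, 1080, 30, "out.mp4", encoder="h264_nvenc"),
#     stdin=subprocess.PIPE)
```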
How can I use this on images instead of videos?