Slow render speeds

Open winstonyue opened this issue 1 year ago • 12 comments

Hi, I am running this command to upscale 720p anime:

CUDA_VISIBLE_DEVICES=0 python inference_realesrgan_video.py -i inputs/video/onepiece_demo.mp4 -n realesr-animevideov3 -s 2 --suffix outx2 --num_process_per_gpu 2

The model generates 3.5 frames per second on a 3080 Ti. Is this normal? The GitHub page says the anime video v3 model runs at 22.6 fps at 720p. I am running everything in Anaconda; does this reduce performance? How can I make it faster?

winstonyue avatar May 01 '23 07:05 winstonyue

following

MaxTran96 avatar Jun 03 '23 08:06 MaxTran96

@winstonyue any luck? New user here, trying to verify that the GPU is being used rather than the CPU. @MaxTran96

cbroker1 avatar Jun 04 '23 13:06 cbroker1

So I noticed that the frames-per-second counter isn't showing the total; I think it's displaying the frames generated per process on the GPU. When I set num_process_per_gpu to 4 or higher, the per-process fps dropped slightly, but my overall render time was reduced significantly. Aggregate throughput was much higher, maybe around 20 fps, and my GPU usage was nearing 100 percent as well.

winstonyue avatar Jun 07 '23 03:06 winstonyue

As for upscaling 1920x1080 videos 4x (then downscaling 2x to produce 4K video), a lot of time goes into converting the fp16 output to fp32 and quantizing to uint8 on the CPU. By moving this to the GPU in utils.py:

# scale, clamp and quantize on the GPU, then transfer only the uint8 result to the CPU
output_img = (output_img.data.squeeze() * 255.0).clamp_(0, 255).byte().cpu().numpy()

I can double to triple the processing frame rate, but it is still slow (from under 1 fps to 2.2 fps on a 3080). num_process_per_gpu also does not utilize the GPU efficiently. I think the correct way to go is to batch the inference, but I haven't implemented it yet.
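For reference, a minimal before/after sketch of the change described above, assuming the stock tensor-to-image path in realesrgan/utils.py looks roughly like the "before" version (exact surrounding code may differ):

```python
import torch

# Before (approximate stock path): transfer the float tensor to the CPU first,
# then clamp/scale/quantize there -- both the fp16->fp32 conversion and the
# uint8 quantization run on the CPU.
def to_uint8_cpu(output: torch.Tensor):
    img = output.data.squeeze().float().cpu().clamp_(0, 1).numpy()
    return (img * 255.0).round().astype('uint8')

# After: scale, clamp and quantize on the GPU, then transfer only the
# already-quantized uint8 tensor (4x less data than fp32, 2x less than fp16).
def to_uint8_gpu(output: torch.Tensor):
    return (output.data.squeeze() * 255.0).clamp_(0, 255).byte().cpu().numpy()
```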

eliphatfs avatar Jun 27 '23 14:06 eliphatfs

An experimental implementation of this is available at https://github.com/eliphatfs/Real-ESRGAN/blob/master/inference_realesrgan_video_fast.py. It reduces IO overhead and implements batch inference. On my 3080 LP it boosts 1080p-to-4K upscaling from 0.8 fps to 4 fps with the default batch size of 4 (a 5x speed-up!). The correctness is not yet verified systematically, but it looks good.
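For readers who don't want to dig through the script, the core batching idea looks roughly like this (a sketch, not the actual script; model construction, frame decoding, and BGR/RGB channel handling are assumed to happen elsewhere):

```python
import numpy as np
import torch

@torch.no_grad()
def upscale_batched(frames, model, device, batch_size=4):
    """Upscale a list of HxWx3 uint8 frames in GPU batches."""
    results = []
    for i in range(0, len(frames), batch_size):
        batch = np.stack(frames[i:i + batch_size])     # (B, H, W, 3) uint8
        x = torch.from_numpy(batch).to(device)
        x = x.permute(0, 3, 1, 2).float().div_(255.0)  # (B, 3, H, W) in [0, 1]
        y = model(x)                                   # one forward pass per batch
        y = (y * 255.0).clamp_(0, 255).byte()          # quantize on the GPU
        results.extend(list(y.permute(0, 2, 3, 1).cpu().numpy()))
    return results
```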

eliphatfs avatar Jun 27 '23 15:06 eliphatfs

@eliphatfs to use your `_fast.py` script, can I simply drop it into the vanilla Real-ESRGAN repo? I've tried that and am running into this error:

[screenshot of the error]

edit: maybe I should just download and run from your repo? Thanks.

cbroker1 avatar Jun 27 '23 16:06 cbroker1

I did a few more tweaks and the frame rate is now 4.1 to 4.2 fps on my 3080 LP. For my test video, about 10% of the time is still spent on IO, which could be made asynchronous with the main thread (sketched below). But I think that is fair.
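The asynchronous IO mentioned here could look roughly like the following: a background thread decodes frames into a bounded queue while the main thread runs inference (a sketch, assuming a cv2.VideoCapture-style reader; `process` is a hypothetical inference call):

```python
import queue
import threading

import cv2

def start_reader(path, maxsize=16):
    """Decode frames on a background thread so IO overlaps with inference."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            q.put(frame)   # blocks when the queue is full (backpressure)
        cap.release()
        q.put(None)        # sentinel: end of stream

    threading.Thread(target=worker, daemon=True).start()
    return q

# Main thread pulls decoded frames while the GPU works on earlier ones:
# frames = start_reader('inputs/video/onepiece_demo.mp4')
# while (frame := frames.get()) is not None:
#     process(frame)
```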

eliphatfs avatar Jun 28 '23 00:06 eliphatfs

> @eliphatfs to use your `_fast.py` script, can I simply drop it into the vanilla Real-ESRGAN repo? I've tried that and am running into this error:
>
> [screenshot of the error]
>
> edit: maybe I should just download and run from your repo? Thanks.

Sorry, I had added some instrumentation for profiling the code, to analyze where the time is spent. I have removed it now.

eliphatfs avatar Jun 28 '23 00:06 eliphatfs

P.S. The script does not yet support alpha channels or grayscale; only RGB videos are supported (I think this covers 99% of modern videos). extract_frame_first may also be problematic, since PNG files may have alpha channels. Maybe I will add RGBA/L support sometime later, or maybe I will just drop these parameters. Face enhance is also not supported, since it uses a different model.
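Until RGBA/L support lands, one workaround is to normalize frames to 3-channel BGR before feeding them in (a sketch using OpenCV; note that simply dropping alpha loses transparency):

```python
import cv2

def to_bgr(frame):
    """Coerce grayscale (HxW) or BGRA (HxWx4) frames to 3-channel BGR."""
    if frame.ndim == 2:          # grayscale
        return cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
    if frame.shape[2] == 4:      # BGRA: drop the alpha channel
        return cv2.cvtColor(frame, cv2.COLOR_BGRA2BGR)
    return frame                 # already 3-channel
```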

eliphatfs avatar Jun 28 '23 00:06 eliphatfs

> An experimental implementation of this is available at https://github.com/eliphatfs/Real-ESRGAN/blob/master/inference_realesrgan_video_fast.py. It reduces IO overhead and implements batch inference. On my 3080 LP it boosts 1080p-to-4K upscaling from 0.8 fps to 4 fps with the default batch size of 4 (a 5x speed-up!). The correctness is not yet verified systematically, but it looks good.

Awesome patch. I tested it with my 4070 Ti and the realesr-general-x4v3 model: from 2.12 fps to 3.9 fps. I have yet to tune num_process_per_gpu or anything else.

w-tim avatar Sep 18 '23 14:09 w-tim

> An experimental implementation of this is available at https://github.com/eliphatfs/Real-ESRGAN/blob/master/inference_realesrgan_video_fast.py. It reduces IO overhead and implements batch inference. On my 3080 LP it boosts 1080p-to-4K upscaling from 0.8 fps to 4 fps with the default batch size of 4 (a 5x speed-up!). The correctness is not yet verified systematically, but it looks good.

Could you provide the same for inference_realesrgan.py, to accelerate image inference? I spent an hour prompting GPT-4 with your inference_realesrgan_video_fast.py as a starting point, but it failed.
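In principle the same batching idea transfers to still images, provided same-sized images are grouped so they can be stacked into one tensor. A sketch of the grouping step (hypothetical helper, not part of either repo):

```python
from collections import defaultdict

def group_by_shape(images):
    """Bucket images by (H, W) so each bucket can be stacked and batched."""
    buckets = defaultdict(list)
    for idx, img in enumerate(images):
        buckets[img.shape[:2]].append((idx, img))
    return buckets

# Each bucket can then go through a batched upscaler such as the
# upscale_batched() sketch earlier in this thread, with results written
# back in the original order via the saved indices.
```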

w-tim avatar Sep 18 '23 17:09 w-tim

To fully extract the GPU's performance, I have created a GitHub repo (https://github.com/Kiteretsu77/FAST_Anime_VSR). I implemented a TensorRT version and a self-designed frame-division algorithm to accelerate it, with a video-redundancy skip mechanism (similar to inter-prediction in video compression) and a momentum mechanism. I also use FFmpeg to decode at a lower FPS for faster processing; the quality drop is extremely negligible. In addition, I use multiprocessing and multithreading to consume all available computation resources. Feel free to look at these slides [https://docs.google.com/presentation/d/1Gxux9MdWxwpnT4nDZln8Ip_MeqalrkBesX34FVupm2A/edit#slide=id.p] for the implementation and algorithms I used. On my desktop 3060 Ti, my code can process 480p anime video input faster than real time.
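The redundancy-jump idea is roughly: when a frame barely differs from the previous one, reuse the previous upscaled output instead of running the model again. A minimal sketch using a mean-absolute-difference threshold (the actual repo's algorithm is more involved; `upscale` is a hypothetical per-frame inference call):

```python
import numpy as np

def upscale_with_skip(frames, upscale, threshold=2.0):
    """Skip inference on frames nearly identical to the previous input."""
    prev_in, prev_out = None, None
    for frame in frames:
        if prev_in is not None:
            # int16 avoids uint8 wrap-around when subtracting
            diff = np.abs(frame.astype(np.int16) - prev_in.astype(np.int16)).mean()
            if diff < threshold:   # redundant frame: reuse the last output
                yield prev_out
                continue
        prev_out = upscale(frame)  # run the expensive model
        prev_in = frame
        yield prev_out
```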

Kiteretsu77 avatar Sep 18 '23 18:09 Kiteretsu77