
RIFE Optimisation for GeForce RTX 3090 and future GeForce RTX 4090

Open AIVFI opened this issue 2 years ago • 2 comments

Many thanks, hzwer, for creating the world's first machine-learning-based video frame interpolation method capable of real-time interpolation of HD files!

As confirmed by two people on the SVP forums, NVIDIA GeForce RTX 3070 Ti and NVIDIA GeForce RTX 3090 graphics cards allow real-time interpolation of 720p files without downscaling (scale=1.0) using the RIFE algorithm.

Interestingly, the tests showed that the NVIDIA GeForce RTX 3090 has the processing power to interpolate even 1080p files in real time, which is probably the dream of any movie lover who wants to see movies as if they were live.

But for some reason, the RIFE algorithm doesn't take full advantage of the most powerful consumer graphics card on the market today: NVIDIA GeForce RTX 3090.

Could you, hzwer, please look at the test results and tell us what the bottleneck is, and whether there is any way to improve the RIFE algorithm to better utilize the resources of the NVIDIA GeForce RTX 3090 and future, even faster graphics cards?

Testing was done with two different implementations of RIFE:

  1. vs-rife https://github.com/HolyWu/vs-rife + SVP https://www.svp-team.com/wiki/RIFE_AI_interpolation

  2. flowframes https://github.com/n00mkrad/flowframes

Test results:

vs-rife + SVP, real-time playback with x2 interpolation (RIFE model 3.8, scale=1.0):

FP16 720p: CUDA ~40%, SVP index 1.0 (SUCCESS)
FP16 1080p: CUDA jumps between 35% and 51%, SVP index N/A (FAILURE)

https://www.svp-team.com/forum/viewtopic.php?pid=79497#p79497

This test alone shows that the spare CUDA computing power should allow real-time interpolation of a 1080p file. More thorough re-encoding tests confirm that CUDA is indeed not the bottleneck:

vs-rife + SVP, re-encoding with x2 interpolation (RIFE model 3.8, scale=1.0):

| Resolution | Precision | FPS  | CUDA load |
|------------|-----------|------|-----------|
| 720p       | FP16      | 63.5 | 56%       |
| 720p       | FP32      | 69.8 | 58%       |
| 1080p      | FP16      | 26.9 | 62%       |
| 1080p      | FP32      | 28.1 | 66%       |

https://www.svp-team.com/forum/viewtopic.php?pid=79526#p79526

If anyone checks out the above link, let me point out right away that the Test-Time Augmentation parameter only changes the VapourSynth filter used when implementing RIFE in SVP: https://www.svp-team.com/forum/viewtopic.php?pid=79023#p79023

The FP16 test results above show no advantage over FP32. Additionally, the CUDA load peaks at 66% for 1080p files, whereas during re-encoding one would expect a load of 90-100%.
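A sanity check for the "FP16 shows no advantage" observation is to time FP32 and FP16 forward passes directly in PyTorch, which RIFE is built on. This is only a sketch: the small conv stack below is a hypothetical stand-in for RIFE's network, not the real model, and absolute numbers will differ from SVP/Flowframes pipelines.

```python
import time

import torch


def benchmark(model, x, iters=20):
    """Return forward passes per second for `model` on input `x`."""
    with torch.no_grad():
        for _ in range(3):  # warm-up so one-time setup is not timed
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()  # CUDA launches are async; wait first
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Small conv stack as a hypothetical stand-in for RIFE's network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
).to(device).eval()
x = torch.randn(1, 3, 96, 96, device=device)  # tiny frame for the demo

fps32 = benchmark(model, x)
# FP16 convolutions are only well supported on GPU, so guard the half run.
fps16 = benchmark(model.half(), x.half()) if device == "cuda" else fps32
print(f"FP32: {fps32:.1f} it/s  FP16: {fps16:.1f} it/s")
```

If FP16 and FP32 land close together even in this isolated loop, the model is likely not compute-bound at that resolution, which matches the CUDA-load numbers above.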

Even more thorough tests using Flowframes and various RIFE models confirm the above observations:

720p, FP32:

FPS 63.28 - CUDA ~45% - v3.8
FPS 59.43 - CUDA ~50% - v3.1
FPS 58.22 - CUDA ~70% - v2.4
FPS 58.39 - CUDA ~70% - v2.3
FPS 55.98 - CUDA 87% - v1.8

720p, FP16:

FPS 61.32 - CUDA ~45% - v3.8
FPS 60.03 - CUDA ~40% - v3.1
FPS 57.06 - CUDA ~50% - v2.4
FPS 57.20 - CUDA ~55% - v2.3
FPS 58.87 - CUDA ~70% - v1.8

1080p, FP32:

FPS: 27.02 - CUDA: ~50% - v3.8

1080p, FP16:

FPS: 26.84 - CUDA: ~50% - v3.8

https://www.svp-team.com/forum/viewtopic.php?pid=79531#p79531
https://www.svp-team.com/forum/viewtopic.php?pid=79550#p79550

These last tests show that the newer RIFE models do indeed put less load on CUDA, but unfortunately this is not accompanied by a clear increase in interpolated frames per second.

It looks as if the bottleneck is not the processing power of the CUDA cores but something else entirely. Perhaps VRAM bandwidth is the limitation? For the GeForce RTX 3090 that is 936 GB/s. Or maybe the data I/O pipeline, or too little parallelism?
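One way to narrow down such questions is to profile the inference loop and see where the time actually goes. A minimal sketch with `torch.profiler` (again using a hypothetical stand-in network, not RIFE itself):

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the interpolation network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
).to(device).eval()
x = torch.randn(1, 3, 96, 96, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities) as prof:
    for _ in range(5):
        model(x)

# The summary separates GPU kernel time from host-side and memcpy time:
# a GPU that never exceeds ~60% load is often waiting on data transfers
# or Python-side frame handling rather than on compute.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

If `Memcpy` or host-side operations dominate the table, that would point at the I/O pipeline rather than CUDA core throughput.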

Hzwer, what do you think the bottleneck is, and is it possible to optimize RIFE to achieve real-time interpolation of 1080p files without downscaling (scale=1.0)?

One more thing I wonder about: there is no performance difference between FP16 and FP32, and where there is one, FP16 is actually at a disadvantage.

This is plausible on the GeForce RTX 3090, because its CUDA cores offer the same peak throughput for FP32 and FP16:

35.6 Peak FP32 TFLOPS (non-Tensor)
35.6 Peak FP16 TFLOPS (non-Tensor)

However, the Tensor Cores offer much more processing power than the CUDA cores, and there FP16 already has a twofold advantage!

142/284 Peak FP16 Tensor TFLOPS with FP16 Accumulate
71/142 Peak FP16 Tensor TFLOPS with FP32 Accumulate

Data source - page 44: https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf

If it were possible to use FP16 precision on the Tensor Cores and eliminate the above bottleneck, we could interpolate 1080p files in real time during playback without having to re-encode them, and even think about real-time x4 interpolation!
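For reference, PyTorch already exposes Tensor Core FP16 execution through automatic mixed precision: under `torch.autocast`, convolutions and matmuls run in FP16 and cuDNN/cuBLAS can dispatch them to Tensor Cores on Ampere GPUs. Whether RIFE's specific ops actually hit the Tensor Cores depends on cuDNN heuristics, so this is only an illustrative sketch with a stand-in network:

```python
import contextlib

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the interpolation network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
).to(device).eval()
x = torch.randn(1, 3, 96, 96, device=device)

# On CUDA, autocast runs eligible ops in FP16, which is what lets cuDNN
# use Tensor Cores on a 3090; on CPU we fall back to a no-op context.
amp = (torch.autocast(device_type="cuda", dtype=torch.float16)
       if device == "cuda" else contextlib.nullcontext())

with torch.no_grad(), amp:
    out = model(x)

# out.dtype is torch.float16 under autocast on CUDA, torch.float32 on CPU.
print(out.dtype)
```

The weights stay in FP32 here; autocast casts inputs per-op, which avoids the accuracy pitfalls of a blanket `model.half()` while still opening the door to Tensor Core throughput.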

AIVFI avatar Dec 02 '21 22:12 AIVFI

You can interpolate 2x 1080p in real-time with weaker GPUs.

RIFE utilizes my 3090 fully.

You should also try RIFE 4.0, which is even faster and produces better results than any 3.xx model. It is only beaten by the 2.3/2.4 models, which are, however, more than twice as slow.

PweSol avatar Dec 20 '21 19:12 PweSol

Thanks for your answer. Do you use scale=1.0 or scale=0.5?

AIVFI avatar Dec 21 '21 00:12 AIVFI