[RFC] Initial draft for Linux hardware-accelerated decoding with VA-API
Addresses #1672
This pull request adds support for hardware-accelerated decoding with VA-API on Linux.
I'm about 70% certain that this code can be better optimized, but even with this initial draft I'm seeing much lower CPU usage overall.
A few limitations and open questions:
- VA-API/NV12 is hardcoded; we need a switch and auto-detection method.
- Rotation is broken.
- The !SCRCPY_LAVF_HAS_NEW_ENCODING_DECODING_API code path is not handled.
- From a quick glance, it looks easy to extend this to other hardware acceleration methods such as VDPAU or CUDA.
- CPU usage is still a bit high despite the fact that it's using VA-API.
- Reducing memory copies? I'm not sure whether this is zero-copy, or whether zero-copy is even possible with VA-API and SDL.
Any suggestions would be nice :)
Wow, thank you for that :+1:
I tested the branch, but unfortunately on my computer (it probably depends on the computer):
- it does not really improve CPU usage
- globally, the latency is worse (did not measure though, but the latency increase is sometimes obvious when I open an Android app)
- there is a visual glitch on start (but it's not important)
I guess (but I'm not sure) that the performance issues come from the fact that the decoded image is "downloaded" to main memory (av_hwframe_transfer_data()/av_image_fill_arrays()).
To avoid this problem in VLC, the OpenGL video output imports the hardware picture via "interops". For example for VAAPI: https://code.videolan.org/videolan/vlc/-/blob/56c05e47af966d53cb9d32d0b8d7c33f11cd6fc6/modules/video_output/opengl/interop_vaapi.c
But scrcpy just uses SDL and does not implement OpenGL/Metal/DirectX directly (because that's too much work).
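To give a rough sense of how much data such a download moves, here is a minimal sketch that computes the size of one NV12 frame and the resulting copy rate. The function names are hypothetical, the 60 fps figure in the usage note is an assumption, and row padding (stride alignment) is ignored, so real buffers may be somewhat larger.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one 8-bit NV12 frame: a full-resolution Y plane followed by one
 * interleaved UV plane subsampled 2x2. Odd dimensions are rounded up for the
 * chroma plane. Row padding is ignored (assumption). */
size_t nv12_frame_bytes(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    size_t luma = width * height;
    size_t chroma = 2 * ((width + 1) / 2) * ((height + 1) / 2);
    return luma + chroma;
}

/* Bytes copied per second if every decoded frame is downloaded at `fps`. */
size_t nv12_bytes_per_second(size_t width, size_t height, size_t fps)
{
    return nv12_frame_bytes(width, height) * fps;
}
```

For a 1440x2960 stream, `nv12_frame_bytes(1440, 2960)` is 6393600 bytes (about 6.1 MiB) per frame, and `nv12_bytes_per_second(1440, 2960, 60)` is 383616000 bytes per second at an assumed 60 fps. This only sizes the copy; it does not by itself explain how long the transfer takes on a given driver.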
> Wow, thank you for that

Thanks :smiley:
> I tested the branch, but unfortunately on my computer (it probably depends on the computer):
> - it does not really improve CPU usage
> - globally, the latency is worse (did not measure though, but the latency increase is sometimes obvious when I open an Android app)
> - there is a visual glitch on start (but it's not important)

Yeah, seems like it really depends on everything.
On my laptop (i7-8550U) with Ubuntu 20.04:
- CPU usage comes down from 140% to 80% (still high, right?). I thought leaving the phone on the camera app would stress the video decoder the most, but scrolling back and forth in the app drawer seems to stress it even more.
- I do get inconsistent latencies sometimes, but overall the difference from the original master branch is imperceptible. When the latency does become inconsistent, I think it slows down by 100-200 ms, but it doesn't appear to skip frames.
- I do not get a visual glitch on start on either my OnePlus 7 Pro (Snapdragon 855) or 8 Pro (865).
On that note though, when I was working on this initially (probably 18 hours ago), it would totally break on a Samsung Galaxy S10e with an Exynos processor.
My guess is that even more care is required for different processors..? I'll try again when I get back home and play around with Exynos.
> I guess that the performance issues come from the fact that the decoded image is "downloaded" to main memory (av_hwframe_transfer_data()/av_image_fill_arrays()).
> To avoid this problem in VLC, the OpenGL video output imports the hardware picture via "interops". For example for VAAPI: https://code.videolan.org/videolan/vlc/-/blob/56c05e47af966d53cb9d32d0b8d7c33f11cd6fc6/modules/video_output/opengl/interop_vaapi.c
> But scrcpy just uses SDL and does not implement OpenGL/Metal/DirectX directly (because that's too much work).

This is my first time calling FFmpeg/libav* functions directly haha
I should probably look into SDL a bit more. Referencing LookingGlass might give some pointers too, seems like it's also using SDL.
Thanks for the comment :+1:
Found something interesting.
After upgrading the graphics stack and X server to 1.20.9 (Ubuntu 20.10's build), the CPU usage was reduced even more. https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers https://launchpad.net/ubuntu/+source/xorg-server/2:1.20.9-2ubuntu1
- S/W decoding: 120%
- H/W decoding: 50%
it's great to see this being worked on, @arter97 thanks :tada:
tested it & saw significant resource savings, and for me, when running both side by side, latency seems lower with vaapi.
- S/W decoding: ~70% CPU
- H/W decoding: ~20% CPU
on fedora 33 using intel-media-driver
vainfo: VA-API version: 1.9 (libva 2.9.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.3.0 ()
Yeah, seems like something's not right with Exynos.
Video from both my Galaxy S9 and S10e cannot be decoded with this VA-API code. The pixel format doesn't come back as AV_PIX_FMT_VAAPI for some reason.
INFO: scrcpy 1.16 <https://github.com/Genymobile/scrcpy>
build_release//server/scrcpy-server: 1 file pushed, 0 skipped. 101.0 MB/s (33694 bytes in 0.000s)
[server] INFO: Device: samsung SM-G965N (Android 10)
INFO: Renderer: opengl
INFO: OpenGL version: 4.6 (Compatibility Profile) Mesa 20.3.0-devel (git-fb1793b 2020-11-08 focal-oibaf-ppa)
INFO: Trilinear filtering enabled
INFO: Initial texture: 1440x2960
ERROR: Unable to decode using VA-API
ERROR: Could not send video packet: -1094995529
ERROR: Could not process frame
WARN: Device disconnected
Video is hard lol
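As a side note on the log above: the number in "Could not send video packet: -1094995529" is not a plain errno. libavutil packs some error codes as the negated value of four ASCII characters, and this one unpacks to "INDA", which is AVERROR_INVALIDDATA. The sketch below shows the unpacking; the helper name `fferr_tag_to_string` is hypothetical and not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack an FFmpeg tag-style error code into its four ASCII characters.
 * libavutil builds such codes as -(a | b << 8 | c << 16 | d << 24).
 * `out` must have room for 5 bytes. Returns 0 on success, or -1 if the code
 * is not negative or any byte is not printable ASCII (e.g. a plain -errno).
 * Hypothetical helper, not an FFmpeg function. */
int fferr_tag_to_string(int32_t err, char out[5])
{
    assert(out != NULL);
    if (err >= 0)
        return -1;
    uint32_t tag = (uint32_t)(-(int64_t)err);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)((tag >> (8 * i)) & 0xFF);
        if (c < 0x20 || c > 0x7E)
            return -1;
        out[i] = (char)c;
    }
    out[4] = '\0';
    return 0;
}
```

Calling it with -1094995529 fills the buffer with "INDA". In real code, `av_strerror()` from libavutil gives the human-readable message directly.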
It seems that this is unable to detect my version of VA-API. Here is the scrcpy output and here is my vainfo.
./run x output:
INFO: scrcpy 1.16 <https://github.com/Genymobile/scrcpy>
x/server/scrcpy-server: 1 file pushed. 3.6 MB/s (33622 bytes in 0.009s)
[server] INFO: Device: samsung SM-G973U (Android 10)
INFO: Renderer: opengl
INFO: OpenGL version: 4.6 (Compatibility Profile) Mesa 20.0.8
INFO: Trilinear filtering enabled
INFO: Initial texture: 1080x2280
ERROR: Unable to decode using VA-API
ERROR: Could not send video packet: -1094995529
ERROR: Could not process frame
WARN: Device disconnected
vainfo output:
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_7
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.7 (libva 2.6.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.1.1 ()
vainfo: Supported profile and entrypoints
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointVLD
inxi -Fxz output:
Device-1: Intel UHD Graphics 620 vendor: Dell driver: i915 v: kernel
bus ID: 00:02.0
Display: x11 server: X.Org 1.20.8 driver: modesetting unloaded: fbdev,vesa
resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa Intel UHD Graphics 620 (KBL GT2) v: 4.6 Mesa 20.0.8
direct render: Yes
@arter97 If you're still interested in working on this (rebase/rewrite on dev branch and fix issues), we could enable it with an explicit option (--hw-dec). I prefer to keep it disabled by default for now, because it will probably cause issues depending on the machine.
@rom1v Hi.
Unfortunately, I don't think I can work on scrcpy due to my current occupation. This was more of a POC. I expect a much better implementation could be had by someone with deeper knowledge than me.
Thanks.
@arter97 OK, no problem.
I'll work on it in the future (your PoC will help) :wink:
I rewrote the PoC on current dev branch: hw_dec_poc.
With VAAPI, get_hw_format is called with AV_PIX_FMT_VAAPI in the pix_fmts list, so I return AV_PIX_FMT_VAAPI. But then it is called once again without AV_PIX_FMT_VAAPI:
DEBUG: == get_hw_format ==
DEBUG: ==== vdpau (100)
DEBUG: ==== cuda (119)
DEBUG: ==== vaapi_vld (46)
DEBUG: == get_hw_format ==
DEBUG: ==== vdpau (100)
DEBUG: ==== cuda (119)
DEBUG: ==== yuv420p (0)
DEBUG: ==== yuv420p (0)
So I can't make VAAPI work anymore. That's weird. (Note that I can use VAAPI in VLC with or without "direct rendering".)
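For readers following along, the selection logic described above can be sketched without FFmpeg. The enum is a stand-in for `enum AVPixelFormat`, with numeric values copied from the debug log (they vary between FFmpeg versions), and `pick_hw_format` is a hypothetical name; in real code this logic lives in the `get_format` callback, whose list is terminated by AV_PIX_FMT_NONE.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for enum AVPixelFormat. The values mirror the debug log above and
 * vary between FFmpeg versions; real code would use the AV_PIX_FMT_* names. */
enum pix_fmt {
    PIX_FMT_NONE = -1,   /* list terminator, like AV_PIX_FMT_NONE */
    PIX_FMT_YUV420P = 0,
    PIX_FMT_VAAPI = 46,
    PIX_FMT_VDPAU = 100,
    PIX_FMT_CUDA = 119,
};

/* Return `wanted` if it appears in the terminated `offered` list,
 * otherwise PIX_FMT_NONE to signal that hardware decoding is unavailable. */
enum pix_fmt pick_hw_format(const enum pix_fmt *offered, enum pix_fmt wanted)
{
    assert(offered != NULL);
    for (const enum pix_fmt *p = offered; *p != PIX_FMT_NONE; p++) {
        if (*p == wanted)
            return wanted;
    }
    return PIX_FMT_NONE;
}
```

With the first list from the log (vdpau, cuda, vaapi_vld) the function returns the VAAPI value; with the second list (vdpau, cuda, yuv420p, yuv420p) it returns the terminator. One possible reading of the second call, not confirmed in this thread, is that the VAAPI hwaccel setup failed and libavcodec asked again with that entry removed.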
I can use VDPAU though. But:
- the resulting picture chroma is misplaced (either a decoder or SDL bug/misconfiguration)
- each call to av_hwframe_transfer_data() takes between 30ms and 50ms (so it can't decode in real-time without adding latency)
> So I can't make VAAPI work anymore. That's weird.

With recent updates in Debian packages, VAAPI now works (branch hw_dec_poc)… but av_hwframe_transfer_data() takes ~80ms :disappointed:
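To put the transfer times reported in this thread in perspective, here is a small sketch of the frame-budget arithmetic. The function names are hypothetical, and it assumes the transfer is serialized with everything else, so it is an upper bound on frame rate rather than a measurement.

```c
#include <assert.h>

/* Highest frame rate sustainable if every frame spends `transfer_ms`
 * in the download step and nothing else overlaps with it. */
double max_fps_for_transfer(double transfer_ms)
{
    assert(transfer_ms > 0.0);
    return 1000.0 / transfer_ms;
}

/* Nonzero if a per-frame transfer time fits within the frame interval
 * at `fps`. */
int transfer_fits_frame_budget(double transfer_ms, double fps)
{
    assert(fps > 0.0);
    return transfer_ms <= 1000.0 / fps;
}
```

An 80 ms transfer caps throughput at 12.5 fps, and a 50 ms transfer at 20 fps. Against an assumed 60 fps stream the per-frame budget is about 16.7 ms, so none of the reported times fit.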
Update: https://github.com/Genymobile/scrcpy/issues/3800#issuecomment-1755466479