
[RFC] Initial draft for Linux hardware-accelerated decoding with VA-API

Open arter97 opened this issue 5 years ago • 11 comments

Addresses #1672

This pull request adds support for hardware-accelerated decoding with VA-API on Linux.

I'm about 70% certain that this code can be better optimized, but even with this initial draft I'm seeing much lower CPU usage overall.

A few limitations:

  • VA-API/NV12 is hardcoded; we need a switch and auto-detection method.
  • Rotation is broken.
  • !SCRCPY_LAVF_HAS_NEW_ENCODING_DECODING_API is not taken care of.
  • From a quick glimpse, it looks like it's easy to extend this to other H/W accelerated methods such as VDPAU or CUDA.
  • CPU usage is still a bit high despite the fact that it's using VA-API.
  • Reducing memory copy? I'm not sure if this is zero-copy or not, or whether zero-copy is even possible under VA-API and SDL.

Any suggestions would be nice :)

arter97 avatar Nov 09 '20 12:11 arter97

Wow, thank you for that :+1:

I tested the branch, but unfortunately on my computer (it probably depends on the computer):

  • it does not really improve CPU usage
  • globally, the latency is worse (did not measure though, but the latency increase is sometimes obvious when I open an Android app)
  • there is a visual glitch on start (but it's not important)

I guess (but I'm not sure) that the performance issues come from the fact that the decoded image is "downloaded" to main memory (av_hwframe_transfer_data()/av_image_fill_arrays()).

To avoid this problem in VLC, the OpenGL video output imports the hardware picture via "interops". For example for VAAPI: https://code.videolan.org/videolan/vlc/-/blob/56c05e47af966d53cb9d32d0b8d7c33f11cd6fc6/modules/video_output/opengl/interop_vaapi.c

But scrcpy just uses SDL and does not implement OpenGL/Metal/DirectX directly (because that's too much work).
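A rough sense of the data volume behind that download to main memory can be sketched as below. This is back-of-the-envelope arithmetic only: it assumes NV12 output (12 bits per pixel, as hardcoded in the draft), and the 1440x2960 frame size and 60 fps rate are illustrative assumptions, not measurements. The function name is hypothetical.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Bytes in one NV12 frame: a full-size Y plane plus a half-size interleaved UV plane."""
    luma = width * height
    chroma = luma // 2  # 4:2:0 subsampling: U and V together are half the luma size
    return luma + chroma


# Assumed example values (not taken from a measurement)
width, height, fps = 1440, 2960, 60

per_frame = nv12_frame_bytes(width, height)
per_second = per_frame * fps

print(f"{per_frame} bytes per frame")           # 6393600 bytes per frame
print(f"{per_second / 1e6:.1f} MB per second")  # 383.6 MB per second
```

In the download path, each frame of that size is copied out of the hardware surface and then uploaded again as an SDL texture, which is the round trip an interop would avoid.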

rom1v avatar Nov 09 '20 12:11 rom1v

Wow, thank you for that

Thanks :smiley:

I tested the branch, but unfortunately on my computer (it probably depends on the computer):

  • it does not really improve CPU usage
  • globally, the latency is worse (did not measure though, but the latency increase is sometimes obvious when I open an Android app)
  • there is a visual glitch on start (but it's not important)

Yeah, seems like it really depends on everything.

On my laptop (i7-8550U) with Ubuntu 20.04:

  • The CPU usage comes down from 140% to 80% (still high, right?). I thought leaving the phone on the camera app would stress the video decoder the most, but scrolling back and forth in the app drawer seems to stress it even more.
  • I do get inconsistent latencies sometimes, but overall the difference from the original master branch is imperceptible. When I do get those inconsistent latencies, I think it slows down by 100-200ms, but it doesn't appear to skip frames.
  • I do not get a visual glitch on start on either my OnePlus 7 Pro (Snapdragon 855) or 8 Pro (865).

On that note though, when I was working on this initially (probably 18 hours ago), it would totally break on a Samsung Galaxy S10e with an Exynos processor.

My guess is that even more care for different processors is required..? I'll try again when I get back home and play around with Exynos.

I guess that the performance issues come from the fact that the decoded image is "downloaded" to main memory (av_hwframe_transfer_data()/av_image_fill_arrays()).

To avoid this problem in VLC, the OpenGL video output imports the hardware picture via "interops". For example for VAAPI: https://code.videolan.org/videolan/vlc/-/blob/56c05e47af966d53cb9d32d0b8d7c33f11cd6fc6/modules/video_output/opengl/interop_vaapi.c

But scrcpy just uses SDL and does not implement OpenGL/Metal/DirectX directly (because that's too much work).

This is my first time working on calling FFmpeg/libav* functions directly haha

I should probably look into SDL a bit more. Referencing LookingGlass might give some pointers too, seems like it's also using SDL.

Thanks for the comment :+1:

arter97 avatar Nov 09 '20 13:11 arter97

Found something interesting.

After upgrading the graphics stack and X server to 1.20.9 (Ubuntu 20.10's build), the CPU usage was reduced even more.

https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers
https://launchpad.net/ubuntu/+source/xorg-server/2:1.20.9-2ubuntu1

S/W decoding: 120%
H/W decoding: 50%

arter97 avatar Nov 09 '20 14:11 arter97

it's great to see this being worked on, @arter97 thanks :tada:

tested it & saw significant savings in resources, and for me, when running both side by side, latency seems lower with vaapi.

S/W decoding: ~70% CPU
H/W decoding: ~20% CPU

on fedora 33 using intel-media-driver

vainfo: VA-API version: 1.9 (libva 2.9.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.3.0 ()

rajveermalviya avatar Nov 09 '20 17:11 rajveermalviya

Yeah, seems like something's not right with Exynos.

Video from both my Galaxy S9 and S10e cannot be decoded with this VA-API code. The pixel format doesn't come back as AV_PIX_FMT_VAAPI for some reason.

INFO: scrcpy 1.16 <https://github.com/Genymobile/scrcpy>
build_release//server/scrcpy-server: 1 file pushed, 0 skipped. 101.0 MB/s (33694 bytes in 0.000s)
[server] INFO: Device: samsung SM-G965N (Android 10)
INFO: Renderer: opengl
INFO: OpenGL version: 4.6 (Compatibility Profile) Mesa 20.3.0-devel (git-fb1793b 2020-11-08 focal-oibaf-ppa)
INFO: Trilinear filtering enabled
INFO: Initial texture: 1440x2960
ERROR: Unable to decode using VA-API
ERROR: Could not send video packet: -1094995529
ERROR: Could not process frame
WARN: Device disconnected
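
The number in the "Could not send video packet" line is an FFmpeg error code. FFmpeg builds several of its error codes by packing four ASCII characters into an integer and negating it (the `FFERRTAG` macro), so the tag can be recovered with a few lines. The sketch below is plain Python, not part of scrcpy, and the function name is made up for illustration.

```python
def fferrtag_to_text(code: int) -> str:
    """Recover the four-character tag packed into a negative FFmpeg error code.

    Assumes `code` is a tag-style error (negative, magnitude fits in 32 bits).
    """
    value = -code  # FFERRTAG negates the packed tag
    # The first character sits in the lowest byte, so read the bytes little-endian
    return value.to_bytes(4, "little").decode("ascii", errors="replace")


print(fferrtag_to_text(-1094995529))  # INDA
```

`INDA` is the tag of `AVERROR_INVALIDDATA` ("Invalid data found when processing input"), which suggests the decoder rejected the packet as invalid.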

Video is hard lol

arter97 avatar Nov 10 '20 10:11 arter97

It seems that this is unable to detect my version of VA-API. Here is the scrcpy output and here is my vainfo.

./run x output:

INFO: scrcpy 1.16 <https://github.com/Genymobile/scrcpy>
x/server/scrcpy-server: 1 file pushed. 3.6 MB/s (33622 bytes in 0.009s)
[server] INFO: Device: samsung SM-G973U (Android 10)
INFO: Renderer: opengl
INFO: OpenGL version: 4.6 (Compatibility Profile) Mesa 20.0.8
INFO: Trilinear filtering enabled
INFO: Initial texture: 1080x2280
ERROR: Unable to decode using VA-API
ERROR: Could not send video packet: -1094995529
ERROR: Could not process frame
WARN: Device disconnected

vainfo output:

libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_7
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.7 (libva 2.6.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 20.1.1 ()
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointEncSliceLP
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointEncSliceLP
      VAProfileJPEGBaseline           :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          :	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointVLD

inxi -Fxz output:

  Device-1: Intel UHD Graphics 620 vendor: Dell driver: i915 v: kernel 
  bus ID: 00:02.0 
  Display: x11 server: X.Org 1.20.8 driver: modesetting unloaded: fbdev,vesa 
  resolution: 1920x1080~60Hz 
  OpenGL: renderer: Mesa Intel UHD Graphics 620 (KBL GT2) v: 4.6 Mesa 20.0.8 
  direct render: Yes 

regulardude400 avatar Dec 31 '20 17:12 regulardude400

@arter97 If you're still interested in working on this (rebase/rewrite on dev branch and fix issues), we could enable it with an explicit option (--hw-dec) (I prefer to keep it disabled by default for now, because it will probably cause issues depending on the machine).

rom1v avatar Feb 07 '22 10:02 rom1v

@rom1v Hi.

Unfortunately, I don't think I can work on scrcpy due to my current occupation. This was more of a POC. I expect a much better implementation could be had by someone with deeper knowledge than me.

Thanks.

arter97 avatar Feb 17 '22 20:02 arter97

@arter97 OK, no problem.

I'll work on it in the future (your PoC will help) :wink:

rom1v avatar Feb 17 '22 20:02 rom1v

I rewrote the PoC on the current dev branch: hw_dec_poc.

With VAAPI, get_hw_format is called with AV_PIX_FMT_VAAPI in the pix_fmts list, so I return AV_PIX_FMT_VAAPI. But then it is called once again without AV_PIX_FMT_VAAPI:

DEBUG: == get_hw_format ==
DEBUG: ==== vdpau (100)
DEBUG: ==== cuda (119)
DEBUG: ==== vaapi_vld (46)
DEBUG: == get_hw_format ==
DEBUG: ==== vdpau (100)
DEBUG: ==== cuda (119)
DEBUG: ==== yuv420p (0)
DEBUG: ==== yuv420p (0)

So I can't make VAAPI work anymore. That's weird. (Note that I can use VAAPI in VLC with or without "direct rendering".)
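
The behaviour in the log can be modelled as follows. This is a simplified Python model of a `get_format`-style callback, not FFmpeg code: the format names are taken from the debug output above, while `pick_hw_format` and its `None` return value are hypothetical stand-ins.

```python
from typing import Optional, Sequence


def pick_hw_format(offered: Sequence[str], wanted: str) -> Optional[str]:
    """Return the wanted hardware format if the decoder offers it, else None."""
    for fmt in offered:
        if fmt == wanted:
            return fmt
    return None  # hypothetical failure signal: the wanted format was not offered


# The two lists printed in the debug output above
first_call = ["vdpau", "cuda", "vaapi_vld"]
second_call = ["vdpau", "cuda", "yuv420p", "yuv420p"]

print(pick_hw_format(first_call, "vaapi_vld"))   # vaapi_vld
print(pick_hw_format(second_call, "vaapi_vld"))  # None
```

On the second call the VAAPI entry is gone, so a callback written this way has nothing to select. One possible explanation (an assumption, not confirmed in this thread) is that the decoder failed to set up the format chosen on the first call and asked again with that format removed from the list.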

I can use VDPAU though. But:

  • the resulting picture chroma is misplaced (either a decoder or SDL bug/misconfiguration)
  • each call to av_hwframe_transfer_data() takes between 30ms and 50ms (so it can't decode in real-time without adding latency)
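
To put those transfer times in perspective, they can be compared with the time available per frame. The 60 fps rate is an assumption for illustration; the 30 ms and 50 ms values are the bounds quoted above, and the function name is hypothetical.

```python
def frame_intervals(transfer_ms: float, fps: float) -> float:
    """Number of frame intervals taken up by one transfer of transfer_ms milliseconds."""
    return transfer_ms * fps / 1000.0


fps = 60  # assumed stream frame rate
print(f"frame budget: {1000.0 / fps:.1f} ms")  # frame budget: 16.7 ms
for transfer_ms in (30, 50):
    print(f"{transfer_ms} ms -> {frame_intervals(transfer_ms, fps):.1f} frame intervals")
# 30 ms -> 1.8 frame intervals
# 50 ms -> 3.0 frame intervals
```

At that rate a single transfer already takes longer than one frame interval, so frames queue up unless some are dropped.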

rom1v avatar Mar 14 '22 20:03 rom1v

So I can't make VAAPI work anymore. That's weird.

With recent updates in Debian packages, VAAPI now works (branch hw_dec_poc)… but av_hwframe_transfer_data() takes ~80ms :disappointed:

rom1v avatar Mar 29 '22 14:03 rom1v

Update: https://github.com/Genymobile/scrcpy/issues/3800#issuecomment-1755466479

rom1v avatar Oct 10 '23 13:10 rom1v