
Better Raspberry Pi server performance

chewi opened this pull request 1 year ago • 5 comments

Description

Now I reveal what I really want to use Sunshine for. As a server on the Raspberry Pi! Why would I want such a thing? Surely it makes more sense as a client? Normally yes, but when combined with the PiStorm project, things get very interesting.

As you might imagine, PiStorm is very CPU-intensive, so for this to be feasible, Sunshine needs to use as little CPU as possible. The first step here was obviously to get hardware video encoding to work. The Pi does not support VAAPI or CUDA, but fortunately, this still turned out to be very easy.

These initial changes to add a V4L2M2M encoder did not work for me at first, as Sunshine claimed that an IDR frame was not produced. Digging around in the internals, it looked very much to me like requesting IDR frames should work on the Pi. As a shot in the dark, I applied John Cox's ffmpeg patchset for the Raspberry Pi. This patchset, which I recently applied to Gentoo's ffmpeg package, enables efficient zero-copy video playback on the Pi. With this, I have seen 1080p videos go from a stuttery mess to being buttery smooth. Being playback-focused, I really didn't expect it to help, but I was delighted when it suddenly sprang to life!
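
For context, forcing a keyframe through avcodec generally amounts to tagging the next input frame, roughly like this (a minimal sketch; Sunshine's actual request_idr_frame() may differ in detail):

    extern "C" {
    #include <libavcodec/avcodec.h>
    }

    // Minimal sketch: ask the encoder for a keyframe by tagging the next
    // input frame as an I-frame. Newer FFmpeg replaces the deprecated
    // key_frame field with frame->flags |= AV_FRAME_FLAG_KEY.
    int force_idr(AVCodecContext *ctx, AVFrame *frame) {
      frame->pict_type = AV_PICTURE_TYPE_I;
      frame->key_frame = 1;
      return avcodec_send_frame(ctx, frame);
    }

With the patchset applied, the encoder now initialises cleanly: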

[2024:02:25:17:15:54]: Info: Found H.264 encoder: h264_v4l2m2m [V4L2M2M]
[2024:02:25:17:15:54]: Info: Executing [Desktop]
[2024:02:25:17:15:54]: Info: CLIENT CONNECTED
[2024:02:25:17:15:54]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Error: Couldn't expose some/all drm planes for card: /dev/dri/card0
[2024:02:25:17:15:55]: Info: Screencasting with KMS
[2024:02:25:17:15:55]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Info: Found monitor for DRM screencasting
[2024:02:25:17:15:55]: Info: Found connector ID [32]
[2024:02:25:17:15:55]: Info: Found cursor plane [309]
[2024:02:25:17:15:55]: Info: SDR color coding [Rec. 601]
[2024:02:25:17:15:55]: Info: Color depth: 8-bit
[2024:02:25:17:15:55]: Info: Color range: [MPEG]
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160]  <<< v4l2_encode_init: fmt=0/0
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] Using device /dev/video11
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] driver 'bcm2835-codec' on card 'bcm2835-codec-encode' in mplane mode
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] requesting formats: output=YU12/yuv420p capture=H264/none

The quality isn't fantastic though, and it's still using 275% CPU. I utilised gprof to find out where all that effort is being spent.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 51.88     10.78    10.78                             ff_hscale16to15_X4_neon_asm
 18.48     14.62     3.84                             ff_yuv2planeX_8_neon
 13.47     17.42     2.80   156694     0.00     0.00  bgr32ToUV_half_c
 11.98     19.91     2.49   155935     0.00     0.00  bgr32ToY_c
  0.67     20.05     0.14                             ff_hscale16to15_4_neon_asm
  0.53     20.16     0.11      142     0.00     0.00  std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > > std::__copy_move_a1<false, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >
 >(char const*, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >)
  0.43     20.25     0.09      284     0.00     0.02  scale_internal
  0.38     20.33     0.08    94424     0.00     0.00  chr_planar_vscale
  0.29     20.39     0.06    38454     0.00     0.00  chr_convert
  0.29     20.45     0.06    38383     0.00     0.00  chr_h_scale
  0.24     20.50     0.05      577     0.00     0.00  yuv2planeX_8_c
  0.19     20.54     0.04    60422     0.00     0.00  lum_convert
  0.19     20.58     0.04        1     0.04     2.95  video::capture_async(std::shared_ptr<safe::mail_raw_t>, video::config_t&, void*)
  0.14     20.61     0.03     2133     0.00     0.00  lumRangeToJpeg_c
  0.14     20.64     0.03                             _init
  0.10     20.66     0.02   103963     0.00     0.00  lum_planar_vscale
  0.10     20.68     0.02       24     0.00     0.00  alloc_gamma_tbl
  0.05     20.69     0.01   955063     0.00     0.00  av_pix_fmt_desc_get
  0.05     20.70     0.01    59959     0.00     0.00  lum_h_scale
  0.05     20.71     0.01     6502     0.00     0.00  obl_axpy
  0.05     20.72     0.01     2148     0.00     0.00  chrRangeToJpeg_c
  0.05     20.73     0.01     2081     0.00     0.00  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
  0.05     20.74     0.01      483     0.00     0.00  av_frame_unref
  0.05     20.75     0.01      376     0.00     0.00  stream::control_server_t::call(unsigned short, stream::session_t*, std::basic_string_view<char, std::char_traits<char> > const&, bool)
  0.05     20.76     0.01        3     0.00     0.00  video::avcodec_encode_session_t::request_idr_frame()
  0.05     20.77     0.01                             av_bprint_escape
  0.05     20.78     0.01                             ff_hscale16to19_X4_neon_asm
  0.00     20.78     0.00   463475     0.00     0.00  ff_hscale16to15_X4_neon
  0.00     20.78     0.00   103794     0.00     0.00  ff_rotate_slice
  0.00     20.78     0.00    28314     0.00     0.00  av_opt_next
  0.00     20.78     0.00    11496     0.00     0.00  av_bprint_init
  0.00     20.78     0.00     9141     0.00     0.00  av_buffer_unref
  0.00     20.78     0.00     7378     0.00     0.00  glad_gl_get_proc_from_userptr
  0.00     20.78     0.00     7306     0.00     0.00  enet_list_clear
  0.00     20.78     0.00     7184     0.00     0.00  enet_protocol_send_outgoing_commands
  0.00     20.78     0.00     6975     0.00     0.00  enet_time_get
  0.00     20.78     0.00     6812     0.00     0.00  config::whitespace(char)
  0.00     20.78     0.00     6433     0.00     0.00  ff_hscale16to15_4_neon

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Dependency update (updates to dependencies)
  • [ ] Documentation update (changes to documentation)
  • [ ] Repository update (changes to repository files, e.g. .github/...)

Checklist

  • [X] My code follows the style guidelines of this project
  • [X] I have performed a self-review of my own code
  • [X] I have commented my code, particularly in hard-to-understand areas
  • [X] I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch must be updated before it can be merged. You must also Allow edits from maintainers.

  • [X] I want maintainers to keep my branch updated

chewi avatar Feb 25 '24 17:02 chewi

As a server on the Raspberry Pi!

This is a Pi 4, I assume? I don't think the Pi 5 has any hardware encoders anymore.

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Yeah, it's all in the RGB->YUV color conversion code, which is expected since that conversion is all being done on the CPU. I guess it's nice that it's multi-threaded now. You can adjust the "Minimum CPU Thread Count" on the Advanced tab in the UI if you want to play with the amount of concurrency there.

What your encoding pipeline looks like now: RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> readback from EGL to CPU (glGetTextureSubImage) -> RGB to YUV conversion and scaling (libswscale) -> upload to DMA-BUF again -> encode the DMA-BUF

What you want is more like what we do with VAAPI: RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> render using color conversion shaders into another DMA-BUF -> pass that DMA-BUF (AV_PIX_FMT_DRM_PRIME) to h264_v4l2m2m.
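
For reference, the EGL import step at the start of both pipelines looks roughly like this; a sketch for a single-plane RGB buffer, with the fourcc hard-coded and DRM format modifiers ignored:

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <drm_fourcc.h>

    // Sketch: import a single-plane RGB DMA-BUF from KMS capture as an EGLImage.
    // Multi-plane formats and format modifiers are omitted for brevity.
    EGLImage import_dmabuf(EGLDisplay display, int fd, int width, int height, int pitch) {
      const EGLAttrib attribs[] = {
        EGL_WIDTH, width,
        EGL_HEIGHT, height,
        EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_ARGB8888,
        EGL_DMA_BUF_PLANE0_FD_EXT, fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, pitch,
        EGL_NONE,
      };
      return eglCreateImage(display, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, nullptr, attribs);
    }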

Most of that pipeline is simple and already written in Sunshine. The tricky part will be getting that second DMA-BUF to write into and/or exporting the render target as a DMA-BUF. Since there's no standard way to create a DMA-BUF, that part tends to be highly API-specific. For VAAPI, we import the underlying DMA-BUF of the VA surface as the render target for our color conversion. For CUDA, we create a blank texture to use as the render target and use the CUDA-GL interop APIs to import that texture as a CUDA resource for NVENC to read.
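
On Mesa, one route to that second DMA-BUF is EGL_MESA_image_dma_buf_export: wrap the render target texture in an EGLImage and export it. A rough sketch, untested on the Pi and assuming a single-plane output format:

    #include <cstdint>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>

    // Sketch: export a GL texture (the color conversion render target) as a
    // DMA-BUF via EGL_MESA_image_dma_buf_export. Error handling omitted.
    int export_render_target(EGLDisplay display, EGLContext context, GLuint texture) {
      EGLImage image = eglCreateImage(display, context, EGL_GL_TEXTURE_2D,
                                      (EGLClientBuffer) (uintptr_t) texture, nullptr);

      auto export_image = (PFNEGLEXPORTDMABUFIMAGEMESAPROC)
        eglGetProcAddress("eglExportDMABUFImageMESA");

      int fd = -1;
      EGLint stride = 0, offset = 0;
      export_image(display, image, &fd, &stride, &offset);
      return fd;  // hand this to the encoder side as AV_PIX_FMT_DRM_PRIME
    }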

A good place to start would probably be writing something like this for AV_HWDEVICE_TYPE_DRM and using that in your encoder.
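
The core of that is just av_hwdevice_ctx_create() with AV_HWDEVICE_TYPE_DRM, roughly like this (the render node path here is a guess; the Pi may expose it differently):

    extern "C" {
    #include <libavutil/hwcontext.h>
    }

    // Sketch: open a DRM hwdevice context to attach to the encoder.
    AVBufferRef *create_drm_device() {
      AVBufferRef *device = nullptr;
      if (av_hwdevice_ctx_create(&device, AV_HWDEVICE_TYPE_DRM,
                                 "/dev/dri/renderD128", nullptr, 0) < 0) {
        return nullptr;
      }
      return device;
    }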

Then for your encoder definition you probably want something like this:

    std::make_unique<encoder_platform_formats_avcodec>(
      AV_HWDEVICE_TYPE_DRM, AV_HWDEVICE_TYPE_NONE,  // base hwdevice type; no derived type
      AV_PIX_FMT_DRM_PRIME,                         // hardware frames arrive as DRM PRIME (DMA-BUF)
      AV_PIX_FMT_NV12, AV_PIX_FMT_P010,             // 8-bit and 10-bit software pixel formats
      drm_init_avcodec_hardware_input_buffer),      // your new hwdevice init function

Since FFmpeg's hwcontext_drm.c doesn't support frame allocation, you'll need to figure out how to allocate frames yourself and provide a buffer pool for them.
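
One (untested) possibility is allocating through GBM (something like gbm_bo_create() with GBM_FORMAT_NV12 and GBM_BO_USE_LINEAR) and describing each buffer with the AVDRMFrameDescriptor that AV_PIX_FMT_DRM_PRIME frames carry. The allocator choice and the linear NV12 layout below are assumptions:

    #include <gbm.h>
    #include <drm_fourcc.h>
    extern "C" {
    #include <libavutil/hwcontext_drm.h>
    }

    // Sketch (untested): describe an NV12 GBM buffer object as an
    // AVDRMFrameDescriptor. A real implementation would pool these
    // buffers and hand them out as AVFrames.
    AVDRMFrameDescriptor describe_bo(struct gbm_bo *bo) {
      AVDRMFrameDescriptor desc = {};
      desc.nb_objects = 1;
      desc.objects[0].fd = gbm_bo_get_fd(bo);
      // Assumes a linear layout: full-res Y plane plus half-height UV plane.
      desc.objects[0].size = gbm_bo_get_stride(bo) * gbm_bo_get_height(bo) * 3 / 2;
      desc.nb_layers = 1;
      desc.layers[0].format = DRM_FORMAT_NV12;
      desc.layers[0].nb_planes = 2;
      desc.layers[0].planes[0].object_index = 0;
      desc.layers[0].planes[0].offset = gbm_bo_get_offset(bo, 0);
      desc.layers[0].planes[0].pitch = gbm_bo_get_stride_for_plane(bo, 0);
      desc.layers[0].planes[1].object_index = 0;
      desc.layers[0].planes[1].offset = gbm_bo_get_offset(bo, 1);
      desc.layers[0].planes[1].pitch = gbm_bo_get_stride_for_plane(bo, 1);
      return desc;
    }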

Finally, on the encoding side, you'll want to do something similar to what I did in 8182f592e8386b714c35772ca3651547d5001e5a for supporting KMS->GL->CUDA with the gl_cuda_vram_t and make_avcodec_gl_encode_device.

cgutman avatar Feb 25 '24 20:02 cgutman

Many thanks for the detailed reply. Sounds like this could be an interesting exercise. I may be wrong, but I think playback scenarios have managed to avoid GL altogether. What Kodi calls Direct to Plane and mpv calls HW-overlay? Is that not possible here?

chewi avatar Feb 25 '24 20:02 chewi

I think that color conversion hardware is only accessible on the scanout path (and it's YUV->RGB, not RGB->YUV). Some encoders do have the ability to accept RGB frames and perform the conversion to YUV internally (using dedicated hardware or a shader), but I don't think the Pi's encoder supports RGB input.

cgutman avatar Feb 25 '24 21:02 cgutman

Thanks for those pointers.

My initial PoC seemingly needed John Cox's ffmpeg patchset for the Raspberry Pi. It's a rather heavy patchset, but Gentoo isn't the only party invested in keeping it updated. John does a good job by himself anyway. Whether it will be needed in the end will depend on the architecture we go for.

I did spend quite a long time looking into this after cgutman gave me some pointers. I was really struggling with the DMA-BUF part of it, as v4l2m2m seems to work quite differently to VAAPI.

I also considered doing it a different way, using the Pi's ISP for the pixel format conversion. ffmpeg has some support for it already. This might be simpler and even more efficient, but it would also be Pi-specific. v4l2m2m seems preferable, as it is supported by many SoCs.

It's been a while since I had time to work on this. It's something I'd really like to do, but Gentoo maintenance usually takes priority.

chewi avatar Jun 16 '24 14:06 chewi

Understood. I will convert this to a draft for now. Whenever you are ready, feel free to mark it as ready for review again.

ReenigneArcher avatar Jun 16 '24 14:06 ReenigneArcher