Support hardware-accelerated decoding and tone-mapping
This is a tracking issue for adding hardware decoding and tone-mapping support for transcoding.
What
Hardware-accelerated decoding loads videos onto an acceleration device and decodes them with the device's built-in support for certain codecs and formats. This differs from software decoding, where videos are instead decoded on the CPU by a general-purpose program.
Hardware-accelerated tone-mapping is similarly performed within the acceleration device, but takes place after decoding.
Why
Hardware decoding is good for a number of reasons.
- It's faster
  - Accelerated decoding is naturally faster by virtue of its dedicated hardware optimization
  - By keeping data in the acceleration device, it avoids the encoder being starved by a CPU that can't serve decoded frames quickly enough
  - Decoded video can be used directly by the acceleration device without a relatively expensive CPU->GPU transfer
  - It avoids contention in cases where the CPU is concurrently doing other intensive work
- It reduces CPU load
  - Since the CPU doesn't decode the video, the CPU load incurred by decoding is minimal
  - Particularly on lower-end devices, the acceleration device can dramatically outperform the CPU, so software decoding forces the CPU to work hard just to keep up with the device's encoding speed
Concerns
- Source videos come in many different forms, and it's tricky to know in advance whether the device can decode a given video (at least in JavaScript)
- Different APIs may expose different tone-mapping options and modes, so supporting the current settings in each API may require more effort
- Hardware tone-mapping is essentially a prerequisite for hardware decoding
  - Since hardware decoding loads videos onto the device, software tone-mapping would require a GPU->CPU transfer after decoding followed by another CPU->GPU transfer after tone-mapping, the overhead of which defeats the point of acceleration
- While less relevant in recent years, hardware decoding can in some cases produce lower quality than software decoding
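To make the round-trip concern concrete, the sketch below shows what a QSV pipeline would look like if it decoded in hardware but tone-mapped in software: frames have to be downloaded to system memory with `hwdownload`, tone-mapped on the CPU, then re-uploaded with `hwupload`. File names are placeholders; this is an illustration of the transfer overhead, not Immich's actual command.

```shell
# Hardware decode -> download to system RAM -> software tone-map -> upload
# back to the GPU for scaling/encoding. The two transfers are exactly the
# overhead described above.
ffmpeg -init_hw_device qsv=hw -filter_hw_device hw \
  -hwaccel qsv -hwaccel_output_format qsv -c:v hevc_qsv -i input_hdr.mov \
  -vf "hwdownload,format=p010le,zscale=t=linear:npl=100,tonemap=hable:desat=0,zscale=p=bt709:t=bt709:m=bt709:range=pc,format=nv12,hwupload=extra_hw_frames=64,scale_qsv=w=1920:h=-1" \
  -c:v hevc_qsv -global_quality 23 output.mp4
```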
Tasks
- [x] Ensure it uses software decoding if the user's hardware can't decode or tone-map a video
- [ ] Ensure that in cases of incompatibility, it still uses accelerated encoding rather than falling back entirely to software
- [x] Ensure that current tone-mapping options are available for each API where possible
- [x] Support Quick Sync
- [x] Support NVENC
- [ ] Support VAAPI
Just adding my two cents about tone-mapping with Intel Quick Sync. Calling ffmpeg with `-vf "vpp_qsv=tonemap=1"` enables hardware-accelerated tone-mapping with QSV, but it's only available when FFmpeg is compiled with oneVPL enabled. (When running FFmpeg's `./configure`, just replace `--enable-libmfx` with `--enable-libvpl`; the Intel oneVPL library is needed, of course.) Given oneVPL's dispatching behavior, I'm not sure whether this would work with Intel processors before Tiger Lake.
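For reference, a minimal sketch of what a fully hardware QSV pipeline with `vpp_qsv` tone-mapping could look like (file names are placeholders, and this assumes an FFmpeg build with oneVPL as described above):

```shell
# Decode, tone-map, and encode entirely on the Intel device; frames never
# leave GPU memory between decode and encode.
ffmpeg -init_hw_device qsv=hw -filter_hw_device hw \
  -hwaccel qsv -hwaccel_output_format qsv -c:v hevc_qsv -i input_hdr.mov \
  -vf "vpp_qsv=tonemap=1" \
  -c:v hevc_qsv -global_quality 23 output_sdr.mp4
```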
The version of FFmpeg we use is built with oneVPL, so that shouldn't be an issue, but it does seem like this wouldn't work if it dispatches to Media SDK. Jellyfin docs mention that the main advantage of QSV's tonemapping is lower power consumption, but otherwise OpenCL has wider hardware compatibility and is more customizable. Maybe that's the direction to go in that case.
Will this also apply to thumbnail generation? It could be good for a big library of photos.
No, it wouldn't have an effect on images. But for live/motion photos, the video portion of these would benefit.
I was curious why immich, even with hardware transcoding enabled, was basically maxing out my 16 CPU cores with only 1 transcode job running. It's also only using 15% of my GPU's render capability.
I went ahead and played with some of the ffmpeg options. Most of this is known but just adding my findings here:
Here is a sample immich ffmpeg call when using Intel QSV:

```shell
ffmpeg -init_hw_device qsv=hw -filter_hw_device hw -i upload/upload/4ef.../...c2f.MOV -y -c:v hevc_qsv -c:a aac -movflags faststart -fps_mode passthrough -map 0:0 -map 0:1 -bf 7 -refs 5 -g 256 -v verbose -vf zscale=t=linear:npl=100,tonemap=hable:desat=0,zscale=p=bt709:t=bt709:m=bt709:range=pc,format=nv12,hwupload=extra_hw_frames=64,scale_qsv=1080:-1 -preset 7 -global_quality 23 upload/encoded-video/4ef.../...c7d.mp4
```
As stated in the original post, we aren't using hardware decoding; enabling it gives me about a 5% reduction in CPU load. I get another 5% improvement by setting the `preset` to `fast`.
I'm not super familiar with ffmpeg, but the rest of the extra CPU load is coming from the filters. Is there a reason we need to do tone-mapping and all the zscale options? If I trim the command down to the following, I get about a 75% reduction in CPU load.

```shell
/usr/lib/jellyfin-ffmpeg/ffmpeg -init_hw_device qsv=hw -filter_hw_device hw -c:v hevc_qsv -i /config/test.MOV -y -c:v hevc_qsv -c:a aac -movflags faststart -fps_mode passthrough -map 0:0 -map 0:1 -bf 7 -refs 5 -g 256 -v verbose -vf format=nv12,hwupload=extra_hw_frames=64,scale_qsv=1080:-1 -preset fast -global_quality 23 /config/test_OUT.mp4
```
@rishid Thumbnail generation still uses the CPU. If you don't have machine learning set up to use the GPU, that will also use the CPU.
Sure, understood, but specifically the single parent ffmpeg process, which does the video transcoding for encoded videos, is showing CPU usage of ~800% on my machine.
Unsure; perhaps the configuration isn't being passed through correctly?
I completely forgot there are a lot of config knobs for Video Transcoder settings available in Immich - I think all my observations can be controlled already.
For Quick Sync, I got VPP tone-mapping working, but OpenCL doesn't work (something about not being able to allocate memory on the OpenCL device), and Vulkan is almost three times slower because it doesn't support zero-copy the way it does with CUDA. VPP doesn't have the tone-mapping settings we use for other backends, but it's also the fastest option and tailored specifically for Intel devices. I can use that for QSV and let VAAPI use OpenCL (once I figure out how to get it to work).
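For comparison, here's a rough sketch of what an OpenCL tone-mapping chain for a VAAPI device could look like, modeled on Jellyfin's approach: derive an OpenCL device from the VAAPI device, map decoded frames into OpenCL, tone-map there, then map them back for encoding. The filter names are real FFmpeg filters, but the device path and exact chain are assumptions and may need adjusting per device.

```shell
# VAAPI decode -> zero-ish-copy map into OpenCL -> tonemap_opencl ->
# map back to VAAPI for hardware encoding.
ffmpeg -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device opencl=ocl@va \
  -filter_hw_device ocl \
  -hwaccel vaapi -hwaccel_output_format vaapi -i input_hdr.mov \
  -vf "hwmap=derive_device=opencl,tonemap_opencl=tonemap=hable:desat=0:format=nv12,hwmap=derive_device=vaapi:reverse=1" \
  -c:v hevc_vaapi output.mp4
```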