[Feature] hardware encoding
I would love to see NVIDIA NVENC support for transcoding and generation. This would also work with AMD and Intel encoders, and could speed up the generation process.
For file transcoding we use x264 with the faster preset, and that's probably the only place nvenc might be quicker (not sure about quality). BUT file transcoding IMHO is not that needed anymore, since we now have live stream transcoding for unsupported files. (IMHO the transcode options in the generated content section should be left unticked in 99% of cases.)
For live transcoding we produce VP9 WebM files, which are not supported by hardware encoders except, I think, through Intel VAAPI, and I'm not sure about the stability/quality of that either.
Finally, for the generated previews and markers we use x264 with the veryslow preset to get the highest quality, since they are only generated once but viewed many times. If you wanted to make generation faster, that's where we could perhaps offer to change the veryslow preset to medium or even fast and still get better quality/performance than hardware encoders. That's of course only for anyone willing to trade quality for speed, and only as an extra option, not as the default.
The only way live streaming would even remotely be viable here is with hardware acceleration. Software-bound encoding is a no-go, and VP9 is even worse. I am using an FX-6300; it was not optimized for these tasks, to put it kindly. The people asking for this feature need it. They do not care about the fabled and scary quality loss.
I'd add that since Pascal on the Nvidia side, hardware encoding with their GPUs is leaps and bounds better: comparable to CPU-based x264 up to the medium preset, I believe, while being much faster.
I too would be very pleased about this feature. It does not have to be as user-friendly as, for example, Jellyfin's hardware encoding support. It could be an advanced setting to add parameters for ffmpeg (playback/live transcoding or preview generation). If it causes problems, users could just set it back to default, but advanced users would be able to fiddle with it a little more.
It's quite easy to pass vaapi support to docker containers, and hardware encoding would greatly benefit my high cpu loads.
Can someone share command-line which is being used when generating such video previews. Only thing I see at the moment is this when generation fails on WMVs sometimes.
"ffmpeg.exe -v error -xerror -ss 85.32 -i F:\\Downloads\\1.wmv -t 0.75 -max_muxing_queue_size 1024 -y -c:v libx264 -pix_fmt yuv420p -profile:v high -level 4.2 -preset fast -crf 21 -threads 4 -vf scale=640:-2 -c:a aac -b:a 128k -strict -2 C:\\Users\\username\\.stash-data\\tmp\\preview013.mp4>: F:\\Downloads\\1.wmv: corrupt decoded frame in stream 1\r\n""
I would like to play around and at least see how GPU encode would help.
Took me 6 hours to generate video previews for 700 videos of approximately 0.5-4 GB each: 20 segments, 3% skip on both ends, fast preset, on an i5-4460.
> Can someone share command-line which is being used when generating such video previews. Only thing I see at the moment is this when generation fails on WMVs sometimes.
>
> `ffmpeg.exe -v error -xerror -ss 85.32 -i F:\Downloads\1.wmv -t 0.75 -max_muxing_queue_size 1024 -y -c:v libx264 -pix_fmt yuv420p -profile:v high -level 4.2 -preset fast -crf 21 -threads 4 -vf scale=640:-2 -c:a aac -b:a 128k -strict -2`
That is the command for generating a preview segment. In your case that's run 20 times for each video, with the result spliced together into the final preview. You can cut down on generation time by choosing fewer segments, and setting encoding preset to ultrafast.
We're investigating hardware acceleration for transcoding, but I have no idea whether it will be useful for generation, seeing as hardware acceleration likely has more startup latency.
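To make the splice step concrete, here is a rough sketch in Go of the flow described above: one short ffmpeg encode per segment, then a concat-demuxer pass to join them. The function name, the even-spread seek logic, and the fixed durations are illustrative assumptions, not stash's actual generator code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// generatePreview encodes `count` short segments spread across the input,
// then splices them together with ffmpeg's concat demuxer. The 0.75s
// segment length matches the command shown above; the rest is illustrative.
func generatePreview(input, output string, count int, totalSeconds float64) error {
	tmp := os.TempDir()
	listPath := filepath.Join(tmp, "segments.txt")
	list, err := os.Create(listPath)
	if err != nil {
		return err
	}

	for i := 0; i < count; i++ {
		// Naive even spread; the real generator also skips both ends.
		start := totalSeconds * float64(i) / float64(count)
		seg := filepath.Join(tmp, fmt.Sprintf("preview%03d.mp4", i))
		cmd := exec.Command("ffmpeg",
			"-v", "error", "-ss", fmt.Sprintf("%.2f", start), "-i", input,
			"-t", "0.75", "-y",
			"-c:v", "libx264", "-preset", "fast", "-crf", "21",
			"-vf", "scale=640:-2", "-c:a", "aac", "-b:a", "128k",
			seg)
		if err := cmd.Run(); err != nil {
			return err
		}
		fmt.Fprintf(list, "file '%s'\n", seg)
	}
	list.Close()

	// Join the segments without re-encoding.
	return exec.Command("ffmpeg", "-v", "error", "-f", "concat", "-safe", "0",
		"-i", listPath, "-c", "copy", "-y", output).Run()
}

func main() {
	if err := generatePreview("input.mp4", "preview.mp4", 20, 1800); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Fewer segments means fewer ffmpeg invocations, which is why reducing the segment count cuts generation time roughly linearly.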
Just tried NVENC in the HandBrake app to see the difference on some random file. After 1 minute, the CPU had encoded only 1 min 30 s of a simple 220 MB 720p WMV file. In comparison, the RTX 2070S managed to encode the entire 5-minute video within that same minute.
Currently can I create my own build to edit hardcoded command-line and make use of NVENC? Possible?
> Currently can I create my own build to edit hardcoded command-line and make use of NVENC? Possible?
Seconded.
Any news on this?
This is how Jellyfin handles GPU transcoding GUI-wise.
And this is the ffmpeg command
```
ffmpeg -vaapi_device /dev/dri/renderD128 -i file:"INPUT.mkv" -map_metadata -1 -map_chapters -1 -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 h264_vaapi -b:v 6621920 -maxrate 6621920 -bufsize 13243840 -force_key_frames:0 "expr:gte(t,0+n_forced*3)" -g 72 -keyint_min 72 -sc_threshold 0 -vf "format=nv12|vaapi,hwupload,scale_vaapi=w=1022:h=574:format=nv12" -start_at_zero -vsync -1 -codec:a:0 aac -ac 6 -ab 256000 -copyts -avoid_negative_ts disabled -f hls -max_delay 5000000 -hls_time 3 -individual_header_trailer 0 -hls_segment_type mpegts -start_number 0 -hls_segment_filename "OUTPUT.ts" -hls_playlist_type vod -hls_list_size 0 -y "SOMEPLAYLISTIDONTKNOW.m3u8"
```
It is very performant and easy on the cpu
Jellyfin uses a different player, so HLS is supported; that's not the case for stash, as jwplayer's HLS support depends on the browser AFAIK. This makes it more complicated to adapt.
For generating previews I found that this really doesn't help much. Since previews are converted only 0.75 seconds at a time, the overhead of creating and concatenating (twelve 0.75-second clips) is probably a lot more than generating these individual bursts. My GPU graph showed only very sparse spikes of usage (as opposed to continuous usage when converting larger files), even with 12 parallel tasks, while the CPU was still at 100% the whole time (doing the preparation and other processing). Overall it did not help much.
If anyone wants to test, change `"-c:v", "libx264"` to `"hevc_nvenc"` here.
There are a few subtle issues involved in hardware encoding beyond what's been mentioned here (I rambled about them a bit in https://github.com/stashapp/stash/issues/894#issuecomment-867616713):
- Hardware encoders are pickier about input formats, color spaces, etc.
- ffmpeg can handle the conversion, but that's in-software, so you're back to CPU-intensive work even with hardware encoding.
- Setting that up means keeping lists of all the formats each hardware encoder supports, comparing the format of the source file, and invoking inline conversion only when needed.
- Hardware encoding on consumer-grade GPUs is usually artificially limited to no more than N encodes at once (NVIDIA limits it to 2); the user can patch the driver to lift this, but it's an annoyance regardless. There's no way to auto-detect the current limit either; the drivers won't report it.
- Fallback to software needs to be implemented to handle cases where the hardware encoder fails (bogus input format, too many encodes in progress, solar flares, etc.); see the sketch after this list.
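To illustrate that last point, a minimal fallback loop might look like the sketch below (hedged: the encoder list, arguments, and error handling are simplified placeholders, not stash code):

```go
package encode

import (
	"fmt"
	"os/exec"
)

// encodeWithFallback tries the hardware encoder first and retries with
// libx264 when ffmpeg exits non-zero (unsupported input format, driver
// session limit reached, etc.). Illustrative sketch only.
func encodeWithFallback(input, output string) error {
	var lastErr error
	for _, codec := range []string{"h264_nvenc", "libx264"} {
		cmd := exec.Command("ffmpeg", "-v", "error", "-y",
			"-i", input, "-c:v", codec, output)
		if lastErr = cmd.Run(); lastErr == nil {
			return nil
		}
		// Hardware attempt failed; the loop retries with the software encoder.
	}
	return fmt.Errorf("all encoders failed for %s: %w", input, lastErr)
}
```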
Now on the plus side:
- Hardware decoding could potentially speed things up during hardware encoding if:
- the source format is supported by the decoder (hardware decoders usually do support more formats than the encoders), and
- the entire job can be done in a single invocation of ffmpeg (the biggest speedup comes from keeping all the work and data on the GPU, since that avoids expensive copies between main and video memory). From my understanding stash currently invokes ffmpeg multiple times (once per desired segment), and invoking it a single time to do the same thing is slower because it reads the entire video instead of just seeking to each segment, so again this speedup might not be worth it unless a way can be found to make ffmpeg more efficient about this.
I don't think hardware decoding will help at all at the moment, though, given how ffmpeg is currently used. Reading compressed data and decoding it on-CPU, versus initializing the GPU decoder, reading the compressed data, shipping it to GPU memory, waiting for the decode and then shipping the output back to main memory: I think software-only is faster in that case.
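For reference, a fully on-GPU pipeline in a single invocation could look like the following sketch, assuming an NVIDIA card and an ffmpeg build with the CUDA filters compiled in (exact filter names and availability vary by ffmpeg version; untested, illustrative only):

```go
package transcode

import "os/exec"

// fullGPUTranscode keeps decode, scale, and encode in video memory for the
// whole job: one ffmpeg invocation, no per-frame copies back to main memory.
func fullGPUTranscode(input, output string) error {
	return exec.Command("ffmpeg",
		"-hwaccel", "cuda", "-hwaccel_output_format", "cuda", // decode on GPU, keep frames there
		"-i", input,
		"-vf", "scale_cuda=1280:-2", // scale on GPU
		"-c:v", "h264_nvenc", // encode on GPU
		"-y", output).Run()
}
```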
I think the problem is not that the software decoder is bad, but, for instance, I have files that ramp my CPU cores up to 100% and interfere with other services that also need those cores (the very same thing happens when transcoding in software with Plex).
I have a pretty old CPU (a 4790K) and it has a lot of trouble playing some files because the CPU simply can't keep up. The GPU, however, is a pretty decent one (a GTX 1070) and has no problem doing multiple 4K hardware transcodes simultaneously without my CPU ramping up to 100%.
I understand that this is probably too hard to implement (or that people don't see the benefits of it) and thus will probably never come to Stash, but I wish it would. Yes, of course I can transcode by generating the files, but that takes up disk space.
About 1/3 of my library is HEVC, in either 720p/1080p. The software transcoder starts to struggle if I try outputting to anything higher than 720p. I use Firefox on everything, which doesn't support HEVC for licensing reasons, so it's always transcoding and tying up the host CPU.
I experimented with building Stash on top of the nvidia/cuda docker stack and was able to achieve hardware-accelerated decoding and encoding. I'm pretty impressed with the results. I let a 1080p HEVC video stream in H264 for about 5 minutes: CPU load stayed around 1.00 while FFmpeg quickly filled the buffer and throttled the GPU. I noticed the biggest difference when using both NVDEC and NVENC; enabling just one didn't seem to affect CPU usage much. I'm using a GTX 1650 with a Ryzen 5 3600.
I don't know Golang, my changes are pretty hacky and this isn't robust enough for a PR. But it works as a proof of concept and I'm sure someone wiser can implement this properly. I did notice unintended behavior when accessing stash over a reverse proxy + SSL. FFMPEG would peg the GPU at 100% then fail after about 3 minutes of playing a video. This is probably due to my own nginx misconfiguration, it did not occur when accessing Stash directly.
Here is my modified Dockerfile from `docker/build/x86_64/Dockerfile`.
I changed the video codec in `pkg/ffmpeg/codec.go` on line 14:

```go
VideoCodecLibX264 VideoCodec = "h264_nvenc"
```
And the ffmpeg arguments for `StreamFormatH264` in `pkg/ffmpeg/stream.go`, starting on line 68. I found the `+` in front of `frag_keyframe` was strictly necessary, but the rest I tuned to preference because the default quality was quite poor.
```go
StreamFormatH264 = StreamFormat{
	codec:    VideoCodecLibX264,
	format:   FormatMP4,
	MimeType: MimeMp4,
	extraArgs: []string{
		"-acodec", "aac",
		"-pix_fmt", "yuv420p",
		"-movflags", "+frag_keyframe+empty_moov",
		"-preset", "llhp",
		"-rc", "vbr",
		"-zerolatency", "1",
		"-temporal-aq", "1",
		"-cq", "24",
	},
}
```
Running `make docker-build` after this should produce a Stash container capable of GPU encoding. For decoding, I set `-hwaccel auto` as a setting in the interface under "FFmpeg LiveTranscode Input Args". Setting it globally like this broke the other transcode formats where hardware-accelerated decoding is not possible (like WebM, the default transcode target). I commented out the WebM scene routes and endpoints in `internal/api/routes_scene.go` as a workaround, so it always falls back to MP4.
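A less invasive version of that workaround would presumably gate the decoder flags on the output codec rather than removing routes; something like this hypothetical helper (names invented for illustration, not stash's actual code):

```go
package ffmpeg

// hwInputArgs returns decoder-side arguments only for targets that have a
// hardware path; VP9/WebM keeps a plain software pipeline.
func hwInputArgs(outputCodec string) []string {
	switch outputCodec {
	case "h264_nvenc":
		return []string{"-hwaccel", "auto"}
	default: // e.g. libvpx-vp9: forcing hwaccel here is what broke WebM
		return nil
	}
}
```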
One of the obstacles mentioned by @willfe was the transcode limit imposed by the Nvidia drivers. I didn't try this because my host is already patched, but the transcode limit patch can be integrated into docker containers so the user doesn't have to bother with it.
I think the missing piece to a possible all-in-one Stash container for hardware transcoding is the logic to determine when to use it, which is tricky depending on the particular architecture of GPU the user has - even with the Nvidia CUDA tools.
Edit: Wow, preview generation is almost instantaneous.
Would a similar technique allow for QuickSync transcoding?
> Would a similar technique allow for QuickSync transcoding?
AFAIK QuickSync leverages libva, so as long as the host has the supporting libraries, it would just be a matter of exposing the video card to the container like this: `--device /dev/dri/renderD128`.
There is an open PR (https://github.com/stashapp/stash/pull/3419), BTW, if anyone is interested in testing or providing feedback.
> Would a similar technique allow for QuickSync transcoding?
>
> AFAIK QuickSync leverages libva, so as long as the host has the supporting libraries, it would just be a matter of exposing the video card to the container like this: `--device /dev/dri/renderD128`.
exactly what I was hoping you'd say
Edit: maybe getting ahead of myself, but this is the guide I used for exposing the card with Plex (it shows commands to list available devices etc.; it's Synology-specific but may work for others):
https://medium.com/@MrNick4B/plex-on-docker-on-synology-enabling-hardware-transcoding-fa017190cad7
I have an unusual NAS with a Rockchip RK3399 ARM CPU. It does support hardware decoding with the h264_rkmpp and hevc_rkmpp decoders; I believe I need to compile ffmpeg myself to use them, which I have not bothered with yet.
Would it be possible to have a setting to specify extra command-line arguments for edge cases like this?
Great news: hardware encoding is now merged and ready for testing for anyone willing. It should work for:
- NVIDIA GPUs (h264_nvenc). The docker image can be built with `make docker-cuda-build`; this produces the docker tag `stash/cuda-build:latest`. You will additionally need to specify the args `--runtime=nvidia --gpus all --device /dev/nvidiactl --device /dev/nvidia0`.
- Intel (h264_qsv, vp9_qsv). For docker you must use the CUDA build and the arg `--device=/dev/dri`.
- Raspberry Pi (newer) (h264_v4l2m2m)
- AMD Linux and most VAAPI-supported platforms (h264_vaapi, vp9_vaapi) (hopefully). For docker you must use the arg `--device=/dev/dri`.

Note that RPi and VAAPI don't support direct file transcode for h264 (mp4), so h264 hardware transcoding is only used for HLS. Also note that the normal Docker build only supports VAAPI and v4l2m2m.

You can check the logs for which codecs were found and enabled, and check the debug log for why the others failed.
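Presumably the probe works by test-encoding a tiny synthetic input with each candidate encoder and checking whether ffmpeg succeeds; a sketch of that idea follows (an assumption about the approach, not the actual InitHWSupport implementation; the device arguments are examples):

```go
package ffmpeg

import "os/exec"

// hwCodecSupported reports whether ffmpeg can actually open the given
// encoder by encoding a single frame of a synthetic test source and
// discarding the result.
func hwCodecSupported(encoder string, globalArgs ...string) bool {
	args := append([]string{}, globalArgs...) // e.g. "-vaapi_device", "/dev/dri/renderD128"
	args = append(args,
		"-v", "error",
		"-f", "lavfi", "-i", "testsrc=duration=0.1:size=640x360:rate=30",
		"-frames:v", "1",
		"-c:v", encoder,
		"-f", "null", "-")
	return exec.Command("ffmpeg", args...).Run() == nil
}
```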
Having this enabled on my Unraid 6.11.5 server (Intel Celeron J3455) reports no available HW codecs:

```
23-03-10 13:10:57 Info [InitHWSupport] Supported HW codecs:
```

Plex manages to use hw acceleration just fine, so I'm not sure where to start looking here. My docker-compose.yml already includes the device passthrough:

```yaml
devices:
  - "/dev/dri/card0:/dev/dri/card0"
  - "/dev/dri/renderD128:/dev/dri/renderD128"
```

Any ideas/tips on how to get more information about this?
Do you also have the intel-gpu-top plugin installed and have rebooted afterwards?
> Having this enabled on my Unraid 6.11.5 server (Intel Celeron J3455) reports no available HW codecs. Plex manages to use hw acceleration just fine, so not sure where to start looking here. My docker-compose.yml already includes the device passthrough. Any ideas/tips on how to get more information about this?
When stash starts, go into the web UI -> Settings -> Logs, set the log level to `debug`, find the entry with codec `h264_qsv`, and send the specific error.
> Do you also have the intel-gpu-top plugin installed and have rebooted afterwards?
No, I didn't see this in the docs or in the commit. Is it used by stash or just for debugging? As the Linux on Unraid servers has no package manager, it's kinda hard to build packages for it yourself.
> When stash starts, go into the web UI -> Settings -> Logs, set the log level to `debug`, find the entry with codec `h264_qsv`, and send the specific error.
Switching to debug or even trace shows nothing more from the server startup. When starting stash, the only hint of HW acceleration is `[InitHWSupport] Supported HW codecs:`. When I try to live transcode, it works, but as slowly as with CPU only, and the logs show nothing related to HW acceleration (tried HLS, WebM, and DASH; all run slowly, apparently without HW acceleration):
```
2023-03-10 13:30:03 Debug [transcode] starting transcode for 24d73d4def4e2e9ab797d46e28b1292c_dash-v_1080 at segment #0
2023-03-10 13:30:03 Debug [transcode] starting transcode for 24d73d4def4e2e9ab797d46e28b1292c_dash-a_1080 at segment #0
2023-03-10 13:30:02 Debug [transcode] starting transcode for 24d73d4def4e2e9ab797d46e28b1292c_dash-v_1080 at segment #0
2023-03-10 13:30:02 Debug [transcode] starting transcode for 24d73d4def4e2e9ab797d46e28b1292c_dash-a_1080 at segment #0
2023-03-10 13:30:02 Debug [transcode] returning DASH manifest for scene 4711
2023-03-10 13:29:53 Debug [transcode] returning DASH manifest for scene 4711
2023-03-10 13:28:30 Debug [transcode] starting transcode for 24d73d4def4e2e9ab797d46e28b1292c_hls at segment #0
2023-03-10 13:28:29 Debug [transcode] returning HLS manifest for scene 4711
2023-03-10 13:28:10 Debug [transcode] streaming scene 4711 as video/webm
2023-03-10 13:28:08 Debug [transcode] streaming scene 4711 as video/webm
```
> Do you also have the intel-gpu-top plugin installed and have rebooted afterwards?
>
> No, I didn't see this in the docs or in the commit. Is it used by stash or just for debugging? As the Linux on Unraid servers has no package manager, it's kinda hard to build packages for it yourself.
AFAIK Unraid doesn't have drivers for QSV by default? I'd been looking into it before putting my Plex install into a container and moving it over, and came across this guide. Figured I'd probably have to do the same thing for this container, no? I've been asleep most of the time this release has been out, so I haven't had a chance to try it with Stash.
~~https://forums.unraid.net/topic/77943-guide-plex-hardware-acceleration-using-intel-quick-sync/~~ sorry, that's the original hard way to do it; this is the easy way:
https://forums.unraid.net/topic/131548-add-intel-igpu-qsv-quick-sync-encoding-to-official-plex-media-server-the-easy-way/
Just installed both the GPU Statistics and Intel-GPU-Top apps for Unraid, and the installation log said `Intel Kernel Module already enabled`, so I guess it already has the drivers etc.
The other part of the tutorial is already covered by my shared docker-compose.yml config, where I pass through the devices.
Still the same result for stash in the log after startup.
I use Unraid and already had the two plugins installed: Intel GPU Top and GPU Statistics.
I set the devices:

```yaml
devices:
  - "/dev/dri/card0:/dev/dri/card0"
  - "/dev/dri/renderD128:/dev/dri/renderD128"
```

But only my CPU is used :/ I don't have dev skills, but I can test things if someone tells me what to try!
`qsv` is not loaded for me with device passthrough in docker compose, which works with Jellyfin:

```yaml
devices:
  - /dev/dri:/dev/dri
```

Here is the stash log:
```
stash is running at http://localhost:9999/
2023-03-12 13:11:35 Info stash is listening on 0.0.0.0:9999
2023-03-12 13:11:35 Info stash version: v0.19.1-56-g9aa7ec57 - Official Build - 2023-03-10 22:31:13
2023-03-12 13:11:35 Info [InitHWSupport] Supported HW codecs:
2023-03-12 13:11:35 Debug [InitHWSupport] Codec vp9_vaapi not supported. Error output:
[AVHWDeviceContext @ 0x7fb07a26de00] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
Failed to set value '/dev/dri/renderD128' for option 'vaapi_device': I/O error
Error parsing global options: I/O error
2023-03-12 13:11:35 Debug [InitHWSupport] Codec vp9_qsv not supported. Error output:
Device creation failed: -12.
Failed to set value 'qsv=hw' for option 'init_hw_device': Out of memory
Error parsing global options: Out of memory
2023-03-12 13:11:35 Debug [InitHWSupport] Codec h264_v4l2m2m not supported. Error output:
[h264_v4l2m2m @ 0x7efe12e10880] Could not find a valid device
[h264_v4l2m2m @ 0x7efe12e10880] can't configure encoder
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
2023-03-12 13:11:35 Debug [InitHWSupport] Codec h264_vaapi not supported. Error output:
[AVHWDeviceContext @ 0x7fd6cc337e00] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
Failed to set value '/dev/dri/renderD128' for option 'vaapi_device': I/O error
Error parsing global options: I/O error
2023-03-12 13:11:35 Debug [InitHWSupport] Codec h264_qsv not supported. Error output:
Device creation failed: -12.
Failed to set value 'qsv=hw' for option 'init_hw_device': Out of memory
Error parsing global options: Out of memory
2023-03-12 13:11:35 Debug [InitHWSupport] Codec h264_nvenc not supported. Error output:
Unrecognized option 'rc'.
Error splitting the argument list: Option not found
2023-03-12 13:11:34 Debug Reading scraper configs from /root/.stash/scrapers
2023-03-12 13:11:34 Debug Reading plugin configs from /root/.stash/plugins
2023-03-12 13:11:34 Info using config file: /root/.stash/config.yml
```
Running `ffmpeg -encoders` on the host outputs `h264_qsv`, whereas within a shell in the stash docker container it does not.
> 2023-03-12 13:11:35 Debug [InitHWSupport] Codec h264_qsv not supported. Error output: Device creation failed: -12. Failed to set value 'qsv=hw' for option 'init_hw_device': Out of memory Error parsing global options: Out of memory
Multiple things here. Firstly, I checked Alpine Linux (what the stash docker image is built on): they don't compile any hardware codecs into ffmpeg. You should try the CUDA build, which is built on Ubuntu and should have most hardware codecs. That said, the error says `Out of memory`, so I don't have high hopes that switching to Ubuntu will work any better.
Binhex uses Arch as their base image for Plex.