ReplaySorcery
NVENC Support
The current implementation encodes the resulting video using x264, which can lead to unwanted stutter when playing CPU-intensive games. Would it be possible to offer NVENC-Support for those of us with nVidia-GPUs?
It actually uses JPEG encoding for memory compression and does a second pass with x264 later. I did start out using hardware encoding, but I found the quality from NVENC on fast settings (to reduce resource usage) was really terrible compared to similar settings from VA-API, and I also found VA-API (on AMD at least) to be a huge bottleneck despite low resource usage, most likely due to the data-rate limits of the GPU bus (sending frames to it and encoded packets back).
It might still be a possibility in the future, but I've found JPEG encoding to be very fast, barely use any resources, and have fewer issues with hardware quirks, and despite its bad name the quality loss is not that bad.
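To make the two-stage design above concrete: ReplaySorcery does this in-process with libavcodec, but the same idea can be sketched as two ffmpeg command lines built in Python. The function names and the exact flag choices here are my own illustration, not the project's actual code:

```python
def capture_cmd(display=":0", size="1920x1080", fps=30, out="buffer.mkv"):
    """Stage 1: grab the screen and encode each frame as JPEG (mjpeg).
    Intra-only JPEG is very cheap to encode, which is why it is used
    for the in-memory ring buffer."""
    return [
        "ffmpeg", "-f", "x11grab", "-framerate", str(fps),
        "-video_size", size, "-i", display,
        "-c:v", "mjpeg", "-q:v", "5",
        out,
    ]

def save_cmd(src="buffer.mkv", out="replay.mp4"):
    """Stage 2: only when the user saves a replay, re-encode the
    buffered JPEG frames with x264 for a small final file."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-preset", "fast", out,
    ]
```

The point of the split is that the expensive x264 pass only runs on save, not continuously while a game is running.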
I did not mean it as a replacement for the on-the-fly JPEG encoding, just for saving afterwards. Encoding the frames with x264 for saving drives CPU usage up to around 12% on my Ryzen 7 3800X, while the recording itself only uses around 3% CPU.
You might be right about the bottleneck though, I am not sure how well Nvidia GPUs fare in that regard.
Ah ok, might look into that at some point :+1:
It should be noted, however, that the x264 encoding could be sped up: there is a conversion routine (interleaved YUV -> I420) right before it that is slowing it down somewhat.
I'm not sure. While it was not the original intent, there's something really nice about having a software solution that does not need backends for different platforms and does not have weird bugs or quality issues with different hardware or drivers.
I have added VA-API support and had a look into NVENC support however:
- I do not have an nVidia graphics card to test with during development
- it seems like the CUDA filters for scaling/format conversion don't support the RGB-to-YUV conversion that is needed (it might be possible to scale/reformat in VA-API and send the result back to NVENC, but I'm not sure nVidia supports VA-API).
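One workaround for the missing GPU-side RGB-to-YUV conversion would be to do the pixel-format conversion in software and only scale on the GPU. The filter names below (`format`, `hwupload_cuda`, `scale_cuda`) are real ffmpeg filters, but this particular combination is an untested sketch of the idea, not something the project ships:

```python
def nvenc_filter_graph(width, height, sw_convert=True):
    """Build an ffmpeg filter-graph string for feeding NVENC.

    The CUDA filters can resize but (per the discussion above) cannot
    do BGR -> YUV, so the conversion happens CPU-side with `format`
    before the frames are uploaded to the GPU."""
    steps = []
    if sw_convert:
        steps.append("format=nv12")               # CPU-side BGR -> NV12
    steps.append("hwupload_cuda")                 # move raw frames onto the GPU
    steps.append(f"scale_cuda={width}:{height}")  # GPU-side scaling
    return ",".join(steps)
```

The trade-off is exactly the bottleneck discussed elsewhere in this thread: the conversion itself is cheap, but uploading uncompressed frames costs bus bandwidth.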
Nvidia GPUs do support VDPAU when using ffmpeg, and libva-vdpau-driver provides a VA-API backend through VDPAU. Is there no way to use VDPAU for Nvidia GPUs since you've moved back to ffmpeg?
For example, if I try to run:
```
[AVHWDeviceContext @ 0x55c5767a07c0] libva: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
[AVHWDeviceContext @ 0x55c5767a07c0] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
[vp8 @ 0x55c5767e5580] No device available for decoder: device type vaapi needed for codec vp8.
```
But if I change `vaapi` to `vdpau`, it works perfectly. Also, hardware-accelerated video decode in `chromium-vaapi` works on Nvidia GPUs through VDPAU if using `libva-vdpau-driver-chromium`.
Though of course, adding NVENC support would be the best, especially if it means we could use HEVC.
The problem is that `vdpau` is a decode-only API. The VA-API wrapper around it thus only supports decoding.
Eek yeah that is a snag, huh. I didn't think about that part. Well I know you don't have any Nvidia hardware but if you decide to tackle NVENC I'm happy to test out whatever you might have.
I had a play around with this, but with everything I come up with, nVidia ends up being annoying and I hit a roadblock. Between capturing frames and encoding them, I also need to be able to scale and reformat (BGR to YUV) them. If the frames are coming from a hardware-accelerated input (KMS), they also need to be cropped.
Credit where it's due: NVENC supports BGR out of the box and will reformat for you. However, scaling requires having the CUDA compiler (from the nVidia SDK) installed, and cropping just does not seem possible. There is the NPP API, but it does not support BGR frames.
Only options I can come up with are:
- Prevent hardware-accelerated sources. This, however, would probably be a bit of a performance bottleneck, since all the video frames would have to be uploaded to the GPU to be encoded (compared to being copied on the GPU like KMS/VA-API currently do).
- Prevent cropping. If you want hardware acceleration with NVENC, you have to make sure your version of FFmpeg supports CUDA (which requires the nVidia SDK), and it will always record the entire screen.
- Give up and say that until nVidia gets their act together and supports VA-API, hardware acceleration on nVidia's proprietary APIs is not possible (the open-source Nouveau drivers are supported).
Currently I have implemented (but not tested) option 2 in the `nvenc` branch.
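For reference, option 2 would presumably be selected through the usual config file. A hedged sketch of what that might look like (the `videoInput` and `encoder` keys are taken from values quoted later in this thread; the exact accepted values on the `nvenc` branch are an assumption):

```
# replay-sorcery config -- untested sketch for the nvenc branch
videoInput = hwaccel
encoder = nvenc
```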
I have CUDA installed, I'll test it this afternoon.
@gardotd426 did you get around to testing this?
My GPU died and I had to RMA it; my replacement GPU actually shipped today. It will be here in a couple of days and I'll test then.
Any update on this? NVENC support is a big deal for me which is why I'm using OBS replay buffer currently.
I also have an Nvidia GPU and would be happy to help test, but I have no experience with any of this, so I would need written step-by-step instructions for exactly what you need done.
Unfortunately the `ffmpeg-cuda` package in the AUR is out of date and orphaned, so it won't be fixed any time soon. However, I pulled down the PKGBUILD and used `asp` to check out the official repo's ffmpeg PKGBUILD, and comparing the two it seems like I might be able to build the official ffmpeg with full CUDA support by changing/adding 5-6 lines. If I can get it to build, I can test the `nvenc` branch.
It would be easy if nVidia supported VA-API :stuck_out_tongue:
The open-source Nouveau drivers support VA-API, and those work with hardware acceleration, so there is that option.
CUDA seems to be difficult to support. I haven't updated the `nvenc` branch in a while, so it may be out of date. I am tempted to try option 1 instead to see the performance implications, but it would require some changes to how hardware acceleration is enabled and how the implementation is chosen.
I wonder if anyone has worked on a NVENC/CUDA to VA-API emulation layer :thinking:
Besides NVENC, you could also look at NvFBC for full-screen capture?
This guy has a working implementation for OBS: https://gitlab.com/fzwoch/obs-nvfbc
You can patch the drivers to support NvFBC: https://github.com/keylase/nvidia-patch
I already have KMS for full-screen capture. The issue is the steps in between (cropping, scaling, pixel-format conversion), which are more difficult because I don't have an nVidia card and FFmpeg doesn't come with CUDA on most distros.
Could you maybe enable a build flag or something that would allow the user to use ffmpeg if their distro comes with CUDA (or they've built ffmpeg themselves)?
Anyway, I've got ffmpeg w/ cuda support being built now, so I'll try that nvenc branch if ffmpeg builds successfully.
So I've tried it and it fails; it seems the problem is that the Nvidia drivers don't use DRM-KMS for modesetting.
From an nvidia employee:
the NVIDIA X driver predates drm and doesn’t use drm-kms for modesetting.
If you want to capture the contents of the display in an efficient way, your best bet is probably to use the NVIDIA Capture SDK
For this reason, kmsgrab does not work on Nvidia. So KMS is out.
Anyway, with `ffmpeg-cuda`:

`replay-sorcery` (`nvenc` branch) with the encoder set to `nvenc` and `videoInput = hwaccel`:
```
FFmpeg version: n4.4
[kmsgrab @ 0x55906218bb80] No usable planes found.
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/device/ffdev.c:119 (rsFFmpegDeviceOpen)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/device/kmsdev.c:47 (rsKmsDeviceCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/device/device.c:50 (rsVideoDeviceCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:162 (main)
Unused option: framerate
Failed to create KMS device: Invalid argument
Function not implemented
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:218 (main)
```
With `videoInput = kms`:
```
FFmpeg version: n4.4
[kmsgrab @ 0x55bcd8cadb80] No usable planes found.
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/device/ffdev.c:119 (rsFFmpegDeviceOpen)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/device/kmsdev.c:47 (rsKmsDeviceCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:162 (main)
Unused option: framerate
Invalid argument
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:218 (main)
```
With `videoInput = x11`:
```
FFmpeg version: n4.4
X11 version: 11.0
X11 vendor: The X.Org Foundation v12013000
[h264_nvenc @ 0x55e3d214f940] Filter graph: hwmap=derive_device=cuda,scale_cuda=2560:1440
[Parsed_hwmap_0 @ 0x55e3d216af00] Mapping requires a hardware context (a device, or frames on input).
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/ffenc.c:285 (rsFFmpegEncoderOpen)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/nvenc.c:52 (rsNVEncoderCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:165 (main)
[Parsed_hwmap_0 @ 0x55e3d216af00] Failed to configure output pad on Parsed_hwmap_0
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/ffenc.c:285 (rsFFmpegEncoderOpen)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/nvenc.c:52 (rsNVEncoderCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:165 (main)
[AVFilterGraph @ 0x55e3d216b6c0] Failed to configure filter graph: Invalid argument
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/ffenc.c:286 (rsFFmpegEncoderOpen)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/encoder/nvenc.c:52 (rsNVEncoderCreate)
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:165 (main)
Unused option: qp
Invalid argument
 - /home/matt/nvme2/dev/replaysorcery/bin//home/matt/dev/replaysorcery/src/main.c:218 (main)
```
Obviously I've confirmed that CUDA is working with ffmpeg, with `ffmpeg -hwaccel cuda -f x11grab -s 2560x1440 -i :0 -c:v h264_nvenc output.mp4`. I have also tested a full-hardware transcode; both work.
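For anyone else wanting to run the same sanity check, the command above can be wrapped in a small Python helper. This is just a convenience around the exact invocation quoted in this thread; actually running it requires an ffmpeg build with CUDA/NVENC enabled:

```python
import subprocess

def x11grab_nvenc_cmd(display=":0", size="2560x1440", out="output.mp4"):
    """Build the NVENC sanity-check command: software x11 grab,
    hardware NVENC encode. Returns the argv list without running it."""
    return [
        "ffmpeg", "-hwaccel", "cuda",
        "-f", "x11grab", "-s", size, "-i", display,
        "-c:v", "h264_nvenc", out,
    ]

# To actually run it (requires ffmpeg with CUDA/NVENC and a live X display):
# subprocess.run(x11grab_nvenc_cmd(), check=True)
```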
That is with x11 grabbing, though. Preferably, for performance/resource-usage reasons, we would grab the frames on the GPU and crop/scale/convert/encode all on the GPU, like we currently do with KMS and VA-API, both of which are supported by AMD, Intel, and the open-source Nouveau drivers (so it is possible on nVidia hardware). It is just the nVidia proprietary drivers and their stubborn refusal of open source and standards that's making this difficult.
I'm thinking of giving in and just using NVENC with grabbing/cropping/scaling/converting all done in software.
That is with x11 grabbing though.
x11grab works on Nvidia, it's kmsgrab that doesn't. Unless I'm misunderstanding what you're saying there.
I'm thinking of giving in and just using NVENC with grabbing/cropping/scaling/converting all done in software.
That could at least maybe be a stopgap for NV users until Nvidia comes around or a better solution is found
x11grab works on Nvidia, it's kmsgrab that doesn't. Unless I'm misunderstanding what you're saying there.
X11 grabbing is done in software, KMS grabbing is done in hardware. Hence the rant above
Ohhhh, you were referring to me saying:
Obviously I've confirmed that CUDA is working with ffmpeg with `ffmpeg -hwaccel cuda -f x11grab -s 2560x1440 -i :0 -c:v h264_nvenc output.mp4`. I also have tested full-hardware transcode, both work.
Yeah I know that was with x11 grabbing, I was just including that to demonstrate that I did have a working ffmpeg with cuda support enabled.
How much would "grabbing/cropping/scaling/converting all done in software" cost in overhead?
When I tried it with AMD (before KMS support), the overhead in CPU/GPU usage was very minimal, but it still slowed down most games. My theory at the time was the amount of bus bandwidth used to send raw frames to the GPU.
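That bus-bandwidth theory is easy to sanity-check with back-of-envelope numbers. A small sketch (it assumes 4-byte BGRA frames; real pixel formats and transfer overheads will differ):

```python
def raw_frame_bandwidth_mb_s(width, height, fps, bytes_per_pixel=4):
    """Uncompressed frame data rate in MB/s, e.g. what a software
    grabber must push over the bus to a GPU encoder."""
    return width * height * bytes_per_pixel * fps / 1e6

# 1440p at 60 fps in BGRA is roughly 885 MB/s of raw frame data --
# a plausible source of contention while a game is also using the bus.
print(raw_frame_bandwidth_mb_s(2560, 1440, 60))  # -> 884.736
```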