jetson-ffmpeg
Request for hardware accelerated filtering
Everyone can agree this project is quite good at what it does so far. Decoding and encoding logic works great, and the only remaining issues are in NVIDIA's own bundled libraries. So we finally have something awesome and never need to look back at gst again, right? Well... if you plan on doing anything more than just transcoding between formats, unfortunately you're out of luck. Even simple scaling from 1080p to 720p on a Jetson Nano will pin one of the CPU cores at 80%~100% load.
Tl;dr: what I'm trying to say is that it would be very nice to use some of that sweet GPU power for filtering the video. Gst has support for hardware-accelerated filtering, so it should be possible to port to ffmpeg. If anyone would be willing to take a crack at this, please start with a scaling filter; something like npp_scale or even decoder scaling would be much appreciated.
I am definitely in favor. I would like something like scale_npp or cuda_yadif (deinterlace). But whether that's possible, I have no idea.
If gstreamer can do it, so should ffmpeg. In gst, resizing is done via caps on the video codec; see the sketch below.
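For reference, a minimal sketch of such a pipeline from C. The pipeline string and element names (nvv4l2decoder, nvvidconv, nvv4l2h264enc) are assumptions based on a typical L4T setup and untested; the capsfilter after nvvidconv is what requests the rescale:

```c
// Minimal sketch: scaling on Jetson via GStreamer caps.
// Assumes the L4T nvv4l2 plugins are installed; error handling trimmed.
#include <gst/gst.h>

int main(int argc, char **argv)
{
    gst_init(&argc, &argv);

    // The caps after nvvidconv ("width=1280,height=720") are what make
    // the converter hardware perform the resize.
    GstElement *pipeline = gst_parse_launch(
        "filesrc location=in.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! "
        "nvvidconv ! video/x-raw(memory:NVMM),width=1280,height=720 ! "
        "nvv4l2h264enc ! h264parse ! qtmux ! filesink location=out.mp4",
        NULL);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);

    // Block until the stream finishes or errors out.
    GstBus *bus = gst_element_get_bus(pipeline);
    GstMessage *msg = gst_bus_timed_pop_filtered(
        bus, GST_CLOCK_TIME_NONE,
        (GstMessageType)(GST_MESSAGE_EOS | GST_MESSAGE_ERROR));

    if (msg)
        gst_message_unref(msg);
    gst_object_unref(bus);
    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    return 0;
}
```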
@JezausTevas wouldn't scale_cuda work for this? https://ffmpeg.org/ffmpeg-filters.html#scale_005fcuda
@vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation
> please start with a scaling filter; something like npp_scale or even decoder scaling would be much appreciated.

> @JezausTevas wouldn't scale_cuda work for this? https://ffmpeg.org/ffmpeg-filters.html#scale_005fcuda

See this fork and pull request:
- https://github.com/Keylost/jetson-ffmpeg/pull/8

> @vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation

See this fork and pull request:
- https://github.com/Keylost/jetson-ffmpeg/pull/7
> it would be very nice to use some of that sweet GPU power for filtering the video
For some hardware-accelerated operations on Jetson, see this API reference:
- https://docs.nvidia.com/jetson/l4t-multimedia/classNvVideoConverter.html
Technically it is not the GPU but a dedicated hardware unit.
The jetson-ffmpeg implementation uses the Encoder/Decoder video APIs; the Converter is from the same "family".
Apart from that, there is an ISP on Jetson, but it is more for (pre-)processing raw data from the camera.
Thank you for your replies to my reply :-)
I am still trying to clarify in my head the whole Jetson/ffmpeg scenario...
> @vectronic unfortunately the Jetson Nano does not have support for CUDA. If it had, there wouldn't be a need for a custom implementation
Jetson Nano does support CUDA in general:
- https://docs.nvidia.com/jetson/archives/r35.1/DeveloperGuide/text/AR/JetsonSoftwareArchitecture.html
- https://docs.nvidia.com/cuda/archive/10.2/cuda-for-tegra-appnote/index.html
- https://developer.nvidia.com/blog/simplifying-cuda-upgrades-for-nvidia-jetson-users/
So does this statement mean "Jetson Nano doesn't support CUDA based frame output from the nvmpi decoder" or "Jetson Nano doesn't have decode implemented in CUDA"?
I am using ffmpeg 6.0 and a fork of jetson-ffmpeg here: https://github.com/Keylost/jetson-ffmpeg
I am successfully using scale_cuda on a Jetson Nano:
```
ffmpeg -c:v h264_nvmpi -i in.mp4 \
  -filter_complex "[0:v]hwupload_cuda[gpu];[gpu]scale_cuda=w=1200:h=1200[scaled];[scaled]hwdownload,format=yuv420p" \
  -c:v h264_nvmpi out.mp4
```
This, however, suffers from the issue raised here related to excessive memory transfers:
https://github.com/jocover/jetson-ffmpeg/issues/67#issue-792081536
As far as I can tell, although the Jetson nvmpi codec uses the Jetson dedicated codec hardware blocks, the output of the decoder/input of the encoder should be usable directly with CUDA. This is due to the iGPU SoC architecture, where the GPU and CPU share the same DRAM.
https://docs.nvidia.com/cuda/archive/10.2/cuda-for-tegra-appnote/index.html#memory-selection
There is an example of decoded frames being used by CUDA without extra copying here:
https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_02_video_dec_cuda.html
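The core of that zero-copy path, reduced to a rough sketch (assuming nvbuf_utils.h and the CUDA driver API's EGL interop as used in that sample; process_decoded_frame is a hypothetical helper, error handling omitted, and a current CUDA context is assumed):

```c
// Hedged sketch: map a decoder output dmabuf fd into CUDA via EGL,
// without a CPU copy, roughly as in the 02_video_dec_cuda sample.
#include <cuda.h>
#include <cudaEGL.h>      // cuGraphicsEGLRegisterImage, CUeglFrame
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <nvbuf_utils.h>  // NvEGLImageFromFd, NvDestroyEGLImage

void process_decoded_frame(EGLDisplay egl_display, int dmabuf_fd)
{
    // Wrap the decoder's dmabuf fd as an EGLImage.
    EGLImageKHR egl_image = NvEGLImageFromFd(egl_display, dmabuf_fd);

    // Register the EGLImage with CUDA and fetch the frame description.
    CUgraphicsResource resource = NULL;
    cuGraphicsEGLRegisterImage(&resource, egl_image,
                               CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);

    CUeglFrame egl_frame;
    cuGraphicsResourceGetMappedEglFrame(&egl_frame, resource, 0, 0);

    // Depending on egl_frame.frameType, the NV12 planes are exposed as
    // CUDA arrays (frame.pArray[i]) or pitched pointers (frame.pPitch[i]);
    // a CUDA kernel can read them directly, with no extra copy.
    // launch_my_kernel(&egl_frame);  // hypothetical user kernel

    cuGraphicsUnregisterResource(resource);
    NvDestroyEGLImage(egl_display, egl_image);
}
```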
Ideally the invocation would be something like:
```
ffmpeg -c:v h264_nvmpi -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 \
  -filter_complex "[0:v]scale_cuda=w=1200:h=1200" \
  -c:v h264_nvmpi out.mp4
```
I am still exploring whether this would be possible by modifying the jetson-ffmpeg nvmpi codec implementation to use NvVideoConverter/EGLImage (?) as per the example code and hand the decoded frame to ffmpeg as CUDA (and the reverse for encoding).
Please (!) let me know if I am completely off track...
@vectronic you are correct. Try looking at the GStreamer Jetson code implementation, although it might not be open-sourced by NVIDIA yet. GStreamer does scaling of the video pretty well on Jetson, but it is just horrific to work with compared to FFmpeg. Also, while using GStreamer I noticed multiple issues with hardware encoding where the video frames would randomly contain blocks of previous frames, possibly due to erroneous memory management.
> Try looking at the GStreamer Jetson code implementation, although it might not be open-sourced by NVIDIA yet
The source for gst-nvvideo4linux2 can be found here:
- https://github.com/Extend-Robotics/gst-nvvideo4linux2

The README.md has instructions on how it was extracted. The nvvidconv scaling code can be extracted in the same way (look for gst-nvvidconv_src.tbz2), though I'm not sure if it will be useful in any way.
User-level NVIDIA GStreamer docs with scaling pipelines are here:
- https://docs.nvidia.com/jetson/archives/r35.2.1/DeveloperGuide/text/SD/Multimedia/AcceleratedGstreamer.html
The GStreamer nvvidconv "VIC" hardware path is the same one I linked in the post above (same hardware, accessed through a different layer of software). There is also a CUDA path example.
The pull request mentioned earlier with scaling on the decoder
- https://github.com/Keylost/jetson-ffmpeg/pull/8

seems to be using this API:
- https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_transform_unit_sample.html

which looks to be another way to access the VIC, apart from GStreamer nvvidconv and the Jetson Multimedia API NvVideoConverter.
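For what it's worth, the core of that transform path is quite small. A sketch assuming nvbuf_utils.h (scale_on_vic is a hypothetical helper; the exact parameters the pull request uses may differ):

```c
// Sketch: scaling a decoded frame on the VIC via NvBufferTransform.
// src_fd and dst_fd are dmabuf fds; dst_fd must have been created at
// the target resolution (e.g. with NvBufferCreateEx). Error checks trimmed.
#include <string.h>
#include <nvbuf_utils.h>

int scale_on_vic(int src_fd, int dst_fd)
{
    NvBufferTransformParams params;
    memset(&params, 0, sizeof(params));

    // Only the filter flag is set; src/dst rects default to full frames,
    // so the scale factor comes from the two buffers' resolutions.
    params.transform_flag   = NVBUFFER_TRANSFORM_FILTER;
    params.transform_filter = NvBufferTransform_Filter_Smart;

    // The VIC reads src_fd and writes dst_fd directly: no CPU copy.
    return NvBufferTransform(src_fd, dst_fd, &params);
}
```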
> Please (!) let me know if I am completely off track...
I believe your understanding is correct.
A few notes at the same time:
Hardware paths
- PVA/VIC (dedicated chipset)
- iGPU (and CUDA)
See the Jetson AGX Orin technical brief; the Nano also has a VIC.
The PVA/VIC is dedicated hardware for image processing. If it fits the use case, I would prefer it over CUDA for better power efficiency and to leave the GPU free for other tasks. You can access it through GStreamer nvvidconv and the Jetson Multimedia API. The jetson-ffmpeg pull request I mentioned earlier seems to be using it for scaling as well.
GStreamer vs FFmpeg
NVIDIA maintains GStreamer support for Jetson.
At the same time, it has changed the APIs a few times, broken older functionality, and introduced performance regressions with the same code on newer platforms/L4T releases.
For FFmpeg this means the community struggles to keep the support functional.
If GStreamer works for your use case on Jetson, prefer it over FFmpeg; at least NVIDIA takes responsibility for making it work and maintaining it in the long run.
I understand the need for FFmpeg on Jetson here; I need it myself.
Oh, hey guys. I was just working on this. Let me explain.
The Jetson multimedia engine returns decoded frames as a DMA file descriptor, essentially an NvBuffer, which contains a block-linear NV12 image. Ideally, this hardware handle would be the output, and other Jetson filters would work directly on this NvBuffer. It is not a CUDA device pointer, but it can be mapped to a texture that can almost be read by CUDA kernels (as long as the kernels handle any necessary NV12->NV12_ER conversion manually). The VIC, though, can operate directly on these NvBuffers with no problem.
Perhaps, like the cuvid decoder has the -hwaccel_output_format cuda option, we could add -hwaccel_output_format options like nvbuffer, texture, and/or cuda. The latter would be compatible with existing CUDA filters, while the former would be optimal if new filters were created that work directly with the NvBuffer or texture. Those new filters could then use VPI to dispatch to the VIC (or to the PVA or GPU if specified via a filter option).
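To make that concrete, here is a sketch of what such a VPI-backed scale filter might do internally. This assumes VPI 2.x; vpi_scale_demo and the fixed sizes are illustrative only, and a real filter would wrap the existing NvBuffers (rather than allocate fresh images) and pick the backend from a filter option:

```c
// Hedged sketch: rescale dispatched to the VIC through VPI.
// Error handling omitted for brevity.
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/Rescale.h>

void vpi_scale_demo(void)
{
    VPIStream stream = NULL;
    vpiStreamCreate(0, &stream);

    // NV12_ER matches what the decoder-side conversion would produce.
    VPIImage in = NULL, out = NULL;
    vpiImageCreate(1920, 1080, VPI_IMAGE_FORMAT_NV12_ER, 0, &in);
    vpiImageCreate(1280,  720, VPI_IMAGE_FORMAT_NV12_ER, 0, &out);

    // Submit the rescale to the VIC backend; VPI_BACKEND_PVA or
    // VPI_BACKEND_CUDA could be selected via a filter option instead.
    vpiSubmitRescale(stream, VPI_BACKEND_VIC, in, out,
                     VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0);
    vpiStreamSync(stream);

    vpiImageDestroy(in);
    vpiImageDestroy(out);
    vpiStreamDestroy(stream);
}
```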
The current state though, as you point out, is that nvmpi copies to CPU memory, so a hwupload is required to use CUDA filters.
My PR you linked, which adds a -resize option to the decoder, was easy (I didn't have to add new interfaces) and cheap (I just modified an existing VIC operation). But this only works if all consumers want the resized video, so it's not a complete solution. Ideally we would output the dmabuf/NvBuffer hardware handle, and filters specialized for Jetson platforms would use it.