Multi-Plexer
CUDA-accelerated tonemap filter
The initial idea of using POCL as a CUDA translation layer isn't viable, because POCL doesn't work with image formats on its CUDA backend.
Currently reaching out to Yaroslav Pogrebnyak, the developer of the vf_overlay_cuda FFmpeg filter.
Also reaching out to nyanmisaka, who seems to have a lot of experience working on FFmpeg filters and frankly knows more than I do.
In addition, collaborating with Ed Borasky to confirm that it works on Jetson platforms.
vf_tonemap_cuda.txt (renamed from .c to .txt to make GitHub happy)
Missing: tonemap.cu with the proper kernel-side code. This is easy once I know how to properly call the CUDA kernel side from the FFmpeg side.
Standard stride blocks should work, with the total number of blocks defined by the frame height. Most resolutions will be 16:9, where the height is typically a multiple of 9 (720, 1080, 2160), so using the height gives a good chance of dividing cleanly by 3 and lets us take advantage of CUDA's built-in data types.
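A host-side C sketch of that layout, assuming packed 8-bit RGB (3 bytes per pixel), one block per row so the grid size equals the frame height, and a hypothetical THREADS_PER_BLOCK of 256. Plain loops stand in for blockIdx.x and threadIdx.x so the indexing can be checked on a CPU; the function names are illustrative, and the placeholder inversion stands in for the real per-byte tonemapping. The actual kernel would live in tonemap.cu and be launched from the FFmpeg side through the CUDA driver API's cuLaunchKernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size; a stride loop is correct for any value. */
#define THREADS_PER_BLOCK 256

/* What one CUDA thread would do: on the GPU, row = blockIdx.x and
 * tid = threadIdx.x. The thread touches every THREADS_PER_BLOCK-th
 * byte of its row, starting at its own index. */
static void row_stride_thread(uint8_t *frame, size_t pitch, int width,
                              int row, int tid)
{
    const int row_bytes = width * 3; /* packed R, G, B: 3 bytes per pixel */
    for (int i = tid; i < row_bytes; i += THREADS_PER_BLOCK) {
        uint8_t *p = frame + (size_t)row * pitch + (size_t)i;
        *p = (uint8_t)(255 - *p); /* placeholder per-byte operation (invert) */
    }
}

/* Serial stand-in for a launch with gridDim.x = height (one block per
 * row) and blockDim.x = THREADS_PER_BLOCK. */
static void launch_rows_emulated(uint8_t *frame, size_t pitch,
                                 int width, int height)
{
    for (int row = 0; row < height; row++)                /* blockIdx.x  */
        for (int tid = 0; tid < THREADS_PER_BLOCK; tid++) /* threadIdx.x */
            row_stride_thread(frame, pitch, width, row, tid);
}
```

Because each thread strides by the block size, the same code covers rows wider or narrower than the block, and any padding between width × 3 and the pitch is left untouched.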
The other option is to work per pixel, taking the R, G and B values of a given pixel, which are guaranteed to come in groups of 3. This might also help other tonemapping algorithms that use the relative offset from the local peak luma as an input to the tonemapping output.
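A C sketch of per-pixel indexing in packed RGB: pixel x starts at byte 3 * x, so its R, G and B values are always a complete triple. It also shows one hypothetical way to express a pixel's luma relative to the local peak, using a window of ±radius pixels along the row. The BT.709 luma weights and the one-dimensional window are assumptions, since the notes don't specify either.

```c
#include <stddef.h>
#include <stdint.h>

/* BT.709 luma weights (an assumption; the colour space isn't specified). */
static float luma_709(const uint8_t *px)
{
    return 0.2126f * px[0] + 0.7152f * px[1] + 0.0722f * px[2];
}

/* For pixel x in a packed RGB row, return its luma divided by the peak
 * luma within +/- radius pixels (the "local peak"). The 3-byte stride
 * guarantees each pointer lands on an R,G,B triple. */
static float luma_relative_to_local_peak(const uint8_t *row, int width,
                                         int x, int radius)
{
    int lo = x - radius < 0 ? 0 : x - radius;
    int hi = x + radius >= width ? width - 1 : x + radius;
    float peak = 0.0f;
    for (int i = lo; i <= hi; i++) {
        float y = luma_709(row + 3 * (size_t)i);
        if (y > peak)
            peak = y;
    }
    float y = luma_709(row + 3 * (size_t)x);
    return peak > 0.0f ? y / peak : 0.0f; /* 1.0 means x is the local peak */
}
```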
Major bodge, but the new version of /cuda_filter/vf_scale_cuda.cu should be able to do Reinhard tonemapping. The following is an extract from the ffmpeg-devel mailing list relevant to this:
For ease of development, I've kept everything the same, including the name of the filter, and only changed the function within the file. This is very much a bodge to facilitate development. As such, for testing, this file should replace ffmpeg/libavfilter/vf_scale_cuda.cu.
FFmpeg should then be compiled as standard for CUDA filters, and the filter called as you would call the standard scale_cuda filter. The command would be similar to: ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf scale_cuda=Source_width:Source_Height -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
The above should decode in hardware, tonemap the frame on the GPU and re-encode in hardware at the given bitrate.
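For reference, a minimal C sketch of the simple global Reinhard operator, L_out = L / (1 + L), applied to luminance and then used to scale all three channels equally. This is not the code in vf_scale_cuda.cu; it assumes linear-light float RGB where 1.0 is reference white, plus BT.709 luma weights. Real HDR sources (PQ or HLG) would first need their transfer function undone, which this sketch skips.

```c
/* One linear-light RGB pixel; values above 1.0 are HDR highlights. */
typedef struct { float r, g, b; } rgbf;

static rgbf reinhard_pixel(rgbf in)
{
    const float lum = 0.2126f * in.r + 0.7152f * in.g + 0.0722f * in.b;
    if (lum <= 0.0f)
        return (rgbf){0.0f, 0.0f, 0.0f}; /* black stays black */
    const float lum_out = lum / (1.0f + lum); /* Reinhard: maps [0, inf) into [0, 1) */
    const float scale = lum_out / lum;        /* same factor for every channel */
    return (rgbf){in.r * scale, in.g * scale, in.b * scale};
}
```

Scaling the channels by one shared factor keeps their ratios, and therefore the hue; applying L / (1 + L) to each channel separately is a common alternative that can shift hue and saturation in bright, saturated areas.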
Switched to using the overlay filter as the base instead of scale; it seems much better for my purposes. Reach out to @znmeb to ask for a test on his side.
The syntax will be: ffmpeg -i INPUT -i INPUT -filter_complex 'hwupload_cuda,overlay_cuda' OUTPUT
This is a bodge for now, since it only modifies the output of the CUDA kernel itself.
To use it, replace ffmpeg/libavfilter/vf_overlay_cuda.cu with this file: https://github.com/Camofelix/Jetson_ffmpeg_trancode_cluster/blob/master/cuda_filter/vf_overlay_cuda.cu
It compiles fine, but I can't test actual usage without a Jetson Nano.
It requires building FFmpeg yourself:
fetch the FFmpeg source code (e.g. git clone https://git.ffmpeg.org/ffmpeg.git)
make clean
./configure --enable-nonfree --enable-cuda
mv ~/path/to/new/file ~/path/to/ffmpeg/libavfilter/vf_overlay_cuda.cu
make -j
get coffee
ffmpeg -i INPUT -i INPUT -filter_complex 'hwupload_cuda,overlay_cuda' OUTPUT
ffplay OUTPUT
Is the output different?
test file: https://4kmedia.org/lg-new-york-hdr-uhd-4k-demo/
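One way to answer "is the output different?" with a number rather than by eye: decode the same frame from the stock build and the modified build (for example as raw rgb24 or yuv420p) and compare the buffers byte by byte. The helper below is a hypothetical sketch; for whole files, FFmpeg's built-in psnr and ssim filters do a more thorough job.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t differing; /* number of bytes that differ */
    double mean_abs;  /* mean absolute difference over all bytes, 0..255 */
} frame_diff;

/* Compare two equally sized raw frames of n bytes. */
static frame_diff compare_frames(const uint8_t *a, const uint8_t *b, size_t n)
{
    frame_diff d = {0, 0.0};
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int delta = (int)a[i] - (int)b[i];
        if (delta != 0) {
            d.differing++;
            sum += (unsigned long long)(delta < 0 ? -delta : delta);
        }
    }
    if (n > 0)
        d.mean_abs = (double)sum / (double)n;
    return d;
}
```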
Further update: I'm currently concerned about whether the Jetson will be able to use the standard library of CUDA filters in tandem with hardware decoding and encoding, ideally without making too many memory copies.
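To put the memory-copy concern in numbers: an 8-bit 4:2:0 frame (yuv420p or nv12) is a full-size luma plane plus two half-width, half-height chroma planes, and each hwupload or hwdownload in a filter chain copies a whole frame. The helper names below are hypothetical.

```c
#include <stdint.h>

/* Bytes in one 8-bit 4:2:0 frame: a full-size luma plane plus two chroma
 * planes at half width and half height (rounded up for odd sizes). */
static uint64_t yuv420p_frame_bytes(uint32_t width, uint32_t height)
{
    uint64_t luma = (uint64_t)width * height;
    uint64_t chroma = (uint64_t)((width + 1) / 2) * ((height + 1) / 2);
    return luma + 2 * chroma;
}

/* Bytes per second moved by `copies` whole-frame copies per frame
 * (e.g. 2 for one hwupload ... hwdownload round trip). */
static uint64_t copy_bytes_per_second(uint32_t width, uint32_t height,
                                      uint32_t fps, uint32_t copies)
{
    return yuv420p_frame_bytes(width, height) * fps * copies;
}
```

For example, a 1080p stream at 24 fps pushed through one hwupload/hwdownload round trip moves about 149 MB/s (3,110,400 bytes × 24 × 2), and 4K at 60 fps about 1.5 GB/s.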
In my tests, OpenCL is necessary; the CUDA-accelerated filters are usable.
You just need to:
- git clone nv-codec-headers (and install it)
- build FFmpeg with --enable-cuda --enable-cuda-nvcc
- enjoy
However, the only usable filters are scale_cuda and yadif_cuda 😅 Maybe I need to backport some features, like tonemap_cuda.
I haven't looked at this project in a while; I've been working on other GPGPU-related things.
Have the CUDA filters been ported to work on the Nano?
I don't know. I just tested it and finally got it to build successfully.


The speed... 😅 Not too bad, and the GPU usage is low. H.264 1080p -> H.265 720p at 6 Mbps.
Command:
sudo ffmpeg -init_hw_device cuda=gpu:0.0 -filter_hw_device gpu -c:v h264_nvmpi -i /mnt/source/Paprika.2006.JAPANESE.1080p.BluRay.x264.DTS-FGT.mkv -vf "format=yuv420p,hwupload,scale_cuda=1280:720,hwdownload,format=yuv420p" -c:v hevc_nvmpi -b:v 6000k -preset medium -profile:v high -acodec ac3 output.mp4
I should also mention that my device is throttled due to low current and voltage 😅 The power supply is utter garbage.
cuvid is unusable because libnvcuvid.so.1 is missing. Due to the special architecture of the Jetson (?), things are quite different.