jetson-ffmpeg

Poor encoding/transcoding performance (AVC/HEVC)

Open cizekmilan opened this issue 2 years ago • 10 comments

Hello, can anyone provide what looks like the load of his GPU utilization in jtop?

I have two transcoding tests, and in both the GPU shows little use. (Encoding alone gives me the same poor results.)

ffmpeg -y -benchmark -c:v h264_nvmpi -i "./samples/CT-Sport-AVC-1080p50-v.ts" -c:v hevc_nvmpi test_hevc.ts
ffmpeg -y -benchmark -c:v hevc_nvmpi -i "./samples/CT24-HEVC-1080p50-v.ts" -c:v h264_nvmpi test_h264.ts

https://snipboard.io/m5B3xv.jpg https://snipboard.io/m6LOzP.jpg https://snipboard.io/1xlSK8.jpg

Is it normal? I'm disappointed with too low performance on my Jetson Nano.

ffmpeg version c9f3835 Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 7 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)
configuration: --enable-static --pkg-config-flags=--static --disable-shared --prefix=/tmp/stream_install/ffmpeg-build-static --extra-cflags='-I /tmp/stream_install/ffmpeg-build-static/include -I /usr/local/cuda/include/' --extra-ldflags='-L /tmp/stream_install/ffmpeg-build-static/lib -L /usr/local/cuda/lib64/' --enable-gpl --enable-nonfree --enable-libfdk-aac --enable-nvmpi --enable-cuda-nvcc --enable-libnpp --enable-pthreads --enable-runtime-cpudetect --enable-filter=drawtext --enable-libfreetype

Inputs:

mediainfo CT-Sport-AVC-1080p50-v.ts

General
ID                                       : 1 (0x1)
Complete name                            : CT-Sport-AVC-1080p50-v.ts
Format                                   : MPEG-TS
File size                                : 258 MiB
Duration                                 : 4 min 58 s
Overall bit rate mode                    : Variable
Overall bit rate                         : 7 244 kb/s

Video
ID                                       : 256 (0x100)
Menu ID                                  : 1 (0x1)
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : [email protected]
Format settings                          : CABAC / 4 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 4 frames
Codec ID                                 : 27
Duration                                 : 4 min 58 s
Bit rate                                 : 6 886 kb/s
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Frame rate mode                          : Variable
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Stream size                              : 245 MiB (95%)
Color range                              : Limited
Color primaries                          : BT.709
Transfer characteristics                 : BT.709
Matrix coefficients                      : BT.709

Menu
ID                                       : 4096 (0x1000)
Menu ID                                  : 1 (0x1)
Duration                                 : 4 min 58 s
List                                     : 256 (0x100) (AVC)
Service name                             : Service01
Service provider                         : FFmpeg
Service type                             : digital television

mediainfo CT24-HEVC-1080p50-v.ts

General
ID                                       : 1 (0x1)
Complete name                            : CT24-HEVC-1080p50-v.ts
Format                                   : MPEG-TS
File size                                : 141 MiB
Duration                                 : 4 min 58 s
Overall bit rate mode                    : Variable
Overall bit rate                         : 3 962 kb/s

Video
ID                                       : 256 (0x100)
Menu ID                                  : 1 (0x1)
Format                                   : HEVC
Format/Info                              : High Efficiency Video Coding
Format profile                           : [email protected]@Main
Codec ID                                 : 36
Duration                                 : 4 min 58 s
Bit rate                                 : 3 767 kb/s
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
Frame rate                               : 50.000 FPS
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Bits/(Pixel*Frame)                       : 0.036
Stream size                              : 134 MiB (95%)
Color range                              : Limited
Color primaries                          : BT.709
Transfer characteristics                 : BT.709
Matrix coefficients                      : BT.709

Menu
ID                                       : 4096 (0x1000)
Menu ID                                  : 1 (0x1)
Duration                                 : 4 min 58 s
List                                     : 256 (0x100) (HEVC)
Service name                             : Service01
Service provider                         : FFmpeg
Service type                             : digital television

cizekmilan avatar Aug 18 '21 22:08 cizekmilan

> Can anyone provide what looks like the load of his GPU utilization in jtop?

In your second picture, the NVENC and NVDEC are showing as running. https://snipboard.io/m6LOzP.jpg

The NVENC and NVDEC usage does not show up on the GPU view of jtop. Your ffmpeg is working, and you're using hardware acceleration for both decode and encode.

> Is it normal? I'm disappointed with too low performance on my Jetson Nano.

What specifically are you disappointed with, and what were you expecting? framerate? encode quality? something else?

grantthomas avatar Aug 18 '21 23:08 grantthomas

Hello,

> What specifically are you disappointed with, and what were you expecting? framerate? encode quality? something else?

You are right. I'm probably getting even higher HEVC performance than stated in the specification. https://www.stereolabs.com/blog/h-264-h-265-video-encoding-support-matrix-for-nvidia-jetson/

I expected 4x FullHD H.264@30fps (total 30*4 = 120 fps). Result:
frame=14936 fps=126 q=-0.0 Lsize= 132561kB time=00:04:58.66 bitrate=3636.1kbits/s speed=2.51x

I expected 1x HEVC@30fps (total 30*1 = 30 fps). Result:
frame=14940 fps=109 q=-0.0 Lsize= 131045kB time=00:04:58.58 bitrate=3595.4kbits/s speed=2.19x

According to the results, I can't complain. I was confused by the jtop output, which made me suspect that not everything was working properly.
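As a rough cross-check of the numbers above, the sketch below parses an ffmpeg progress line and converts the reported fps into equivalent 30 fps streams. The helper names (`parse_progress`, `equivalent_streams`) are made up for illustration, and the 50 fps source rate is an assumption taken from the sample file names and the mediainfo output in the first post.

```python
import re

# Progress line copied from the first result in the post above.
LINE = ("frame=14936 fps=126 q=-0.0 Lsize= 132561kB "
        "time=00:04:58.66 bitrate=3636.1kbits/s speed=2.51x")


def parse_progress(line: str) -> dict:
    """Split an ffmpeg progress line into key/value strings."""
    return dict(re.findall(r"(\w+)=\s*(\S+)", line))


def equivalent_streams(total_fps: float, stream_fps: float = 30.0) -> float:
    """Number of real-time streams at stream_fps that total_fps would cover."""
    return total_fps / stream_fps


if __name__ == "__main__":
    fields = parse_progress(LINE)
    fps = float(fields["fps"])
    print(f"{equivalent_streams(fps):.1f} x 30 fps streams")
    # Assumption: the source is 50 fps, so fps / 50 should land close to
    # the speed= value that ffmpeg itself reports.
    print(f"speed estimate: {fps / 50:.2f}x (ffmpeg reported {fields['speed']})")
```

Dividing by 30 only gives a ballpark figure: one fast transcode is not the same workload as several parallel real-time sessions.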

Thank you.

cizekmilan avatar Aug 18 '21 23:08 cizekmilan

No problem.

Keep in mind that the Nano is built on the Maxwell architecture, and so will be limited to that generation's feature set and quality per bitrate.

https://en.wikipedia.org/wiki/CUDA

If you find that the quality for the bitrate is low, you can try the Jetson NX, which I have moved to because of quality issues.

You can spin up a Google Colab instance and try NVENC on their (usually) V100 GPUs for free to see what improvements it gives you. The V100 and the NX share the Volta architecture (I think that's right).

grantthomas avatar Aug 18 '21 23:08 grantthomas

I have one more question. It seems to me that the bitrate setting is ignored, compared to other codecs. I define, for example:

HEVC: -profile:v main -level:v 4.1 -minrate 4000k -b:v 4500k -maxrate 6000k
H264: -profile:v high -level:v 4 -minrate 4000k -b:v 4500k -maxrate 6000k

But the resulting bitrate averages around 8 Mbit/s. The resulting file is, for example, 2x larger (both H.264 and HEVC). This is what the file size comparison across codecs with the parameters defined above looks like: https://snipboard.io/h8y5De.jpg (you can see the size difference). Thank you.
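One way to quantify such a mismatch is to derive the average bitrate from file size and duration and compare it with the requested `-b:v`. The sketch below is illustrative only: the function names are hypothetical, and it assumes that ffmpeg's "kB" in the progress line means 1024 bytes. That assumption is consistent with the earlier result line (132561kB over 00:04:58.66, reported as 3636.1 kbits/s), which is used here only to sanity-check the formula.

```python
def hms_to_seconds(timestamp: str) -> float:
    """Convert an ffmpeg HH:MM:SS.ss timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def average_kbps(size_kib: float, duration_s: float) -> float:
    """Average bitrate in kbit/s for a file of size_kib (1024-byte units)."""
    return size_kib * 1024 * 8 / duration_s / 1000


def overshoot(measured_kbps: float, target_kbps: float) -> float:
    """Ratio of measured to requested bitrate (1.0 means on target)."""
    return measured_kbps / target_kbps


if __name__ == "__main__":
    duration = hms_to_seconds("00:04:58.66")
    measured = average_kbps(132561, duration)
    print(f"average bitrate: {measured:.1f} kbit/s")
    # Compare the roughly 8 Mbit/s described above with the 4500k target.
    print(f"overshoot at 8000 vs 4500 kbit/s: {overshoot(8000, 4500):.2f}x")
```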

cizekmilan avatar Aug 19 '21 22:08 cizekmilan

In some testing I did a few days ago, different commits seemed to behave differently with regard to bitrate. Two questions:

  • Do you know the commit of ffmpeg and jetson-ffmpeg you used to compile your running version?
  • Are you able to share your testing videos?

For example, in one of my test encodes, one build operated correctly at -b:v 120k. On a different build, I needed to specify -b:v 1200k for it to encode at 120k.

On my Jetson NX, this command:

ffmpeg -r 10 -c:v hevc_nvmpi -i ~/input-yuv.mp4 -c:v h264_nvmpi -minrate 85k -b:v 120k -maxrate 128k -g 30 -y output-yuv420.mkv

produced this:

Complete name                            : output-yuv420.mkv
Format                                   : Matroska
Format version                           : Version 4 / Version 2
File size                                : 1.36 MiB
Duration                                 : 5 min 16 s
Overall bit rate                         : 36.0 kb/s
Movie name                               : Media Server
Writing application                      : Lavf58.29.100
Writing library                          : Lavf58.29.100
ErrorDetectionType                       : Per level 1

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : [email protected]
Format settings                          : CABAC / 4 Ref Frames
Format settings, CABAC                   : Yes
Format settings, ReFrames                : 4 frames
Format settings, GOP                     : M=1, N=30
Codec ID                                 : V_MPEG4/ISO/AVC
Duration                                 : 5 min 16 s
Bit rate                                 : 35.3 kb/s
Width                                    : 480 pixels
Height                                   : 640 pixels
Display aspect ratio                     : 0.562
Original display aspect ratio            : 0.750
Frame rate mode                          : Constant
Frame rate                               : 10.000 FPS
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.011
Stream size                              : 1.33 MiB (98%)
Writing library                          : Lavc58.54.100 h264_nvmpi
Default                                  : Yes
Forced                                   : No
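
To compare a report like the one above against the requested `-b:v`, the video bitrate can be pulled out of the mediainfo text programmatically. This is only a sketch: `mediainfo_field` and `parse_kbps` are hypothetical helpers, and the sample text is a trimmed copy of the report above. It assumes section headers are the lines without a colon.

```python
import re

# Trimmed copy of the mediainfo report above.
SAMPLE = """\
Complete name                            : output-yuv420.mkv
Overall bit rate                         : 36.0 kb/s

Video
ID                                       : 1
Format                                   : AVC
Bit rate                                 : 35.3 kb/s
"""


def mediainfo_field(report: str, section: str, key: str) -> str:
    """Return the value of key inside the named section of a text report."""
    current = None
    for line in report.splitlines():
        if ":" not in line:
            if line.strip():
                current = line.strip()  # a bare line starts a new section
            continue
        name, _, value = line.partition(":")
        if current == section and name.strip() == key:
            return value.strip()
    raise KeyError(f"{section}/{key} not found")


def parse_kbps(value: str) -> float:
    """Convert strings such as '35.3 kb/s' or '7 244 kb/s' to a float."""
    match = re.fullmatch(r"([\d .]+?)\s*kb/s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised bit rate: {value!r}")
    return float(match.group(1).replace(" ", ""))


if __name__ == "__main__":
    measured = parse_kbps(mediainfo_field(SAMPLE, "Video", "Bit rate"))
    requested = 120.0  # from -b:v 120k in the command above
    print(f"measured/requested = {measured / requested:.2f}")
```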

grantthomas avatar Aug 19 '21 22:08 grantthomas

Hi, here is my shared folder with all results: https://bit.ly/3k29qKg. Please check the transcoding script DP_cpu-nvmpi-test.sh (I deliberately do not use hardware decoding here). In some cases I deinterlace on the CPU.

Input files are located in ./samples. Transcoded files in ./transcoded directory.

build_ffmpeg_jetson.sh is my script for building a custom static ffmpeg binary. You can run it with the -t argument to test that the nvmpi encoders and decoders work. The reported results were produced with a build of today's versions.

cizekmilan avatar Aug 19 '21 23:08 cizekmilan

Thanks a bunch.

I'm not one of the devs or contributors to this project, but I'll try to take a look over the weekend and see if I can figure anything out.

grantthomas avatar Aug 20 '21 16:08 grantthomas

I can say that the Jetson Nano 2GB is the cheapest and fastest machine, for only $50. You can build a transcode cluster😁.

Here, 1080p 30 fps H264 60 Mbps -> HEVC 8 Mbps with preset=slow gets about 3.5x speed, and ultrafast gets about 3.8x (because the encoder's max speed is 4x). NVDEC and NVENC are both running at 716 MHz.

YOU MUST MAKE SURE THAT THE POWER SUPPLY DELIVERS ENOUGH CURRENT and the power mode is MAXN!

Besides, I am thinking about how to use CUDA for decode/encode to speed it up.

AnterCreeper avatar Nov 14 '21 15:11 AnterCreeper

I'm seeing extremely poor performance as well, on Xavier NX.

The attraction of using nvmpi for us was to offload some of the decoder load. So the expectation was to see lower CPU utilization when nvmpi is used. However, this doesn't seem to be the case.

Running with the following command (using RTSP to generate a constant fps / simulation of real-time load):

./bin/ffmpeg -i rtsp://10.10.101.12/30fps.mkv -f null - -benchmark

tegrastats output looks like this:

RAM 1480/7761MB (lfb 706x4MB) CPU [12%@1190,10%@1190,12%@1190,13%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150 MTS fg 0% bg 1% [email protected] GPU@41C PMIC@100C AUX@41C CPU@43C [email protected] VDD_IN 3632/3611 VDD_CPU_GPU_CV 473/463 VDD_SOC 1103/1103
RAM 1480/7761MB (lfb 706x4MB) CPU [12%@1190,14%@1190,12%@1190,12%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150 MTS fg 0% bg 1% AO@41C GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 3632/3611 VDD_CPU_GPU_CV 473/463 VDD_SOC 1103/1103
RAM 1480/7761MB (lfb 706x4MB) CPU [16%@1190,14%@1190,14%@1190,12%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150 MTS fg 0% bg 1% AO@41C GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 3672/3611 VDD_CPU_GPU_CV 513/463 VDD_SOC 1103/1103
RAM 1480/7761MB (lfb 706x4MB) CPU [13%@1190,14%@1190,14%@1190,13%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150 MTS fg 0% bg 0% AO@41C GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 3632/3611 VDD_CPU_GPU_CV 473/463 VDD_SOC 1103/1103
RAM 1480/7761MB (lfb 706x4MB) CPU [9%@1190,12%@1190,12%@1190,14%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150 MTS fg 0% bg 1% AO@41C GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 3632/3611 VDD_CPU_GPU_CV 473/463 VDD_SOC 1103/1103

You can see that CPU utilization is well under 20% for all the CPUs. Adding -vcodec h264_nvmpi causes the CPU utilization to rise dramatically:

RAM 1501/7761MB (lfb 706x4MB) CPU [27%@1190,33%@1190,26%@1190,24%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 1% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4218/3621 VDD_CPU_GPU_CV 709/469 VDD_SOC 1338/1105
RAM 1501/7761MB (lfb 706x4MB) CPU [26%@1190,24%@1190,21%@1190,22%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 2% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4139/3622 VDD_CPU_GPU_CV 630/470 VDD_SOC 1338/1106
RAM 1501/7761MB (lfb 706x4MB) CPU [26%@1190,23%@1190,20%@1190,22%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 1% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4139/3624 VDD_CPU_GPU_CV 630/470 VDD_SOC 1338/1106
RAM 1501/7761MB (lfb 706x4MB) CPU [27%@1190,24%@1190,21%@1190,25%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 1% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4139/3625 VDD_CPU_GPU_CV 670/470 VDD_SOC 1338/1107
RAM 1501/7761MB (lfb 706x4MB) CPU [15%@1190,15%@1190,25%@1190,40%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 2% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4139/3626 VDD_CPU_GPU_CV 670/471 VDD_SOC 1338/1107
RAM 1501/7761MB (lfb 706x4MB) CPU [16%@1190,15%@1190,30%@1190,37%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 NVENC 115 NVENC1 115 APE 150 MTS fg 0% bg 2% [email protected] GPU@41C PMIC@100C [email protected] CPU@43C [email protected] VDD_IN 4139/3627 VDD_CPU_GPU_CV 630/471 VDD_SOC 1338/1108

What am I missing here?

EDIT: it is worth mentioning that a similar GStreamer pipeline shows much better results:

gst-launch-1.0 rtspsrc location=rtsp://10.10.101.12/30fps.mkv ! rtph264depay ! h264parse ! nvv4l2decoder ! nvvidconv ! video/x-raw, format=RGBA ! fakesink

RAM 1469/7761MB (lfb 706x4MB) CPU [5%@1190,4%@1190,9%@1190,6%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 3% [email protected] [email protected] PMIC@100C AUX@40C [email protected] [email protected] VDD_IN 3632/3632 VDD_CPU_GPU_CV 355/355 VDD_SOC 1222/1222
RAM 1469/7761MB (lfb 706x4MB) CPU [6%@1190,3%@1190,2%@1190,3%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 1% [email protected] GPU@41C PMIC@100C [email protected] CPU@42C [email protected] VDD_IN 3593/3612 VDD_CPU_GPU_CV 315/335 VDD_SOC 1222/1222
RAM 1469/7761MB (lfb 706x4MB) CPU [5%@1190,4%@1190,3%@1190,5%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 1% [email protected] [email protected] PMIC@100C [email protected] [email protected] [email protected] VDD_IN 3593/3606 VDD_CPU_GPU_CV 315/328 VDD_SOC 1222/1222
RAM 1469/7761MB (lfb 706x4MB) CPU [5%@1190,3%@1190,1%@1190,3%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 1% [email protected] [email protected] PMIC@100C [email protected] [email protected] [email protected] VDD_IN 3593/3602 VDD_CPU_GPU_CV 315/325 VDD_SOC 1222/1222
RAM 1469/7761MB (lfb 706x4MB) CPU [4%@1190,2%@1190,1%@1190,2%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 1% [email protected] [email protected] PMIC@100C AUX@40C [email protected] [email protected] VDD_IN 3593/3600 VDD_CPU_GPU_CV 315/323 VDD_SOC 1222/1222
RAM 1469/7761MB (lfb 706x4MB) CPU [7%@1190,6%@1190,0%@1190,2%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 NVDEC 665 NVDEC1 665 APE 150 MTS fg 0% bg 3% [email protected] [email protected] PMIC@100C [email protected] [email protected] [email protected] VDD_IN 3593/3599 VDD_CPU_GPU_CV 315/321 VDD_SOC 1222/1222
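
To compare captures like the three above without reading every line, a small parser can pull out the per-core CPU load and any NVENC/NVDEC entries. This is a minimal sketch: `parse_tegrastats` is a hypothetical helper, the sample lines are shortened copies of the output above, and the number after NVENC/NVDEC is assumed to be the engine clock in MHz.

```python
import re

# Shortened copies of one line from each capture above (temperature and
# power fields dropped).
NO_ACCEL = ("RAM 1480/7761MB (lfb 706x4MB) CPU [12%@1190,10%@1190,12%@1190,"
            "13%@1190,off,off] EMC_FREQ 0%@1600 GR3D_FREQ 0%@114 APE 150")
WITH_NVENC = ("RAM 1501/7761MB (lfb 706x4MB) CPU [27%@1190,33%@1190,26%@1190,"
              "24%@1190,off,off] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 "
              "NVENC 115 NVENC1 115 APE 150")
GSTREAMER = ("RAM 1469/7761MB (lfb 706x4MB) CPU [5%@1190,4%@1190,9%@1190,"
             "6%@1190,off,off] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 "
             "NVDEC 665 NVDEC1 665 APE 150")


def parse_tegrastats(line: str) -> dict:
    """Extract average load of the online cores and active video engines."""
    cpu_block = re.search(r"CPU \[([^\]]*)\]", line)
    loads = []
    if cpu_block:
        # Cores reported as "off" carry no percentage and are skipped.
        loads = [int(p) for p in re.findall(r"(\d+)%@\d+", cpu_block.group(1))]
    engines = {name: int(value)
               for name, value in re.findall(r"\b(NVENC\d*|NVDEC\d*) (\d+)", line)}
    cpu_avg = sum(loads) / len(loads) if loads else 0.0
    return {"cpu_avg": cpu_avg, "engines": engines}


if __name__ == "__main__":
    for label, line in [("no accel", NO_ACCEL), ("h264_nvmpi", WITH_NVENC),
                        ("gstreamer", GSTREAMER)]:
        stats = parse_tegrastats(line)
        print(f"{label:>10}: cpu_avg={stats['cpu_avg']:.2f}% "
              f"engines={stats['engines']}")
```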

w3sip avatar Nov 19 '21 23:11 w3sip

> I'm seeing extremely poor performance as well, on Xavier NX.

> You can see that CPU utilization is well under 20% for all the CPUs. Adding -vcodec h264_nvmpi causes the CPU utilization to rise dramatically:

> What am I missing here?

Do you have your full command? You need to specify both decode and encode acceleration, and the way I read your comment above, you're only adding the parameters for one side. Looking at your tegrastats output, it seems you're using no acceleration in the first section, and in the second set you're using the NVENC encoder but not accelerating decoding. Your GStreamer output shows NVDEC being used but not the encoder, which makes sense with a raw output sink.

For example,

./bin/ffmpeg -c:v h264_nvmpi -i rtsp://10.10.101.12/30fps.mkv -c:v hevc_nvmpi -minrate 85k -b:v 120k -maxrate 128k AVC_Content.mkv

should use the AVC decoder (-c:v h264_nvmpi -i ...) and then encode with the HEVC encoder (-c:v hevc_nvmpi -minrate ...), both accelerated.
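
The point about option placement can be sketched in code: in ffmpeg, a `-c:v` given before `-i` selects the decoder for that input, while a `-c:v` given after it applies to the output. The helper below is hypothetical (the function name and defaults are not part of jetson-ffmpeg) and only assembles the argument list; it does not run anything.

```python
import shlex


def nvmpi_transcode_cmd(src, dst, decoder="h264_nvmpi", encoder="hevc_nvmpi",
                        rate_args=(), ffmpeg="ffmpeg"):
    """Build an ffmpeg argv with the decoder before -i and the encoder after."""
    return [
        ffmpeg,
        "-c:v", decoder,   # input option: which decoder reads src
        "-i", src,
        "-c:v", encoder,   # output option: which encoder writes dst
        *rate_args,
        dst,
    ]


if __name__ == "__main__":
    cmd = nvmpi_transcode_cmd(
        "rtsp://10.10.101.12/30fps.mkv",
        "AVC_Content.mkv",
        rate_args=("-minrate", "85k", "-b:v", "120k", "-maxrate", "128k"),
        ffmpeg="./bin/ffmpeg",
    )
    print(shlex.join(cmd))
```

Keeping the arguments as a list rather than one string avoids quoting problems if the command is later passed to `subprocess.run`.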

I prefer to take a look through jtop instead of tegrastats just because it's easier to see what's going on at a glance.

In the second picture from the first post, you can see that NVENC and NVDEC are both being utilized. (Image: jtop in use; source: jetson stats)

grantthomas avatar Nov 22 '21 15:11 grantthomas