vvenc GCC versus AOCC optimization


 	Total Frames |   Bitrate     Y-PSNR    U-PSNR    V-PSNR    YUV-PSNR   
-	      500    a   69383.0744   36.6916   39.6481   40.8076   37.5261
-finished @ Sat Jan 22 21:21:27 2022
+	      500    a   69383.2840   36.6916   39.6481   40.8076   37.5261
+finished @ Sun Jan 23 11:44:28 2022
 
-Total Time: 14645.331 sec. Fps(avg): 0.034 encoded Frames 500
+Total Time: 12097.595 sec. Fps(avg): 0.041 encoded Frames 500

GCC flags: -flto -O3 (log: 1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP22.266.log)

AOCC flags: -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization (log: 1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP22.266.log)

[Screenshot from 2022-01-23 17-22-49]

In other words, AOCC produced about 21% faster code than GCC for this encode (12097.6 s versus 14645.3 s total time).
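The 21% figure can be reproduced from the two Total Time values in the log diff above. The following is a minimal sketch in shell; the function names speedup_percent and time_saved_percent are made up for illustration and are not part of vvenc. It also shows that "21% faster" describes throughput, while the wall-clock time itself drops by roughly 17%.

```shell
#!/usr/bin/env bash
# Relative speed of two encoder runs, from the "Total Time" log values.

# speedup_percent <slow_seconds> <fast_seconds>
# Prints how much faster the second run is (throughput gain), in percent.
speedup_percent() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}

# time_saved_percent <slow_seconds> <fast_seconds>
# Prints the reduction in wall-clock time, in percent.
time_saved_percent() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (1 - fast / slow) * 100 }'
}

gcc_seconds=14645.331    # "Total Time" from the GCC log
aocc_seconds=12097.595   # "Total Time" from the AOCC log

echo "AOCC throughput gain: $(speedup_percent "$gcc_seconds" "$aocc_seconds")%"
echo "AOCC time saved:      $(time_saved_percent "$gcc_seconds" "$aocc_seconds")%"
```

LC_ALL=C is set for awk so that the decimal point is parsed and printed the same way regardless of the shell's locale.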

Jan 23 '22 18:01 1div0

Interesting. Are both bitstreams decodable? You are encoding with the DPH SEI enabled. Is it correctly reconstructed by the decoder?

It would be interesting to know where the difference comes from. Could you check whether the AOCC executable produces the same result with --SIMD=SCALAR? If not, there is an implementation problem somewhere. If yes, the difference is probably caused by some floating-point calculation instability, which would be annoying but acceptable.
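One way to make the "same result" check concrete is a byte-wise comparison of the two bitstreams. This is only a sketch: report_bitstreams is a hypothetical helper, and byte identity is a stricter test than matching bitrate and PSNR lines in the logs.

```shell
#!/usr/bin/env bash
# report_bitstreams <file_a> <file_b>
# Prints "bit-exact" (status 0) when both files are byte-identical,
# "different" (status 1) when they are not, and returns 2 when an
# input file cannot be read.
report_bitstreams() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "missing input" >&2
        return 2
    fi
    if cmp -s "$1" "$2"; then
        echo "bit-exact"
    else
        echo "different"
        return 1
    fi
}
```

Usage would look like `report_bitstreams gcc.266 aocc.266`, where both file names are placeholders for the bitstreams written by the two encoder builds.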

Jan 24 '22 05:01 adamjw24

All bitstreams with QP equal to 22, 27, 32, 37, 42, 47 are perfectly decodable with the VVdeC version 1.3.0.

I will restart the encoding with SIMD scalar and check the results later today.

Jan 24 '22 09:01 1div0

[peter.kovar@vmi728485 ~]$ VVenC.sh 
+ COMPILER=GCC
+ VERSION=8.5.0
+ CONFIGURATION=GCC/8.5.0
+ ENCODER=/usr/local/GCC/8.5.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/GCC/8.5.0
+ mkdir -p /home/peter.kovar/Video/VVC/GCC/8.5.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/GCC/8.5.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266

real	178m35,012s
user	1031m1,738s
sys	3m47,406s
[peter.kovar@vmi728485 ~]$ vim Scripts/VVenC.sh 
[peter.kovar@vmi728485 ~]$ VVenC.sh 
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=3840
+ VERTICAL=2160
+ SIZE=3840x2160
+ RATE=50
+ NAME=1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.yuv --Size 3840x2160 --framerate 50 --InputBitDepth 10 --QP 32 --SIMD=SCALAR --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266

real	144m1,126s
user	864m45,966s
sys	2m23,265s

€ diff -u '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/GCC/8.5.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log' '/run/user/1001/gvfs/sftp:host=düsseldorf.reflexion.tv/home/peter.kovar/Video/VVC/AOCC/3.2.0/1_CrowdRun_2160p50_CgrLevels_MASTER_SVTdec05_.QP32.266.log' > ~/"GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt"

Attachments: GCC 8.5.0 versus AOCC 3.2.0 comparison.diff.txt, screenshot from 2022-01-24 18-17-01

The AOCC-generated encoder was 24% faster, even with --SIMD=SCALAR!?
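The 24% can be checked against the `real` times printed by `time` above. Those use a decimal comma (178m35,012s), so they need a small conversion first. A sketch, with to_seconds as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# to_seconds <NNNmSS,fffs>
# Converts a bash `time` value such as "178m35,012s" to plain seconds.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        gsub(",", ".")           # decimal comma -> decimal point
        sub(/s$/, "")            # drop the trailing unit
        split($0, part, "m")     # part[1] = minutes, part[2] = seconds
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

gcc_real=$(to_seconds "178m35,012s")    # GCC 8.5.0, --SIMD=SCALAR
aocc_real=$(to_seconds "144m1,126s")    # AOCC 3.2.0, --SIMD=SCALAR

LC_ALL=C awk -v g="$gcc_real" -v a="$aocc_real" \
    'BEGIN { printf "GCC %.3f s, AOCC %.3f s, AOCC faster by %.1f%%\n",
             g, a, (g / a - 1) * 100 }'
```

With the two values from the transcript this gives 10715.012 s and 8641.126 s, a ratio of 1.24.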

Jan 24 '22 17:01 1div0

Thanks for checking conformance. Without SIMD the results seem to be the same, but the encoder does contain some floating-point SIMD code, so the difference seen with SIMD enabled is probably uncritical: most likely a floating-point operation influencing an encoding decision. I'll try to track it down sometime.

The speed-up really is impressive. Thanks for sharing! I'm actually surprised, because with the amount of manual optimization we did, I didn't think an architecture-optimizing compiler would matter so much. It would be interesting to see profiling data (e.g. with ENABLE_TIME_PROFILING) to get an idea of where AOCC was able to gain so much. I might have a look sometime.

We cannot really act on it though.

If you want to simplify your build process to utilize this, you can specify the target arch directly in the make-cmd as:

$ make clean
$ make release ... enable-arch=znver3

Jan 24 '22 18:01 adamjw24

AVX-512 is not utilized yet. I will try PGO this week and share the measured results.

Jan 24 '22 18:01 1div0

AVX2 brings at most 10% over SSE4.2, so I wouldn't get my hopes up for AVX-512.

If you find a way to automate PGO as a part of our CMake build process, feel free to make a pull request. Looking forward to the results.

Jan 24 '22 22:01 adamjw24

It is not easy.

ccmake ../../../../..

CCACHE_FOUND                     /usr/bin/ccache
CMAKE_ADDR2LINE                  /usr/bin/addr2line
CMAKE_AR                         /usr/bin/ar
CMAKE_BUILD_TYPE                 Debug
CMAKE_COLOR_MAKEFILE             ON
CMAKE_CXX_COMPILER               /opt/AMD/aocc-compiler-3.2.0/bin/clang++
CMAKE_CXX_COMPILER_AR            /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ar
CMAKE_CXX_COMPILER_RANLIB        /opt/AMD/aocc-compiler-3.2.0/bin/llvm-ranlib
CMAKE_CXX_FLAGS                  -march=znver3 -flto -Ofast -mllvm -enable-strided-vectorization
CMAKE_CXX_FLAGS_DEBUG            -g -fprofile-instr-generate
CMAKE_CXX_FLAGS_MINSIZEREL       -Os -DNDEBUG
CMAKE_CXX_FLAGS_PROFILE          -O0 -fprofile-instr-generate
CMAKE_CXX_FLAGS_RELEASE          -O3 -DNDEBUG -fprofile-instr-use
CMAKE_CXX_FLAGS_RELWITHDEBINFO   -O2 -g -DNDEBUG

time make --jobs 8

real    9m1,768s
user    36m5,663s
sys     0m55,690s

/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw

CMAKE_CXX_FLAGS_RELEASE          -O3 -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.2.0/default.profdata

[peter.kovar@vmi728485 3.2.0]$ VVenC.sh 
+ COMPILER=AOCC
+ VERSION=3.2.0
+ CONFIGURATION=AOCC/3.2.0
+ ENCODER=/usr/local/AOCC/3.2.0/bin/vvencFFapp
+ OUTPUT_PATH=/home/peter.kovar/Video/VVC/AOCC/3.2.0
+ mkdir -p /home/peter.kovar/Video/VVC/AOCC/3.2.0
+ HORIZONTAL=1920
+ VERTICAL=1080
+ SIZE=1920x1080
+ RATE=24
+ NAME=Kimono1_1920x1080_24
+ for QP in 32
+ INPUT=/home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv
+ OUTPUT=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266
+ LOG=/home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266.log
+ nice /usr/local/AOCC/3.2.0/bin/vvencFFapp --InputFile /home/peter.kovar/Video/YUV/Kimono1_1920x1080_24.yuv --Size 1920x1080 --framerate 24 --InputBitDepth 8 --QP 32 --Threads 8 --BitstreamFile /home/peter.kovar/Video/VVC/AOCC/3.2.0/Kimono1_1920x1080_24.QP32.266

real    405m3,083s
user    2877m46,438s
sys     1m20,165s

/opt/AMD/aocc-compiler-3.2.0/bin/llvm-profdata merge -output=default.profdata default.profraw

[peter.kovar@vmi728485 3.2.0]$ file default.prof*
default.profdata: LLVM indexed profile data, version 7
default.profraw:  LLVM raw profile data, version 7

[peter.kovar@vmi728485 3.2.0]$ time make --jobs 8
[  1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/ParseArg.cpp.o
[  1%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/YuvFileIO.cpp.o
[  2%] Building CXX object source/Lib/apputils/CMakeFiles/apputils.dir/VVEncAppCfg.cpp.o
[  3%] Linking CXX static library ../../../../../../../../lib/release-static/libapputils.a
[  3%] Built target apputils
[  3%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AdaptiveLoopFilter.cpp.o
[  4%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/AffineGradientSearch.cpp.o
[  5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/BitStream.cpp.o
[  5%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/CodingStructure.cpp.o
[  7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ContextModelling.cpp.o
[  7%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Buffer.cpp.o
[  8%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Contexts.cpp.o
[  9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/DepQuant.cpp.o
[  9%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterPrediction.cpp.o
[ 10%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/InterpolationFilter.cpp.o
[ 11%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/IntraPrediction.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/LoopFilter.cpp.o
[ 12%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MCTF.cpp.o
[ 13%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/MatrixIntraPrediction.cpp.o
[ 14%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Mv.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/PicYuvMD5.cpp.o
[ 15%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Picture.cpp.o
[ 16%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/ProfileLevelTier.cpp.o
[ 17%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Quant.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ.cpp.o
[ 18%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/QuantRDOQ2.cpp.o
[ 19%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RdCost.cpp.o
[ 20%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Reshape.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Rom.cpp.o
[ 21%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/RomTr.cpp.o
[ 22%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SEI.cpp.o
[ 23%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SampleAdaptiveOffset.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/SearchSpaceCounter.cpp.o
[ 24%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Slice.cpp.o
[ 25%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o
[ 26%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TimeProfiler.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant.cpp.o
[ 27%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/TrQuant_EMT.cpp.o
[ 28%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/Unit.cpp.o
[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:188: source/Lib/vvenc/CMakeFiles/vvenc.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

real    0m50,479s
user    3m9,502s
sys     0m22,095s

Feb 02 '22 10:02 1div0

@adamjw24 Am I doing something wrong here?

Feb 02 '22 10:02 1div0

Hmm... from the log files I understand that a profile-based build needs profiling info for every object? That will not be possible with vvenc, for the following reasons:

The tracing and instrumentation functionalities are disabled by default, so those objects would be empty. Also, no one cares about the performance of tracing and instrumentation.
The decoding functionality is not used most of the time; it would be impractical to generate profiling data for it just for the sake of the build.
Features like weighted prediction are only used with specific configs, so those files might not be exercised either (i.e. no profiling data for them).

Feb 02 '22 11:02 adamjw24

Oh, wait, I just had a second look, and found the following:

[ 29%] Building CXX object source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/UnitPartitioner.cpp.o
error: no profile data available for file "StatCounter.cpp" [-Werror,-Wprofile-instr-unprofiled]
1 error generated.
make[2]: *** [source/Lib/vvenc/CMakeFiles/vvenc.dir/build.make:482: source/Lib/vvenc/CMakeFiles/vvenc.dir/__/CommonLib/StatCounter.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....

It looks like you should add the following flag to the build: -Wnoprofile-instr-unprofiled (or however the syntax is to disable -Wprofile-instr-unprofiled)

Feb 02 '22 11:02 adamjw24

Thank you.

-Wno-profile-instr-unprofiled

And the compiler frontend just exploded.

Going to report that to AMD. Attachments: EncCu-daf955.cpp.txt, EncCu-daf955.sh.txt

Feb 02 '22 12:02 1div0

You're welcome.

You might try an older clang version. We have had issues with bleeding-edge compilers a few times already.

Feb 02 '22 13:02 adamjw24

Closed accidentally. I misread the issue number to close.

Feb 04 '22 08:02 adamjw24

AOCC 3.1.0 based on LLVM 12.0.0 just compiled OK.

//Flags used by the CXX compiler during RELEASE builds.
CMAKE_CXX_FLAGS_RELEASE:STRING=-Ofast -flto -mllvm -enable-strided-vectorization -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -fprofile-instr-use=/usr/src/github.com/1div0/vvenc/Linux/x86-64/EPYC/AOCC/3.1.0/default.profdata
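Pulling the pieces of this thread together, a Clang-style PGO cycle has three steps: an instrumented build, a training run followed by a profile merge, and a rebuild that consumes the profile. The script below is a dry-run sketch of that cycle, not a tested vvenc build recipe. The build directory names, the run helper and the pgo_cycle function are hypothetical; llvm-profdata is assumed to be the one shipped with the compiler in use; and passing the flags through CMAKE_CXX_FLAGS is an assumption modeled on the cache entries quoted in this thread.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a three-step Clang/AOCC PGO cycle. All paths are placeholders.

SRC_DIR=${SRC_DIR:-.}                        # source checkout (assumed)
GEN_DIR=${GEN_DIR:-build-pgo-generate}       # instrumented build tree (hypothetical name)
USE_DIR=${USE_DIR:-build-pgo-use}            # optimized build tree (hypothetical name)
PROFDATA=${PROFDATA:-$PWD/default.profdata}  # merged profile
DRY_RUN=${DRY_RUN:-1}                        # 1 = only print the commands

# Print a command; execute it only when DRY_RUN is 0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

pgo_cycle() {
    # 1. Instrumented build: the binary writes raw counters when it exits.
    run cmake -S "$SRC_DIR" -B "$GEN_DIR" -DCMAKE_BUILD_TYPE=Release \
        "-DCMAKE_CXX_FLAGS=-fprofile-instr-generate"
    run cmake --build "$GEN_DIR" --parallel 8

    # 2. Training run: encode a representative clip with the instrumented
    #    encoder (command line omitted here). That leaves default.profraw in
    #    the working directory, which is then merged into an indexed profile.
    run llvm-profdata merge -output="$PROFDATA" default.profraw

    # 3. Optimized build that consumes the profile. Files that were never
    #    exercised have no profile data, so that warning is silenced.
    run cmake -S "$SRC_DIR" -B "$USE_DIR" -DCMAKE_BUILD_TYPE=Release \
        "-DCMAKE_CXX_FLAGS=-fprofile-instr-use=$PROFDATA -Wno-profile-instr-unprofiled"
    run cmake --build "$USE_DIR" --parallel 8
}

pgo_cycle
```

Keeping the instrumented and optimized builds in separate trees avoids reconfiguring one cache back and forth, which is where the -fprofile-instr-use flag without a path crept into the release flags earlier in the thread.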

Feb 15 '22 08:02 1div0

Result in https://düsseldorf.reflexion.tv/nextcloud/index.php/s/Y9E884z6wAkNNSZ

Feb 15 '22 08:02 1div0

Recently, I compiled the Clang compiler from LLVM v15 and ran into this:

/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
      currTU.jointCbCr = (TU::getCbf(currTU, COMP_Cb) | TU::getCbf(currTU, COMP_Cr)) ? bestJointCbCr : 0;
                          ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                      ||
/usr/src/github.com/1div0/vvenc/source/Lib/EncoderLib/IntraSearch.cpp:2509:27: note: cast one or both operands to int to silence this warning

Apr 02 '22 15:04 1div0

I think clang v15 is way too early to take its warnings seriously. We have had a lot of problems with early compiler versions, and I'd rather wait for them to mature a bit.

Might pull in the PR though, since it seems logical (no pun intended).

Apr 04 '22 13:04 adamjw24

So far there has been no word from AMD about the compiler crash. However, these results confirm my observation: https://www.phoronix.com/scan.php?page=article&item=amd-aocc-milanx&num=4

Apr 12 '22 11:04 1div0

Closing for now, as not really actionable. This is more informative.

Nov 29 '22 16:11 adamjw24
