vapoursynth-mvtools

v22 slower than v21?

Open Boulder08 opened this issue 4 years ago • 14 comments

As I measured here: https://forum.doom9.org/showthread.php?p=1910541#post1910541 , the new version with speed improvements seems to be slower than the previous one. Are the CPU instruction sets properly detected? I noticed that the code doing the detection is quite old and may not handle these new-gen AMD Ryzens correctly (I'm running a 3900X).

Boulder08 avatar May 06 '20 04:05 Boulder08

Only if AMD changed the way they signal AVX2 support. I don't think they did?

Because of the parameters you used, neither Super nor Degrain1 are using the new AVX2 code, which means it's Analyse that got slower.

Do you see a difference between v21 and v22 when you run Analyse on a 16 bit clip?

dubhater avatar May 06 '20 13:05 dubhater

The difference seems to be consistent.

Analyse 16-bits, v22: 26.22 fps
Analyse 16-bits, v21: 28.07 fps
Analyse 8-bits, v22: 55.74 fps
Analyse 8-bits, v21: 57.96 fps

Which functionalities in MSuper or MDegrainx should be optimized? I could test them as well.

Boulder08 avatar May 06 '20 14:05 Boulder08

Degrain with 8 bit clips, Super with sharp=0 or 2.

dubhater avatar May 06 '20 15:05 dubhater

Same thing with those, v21 is faster.

sharp=2, v22: 55.97 fps
sharp=2, v21: 58.17 fps
sharp=2, Degrain 8-bits, v22: 64.51 fps
sharp=2, Degrain 8-bits, v21: 66.67 fps

Boulder08 avatar May 06 '20 15:05 Boulder08
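To put the figures reported so far on a common scale, here is a minimal sketch that converts each v21/v22 pair into a relative slowdown. The fps values are copied from the comments above; the helper name `slowdown_percent` and the row labels are only illustrative.

```python
# fps figures quoted in the comments above: (label, v21 fps, v22 fps)
results = [
    ("Analyse 16-bits", 28.07, 26.22),
    ("Analyse 8-bits", 57.96, 55.74),
    ("sharp=2", 58.17, 55.97),
    ("sharp=2, Degrain 8-bits", 66.67, 64.51),
]


def slowdown_percent(old_fps, new_fps):
    """Return how much slower new_fps is than old_fps, in percent."""
    return (old_fps - new_fps) / old_fps * 100.0


# Map each test label to the relative slowdown of v22 against v21.
slowdowns = {label: slowdown_percent(v21, v22) for label, v21, v22 in results}

for label, pct in slowdowns.items():
    print(f"{label}: v22 is {pct:.1f}% slower than v21")
```

By this measure the gap is a few percent in every test, and largest for 16-bit Analyse.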

Just for fun, I checked what x264 shows:

x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

So at least CPU detection is working properly in x264.

Boulder08 avatar May 06 '20 15:05 Boulder08
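A small sketch of checking that log line programmatically, for anyone repeating the comparison on several machines. It only parses the x264 output quoted above; it says nothing about what mvtools' own detection code reports. The function name `parse_cpu_capabilities` is hypothetical.

```python
def parse_cpu_capabilities(log_line):
    """Extract the capability names from an x264 'using cpu capabilities' line."""
    marker = "using cpu capabilities:"
    _, found, tail = log_line.partition(marker)
    if not found:
        # The marker is missing, so this is not a capabilities line.
        return set()
    return set(tail.split())


# Log line copied from the comment above.
line = "x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2"
caps = parse_cpu_capabilities(line)

print("AVX2 reported by x264:", "AVX2" in caps)
```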

Is v22 compiled with Visual Studio faster than v21? See attached.

vapoursynth-mvtools.zip

sekrit-twc avatar May 06 '20 19:05 sekrit-twc

Yes, it seems to be faster. Compared to those first tests with 8-bit Analyse and 16-bit degraining, I got 60.43 fps as the result.

Boulder08 avatar May 07 '20 04:05 Boulder08

Tried compiling with GCC 9 on Linux. v22 is running faster than v21 for me. Maybe the issue is related to MinGW and cross-compilation.

Script from Doom9 thread:

import vapoursynth as vs

core = vs.core
core.num_threads = 1

core.std.LoadPlugin("/home/user/src/vapoursynth-mvtools/.libs/libmvtools.so")

c = core.std.BlankClip(format=vs.YUV420P8) * 100
s = core.mv.Super(c, pel=2, chroma=True, rfilter=4, sharp=1)

kwargs = {"blksize": 16, "overlap": 8, "search": 5, "searchparam": 8, "pelsearch": 8, "truemotion": False}
b1v = core.mv.Analyse(s, isb=True, delta=1, **kwargs)
f1v = core.mv.Analyse(s, isb=False, delta=1, **kwargs)

kwargs = {"thsad": 200, "thsadc": 100, "limit": 1, "limitc": 2, "thscd1": 300, "thscd2": 80}
c = core.mv.Degrain1(c, s, b1v, f1v, **kwargs)
c.set_output()
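The thread does not say how the fps figures were obtained. As a rough sketch of one way to time a script like this from Python, the helper below calls a frame-rendering function in a loop and divides the frame count by the elapsed time. The names `measure_fps` and `fake_render` are hypothetical, and the stand-in workload only exists so the example runs without VapourSynth installed.

```python
import time


def measure_fps(render_frame, num_frames):
    """Call render_frame(n) for each frame index and return frames per second."""
    start = time.perf_counter()
    for n in range(num_frames):
        render_frame(n)
    elapsed = time.perf_counter() - start
    return num_frames / elapsed


def fake_render(n):
    """Stand-in workload; a real run would request frame n from the clip."""
    return sum(range(1000))


fps = measure_fps(fake_render, 200)
print(f"{fps:.2f} fps")
```

With the script above, `render_frame` could be something like `lambda n: c.get_frame(n)` and `num_frames` could be `c.num_frames`, assuming the clip is evaluated one frame at a time, which matches `core.num_threads = 1`.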

Profiler results. Units are perf "cycles" events, which are a proxy for time. In this script, the AVX2 code offers negligible speedup because the bulk of the compute is not in SIMD code anyway, due to the mv.Super mode. The fps gains instead come from templating and specializing the control flow for the motion estimation.

Kernels

sym                                          v21     v23   v23/v21
HorizontalBicubic                          34449   43948  1.275741
VerticalBicubic                            17062   17427  1.021393
ToPixels_uint16_t_uint8_t                  11051   12583  1.13863
SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2    6707    8806  1.312957
__memset_avx2_erms                          7171    7903  1.102078
SADWrapperU8<8u, 8u>::sad_u8_sse2          12621    6507  0.515569
Degrain_avx2<1, 16, 16>                     9277    6028  0.649779
Degrain_avx2<1, 8, 8>                       5395    4100  0.759963
RB2Cubic                                    4513    3595  0.796588
copyBlock<16u, 16u>                         3974    3351  0.843231
overlaps_avx2<16, 16>                       5026    3166  0.629924
overlaps_avx2<8, 8>                         2753    2484  0.902288
copyBlock<8u, 8u>                           2755    2294  0.832668
__memmove_avx_unaligned_erms                1934    1923  0.994312
PadReferenceFrame                            895    1206  1.347486
LimitChanges_sse2                            930     914  0.982796
SUM                                       126513  126235  0.997803

Control Flow

v21
pobExpandingSearch           41311
pobSearchMVs                 32305
pobUMHSearch                 25482
mvdegrainGetFrame<1>         17290
pobInterpolatePrediction     11247
mvpGetAbsolutePointerPel2     3681
pobHex2Search                 2951
pobLumaSAD                    2006
mvpGetAbsolutePointerPel1     1989
mvpGetAbsolutePointer         1455
pobRefine                     1331
SUM                         141048

v23
pobExpandingSearch<0, 0>     36834
pobUMHSearch<0, 1>           28107
mvdegrainGetFrame<1>         13456
doPobSearchMVs<0, 1>         11239
pobFetchPredictors            6858
pobInterpolatePrediction      5472
pobExpandingSearch<0, 1>      4954
doPobSearchMVs<0, 0>          3511
mvpGetAbsolutePointerPel2     2903
pobHex2Search<0, 1>           2792
pobGetRefBlockU<1>            1970
mvpGetAbsolutePointerPel1     1938
pobGetRefBlockV<1>            1883
mvpGetAbsolutePointer         1606
pobRefine<0, 1>                802
SUM                         124325

sekrit-twc avatar May 29 '20 01:05 sekrit-twc
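For reference, the third column of the Kernels table is consistent with v23 cycles divided by v21 cycles. The sketch below recomputes it from the counts above and compares the kernel totals with the control-flow totals. All numbers are copied from the profiler output; the variable names are only illustrative.

```python
# perf "cycles" counts copied from the Kernels table: symbol -> (v21, v23)
kernels = {
    "HorizontalBicubic": (34449, 43948),
    "VerticalBicubic": (17062, 17427),
    "ToPixels_uint16_t_uint8_t": (11051, 12583),
    "SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2": (6707, 8806),
    "__memset_avx2_erms": (7171, 7903),
    "SADWrapperU8<8u, 8u>::sad_u8_sse2": (12621, 6507),
    "Degrain_avx2<1, 16, 16>": (9277, 6028),
    "Degrain_avx2<1, 8, 8>": (5395, 4100),
    "RB2Cubic": (4513, 3595),
    "copyBlock<16u, 16u>": (3974, 3351),
    "overlaps_avx2<16, 16>": (5026, 3166),
    "overlaps_avx2<8, 8>": (2753, 2484),
    "copyBlock<8u, 8u>": (2755, 2294),
    "__memmove_avx_unaligned_erms": (1934, 1923),
    "PadReferenceFrame": (895, 1206),
    "LimitChanges_sse2": (930, 914),
}

# Ratio above 1.0 means the symbol cost more cycles in v23 than in v21.
ratios = {sym: v23 / v21 for sym, (v21, v23) in kernels.items()}

total_v21 = sum(v21 for v21, _ in kernels.values())
total_v23 = sum(v23 for _, v23 in kernels.values())

# Control-flow totals copied from the SUM rows of the two lists above.
control_v21, control_v23 = 141048, 124325

print(f"kernels:      {total_v23 / total_v21:.4f}")
print(f"control flow: {control_v23 / control_v21:.4f}")
for sym in sorted(ratios, key=ratios.get, reverse=True)[:3]:
    print(f"largest increase: {sym} ({ratios[sym]:.3f})")
```

The kernel total is nearly unchanged between the two versions, while the control-flow total drops by roughly 12%, which matches the explanation that the gains come from the specialized control flow.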

Which compiler flags did you use? (And Autotools or Meson?)

dubhater avatar May 29 '20 12:05 dubhater

Default autotools build (./configure && make).

sekrit-twc avatar May 29 '20 17:05 sekrit-twc

Hmm. The default with Makefile.am is -O2. Meson defaults to -O3. I compiled the v22 and v23 DLLs using Meson. (I don't know about the older ones.) Perhaps that's what makes it slower?

dubhater avatar May 30 '20 11:05 dubhater

I did some tests with the above script, and for me r22 and r23 are slightly faster than r21 (~4%).

GCC 10 builds are ~10% bigger than GCC 9 but just a tiny bit faster (~2%).

On my Zen 2 CPU I used -march=native -O2 -ftree-vectorize -fdevirtualize-at-ltrans -flto=16 -pipe, but -O2 for GCC 10 is slightly different (it now includes -finline-functions).

4re avatar May 30 '20 13:05 4re

@Boulder08 Here is v23 compiled with -O2 instead of -O3. That's the only difference. Please test again. vapoursynth-mvtools-v23-O2-win64.zip

dubhater avatar Jun 06 '20 19:06 dubhater

2500 frames of a test script of analysis and degraining in 16 bits:
v23-normal: 12.62 fps
v23-O2: 12.07 fps
v23-clang build from doom9: 13.32 fps

So the -O2 build was definitely slower.

Boulder08 avatar Jun 07 '20 11:06 Boulder08