vapoursynth-mvtools v22 slower than v21?

As I measured here: https://forum.doom9.org/showthread.php?p=1910541#post1910541 , the new version with speed improvements seems to be slower than the previous one. Are the CPU instruction sets properly detected? I noticed that the code doing the detection is quite old and may not cope with these new-generation AMD Ryzens (I'm running a 3900X).

May 06 '20 04:05 Boulder08

Only if AMD changed the way they signal AVX2 support. I don't think they did?

Because of the parameters you used, neither Super nor Degrain1 is using the new AVX2 code, which means it's Analyse that got slower.

Do you see a difference between v21 and v22 when you run Analyse on a 16 bit clip?

May 06 '20 13:05 dubhater

The difference seems to be consistent.

Analyse 16-bits, v22: 26.22 fps
Analyse 16-bits, v21: 28.07 fps
Analyse 8-bits, v22: 55.74 fps
Analyse 8-bits, v21: 57.96 fps
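
To put the gap in perspective, the figures above can be turned into relative differences. This is only a sketch using the four fps values quoted in this post; the helper name `relative_change` is made up for illustration.

```python
def relative_change(new_fps, old_fps):
    """Return the change of new_fps relative to old_fps, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


# (v22 fps, v21 fps) pairs copied from the measurements in this post.
results = {
    "Analyse 16-bits": (26.22, 28.07),
    "Analyse 8-bits": (55.74, 57.96),
}

for name, (v22, v21) in results.items():
    print(f"{name}: v22 is {relative_change(v22, v21):+.1f}% relative to v21")
```

Both cases come out a few percent slower for v22, which is small enough that run-to-run noise is worth ruling out by repeating each measurement.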

Which features of MSuper or MDegrainx are supposed to be optimized? I could test them as well.

May 06 '20 14:05 Boulder08

Degrain with 8 bit clips, Super with sharp=0 or 2.

May 06 '20 15:05 dubhater

Same thing with those, v21 is faster.

sharp=2, v22: 55.97 fps
sharp=2, v21: 58.17 fps
sharp=2, Degrain 8-bits, v22: 64.51 fps
sharp=2, Degrain 8-bits, v21: 66.67 fps

May 06 '20 15:05 Boulder08

Just for fun, I checked what x264 shows:

x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

So at least it's working properly.

May 06 '20 15:05 Boulder08

Is v22 compiled with Visual Studio faster than v21? See attached.

vapoursynth-mvtools.zip

May 06 '20 19:05 sekrit-twc

Yes, it seems to be faster. Rerunning those first tests (8-bit Analyse and 16-bit degraining), I got 60.43 fps.

May 07 '20 04:05 Boulder08

Tried compiling with GCC 9 on Linux. v22 is running faster than v21 for me. Maybe the issue is related to MinGW and cross-compilation.

Script from Doom9 thread:

import vapoursynth as vs

core = vs.core
core.num_threads = 1

core.std.LoadPlugin("/home/user/src/vapoursynth-mvtools/.libs/libmvtools.so")

c = core.std.BlankClip(format=vs.YUV420P8) * 100
s = core.mv.Super(c, pel=2, chroma=True, rfilter=4, sharp=1)

kwargs = {"blksize": 16, "overlap": 8, "search": 5, "searchparam": 8, "pelsearch": 8, "truemotion": False}
b1v = core.mv.Analyse(s, isb=True, delta=1, **kwargs)
f1v = core.mv.Analyse(s, isb=False, delta=1, **kwargs)

kwargs = {"thsad": 200, "thsadc": 100, "limit": 1, "limitc": 2, "thscd1": 300, "thscd2": 80}
c = core.mv.Degrain1(c, s, b1v, f1v, **kwargs)
c.set_output()
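
For anyone who wants to time the script without an external tool, here is a rough sketch of a sequential fps measurement. `measure_fps` is a hypothetical helper, not part of VapourSynth; it takes any callable that fetches frame `n`, so it does not depend on a particular API.

```python
import time


def measure_fps(fetch_frame, num_frames):
    """Fetch frames 0..num_frames-1 one by one and return frames per second."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    start = time.perf_counter()
    for n in range(num_frames):
        fetch_frame(n)  # result is discarded; only the elapsed time matters
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return num_frames / elapsed if elapsed > 0 else float("inf")
```

With the script above, something like `measure_fps(c.get_frame, c.num_frames)` should work, assuming the synchronous `get_frame` call is acceptable for a single-threaded test. The first frames include plugin start-up cost, so longer runs give steadier numbers.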

Profiler results. Units are perf "cycles" events, which is a proxy for time. In this script, the AVX2 code is offering negligible speedup, because the bulk of the compute is not in SIMD code anyway, due to the mv.Super mode. The fps gains are instead coming from templating and specializing the control flow for the motion estimation.

Kernels
sym	v21	v23	v23/v21
HorizontalBicubic	34449	43948	1.275741
VerticalBicubic	17062	17427	1.021393
ToPixels_uint16_t_uint8_t	11051	12583	1.13863
SADWrapperU8_AVX2<16u, 16u>::sad_u8_avx2	6707	8806	1.312957
__memset_avx2_erms	7171	7903	1.102078
SADWrapperU8<8u, 8u>::sad_u8_sse2	12621	6507	0.515569
Degrain_avx2<1, 16, 16>	9277	6028	0.649779
Degrain_avx2<1, 8, 8>	5395	4100	0.759963
RB2Cubic	4513	3595	0.796588
copyBlock<16u, 16u>	3974	3351	0.843231
overlaps_avx2<16, 16>	5026	3166	0.629924
overlaps_avx2<8, 8>	2753	2484	0.902288
copyBlock<8u, 8u>	2755	2294	0.832668
__memmove_avx_unaligned_erms	1934	1923	0.994312
PadReferenceFrame	895	1206	1.347486
LimitChanges_sse2	930	914	0.982796
SUM	126513	126235	0.997803

Control Flow
v21
pobExpandingSearch	41311
pobSearchMVs	32305
pobUMHSearch	25482
mvdegrainGetFrame<1>	17290
pobInterpolatePrediction	11247
mvpGetAbsolutePointerPel2	3681
pobHex2Search	2951
pobLumaSAD	2006
mvpGetAbsolutePointerPel1	1989
mvpGetAbsolutePointer	1455
pobRefine	1331
SUM	141048

v23
pobExpandingSearch<0, 0>	36834
pobUMHSearch<0, 1>	28107
mvdegrainGetFrame<1>	13456
doPobSearchMVs<0, 1>	11239
pobFetchPredictors	6858
pobInterpolatePrediction	5472
pobExpandingSearch<0, 1>	4954
doPobSearchMVs<0, 0>	3511
mvpGetAbsolutePointerPel2	2903
pobHex2Search<0, 1>	2792
pobGetRefBlockU<1>	1970
mvpGetAbsolutePointerPel1	1938
pobGetRefBlockV<1>	1883
mvpGetAbsolutePointer	1606
pobRefine<0, 1>	802
SUM	124325
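
As a quick cross-check of the claim that the gain comes from control flow rather than the kernels, the SUM rows of the tables above can be compared directly. This sketch only uses the four totals already listed; the variable names are illustrative.

```python
# SUM rows copied from the "Kernels" and "Control Flow" tables (perf cycles events).
totals = {
    "kernels": {"v21": 126513, "v23": 126235},
    "control_flow": {"v21": 141048, "v23": 124325},
}


def ratio(group):
    """Return v23 cycles divided by v21 cycles for one group of symbols."""
    return group["v23"] / group["v21"]


# Combined cycles per version across both groups.
overall = {v: sum(g[v] for g in totals.values()) for v in ("v21", "v23")}

for name, group in totals.items():
    print(f"{name}: v23/v21 = {ratio(group):.4f}")
print(f"overall: v23/v21 = {overall['v23'] / overall['v21']:.4f}")
```

The kernel total is essentially unchanged, while the control-flow total drops by roughly 12%, which matches the explanation given with the profiler results.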

May 29 '20 01:05 sekrit-twc

Which compiler flags did you use? (And Autotools or Meson?)

May 29 '20 12:05 dubhater

Default autotools build (./configure && make).

May 29 '20 17:05 sekrit-twc

Hmm. The default with Makefile.am is -O2. Meson defaults to -O3. I compiled the v22 and v23 DLLs using Meson. (I don't know about the older ones.) Perhaps that's what makes it slower?

May 30 '20 11:05 dubhater

I ran some tests with the above script, and for me r22 and r23 are slightly faster than r21 (~4%).

GCC 10 builds are ~10% bigger than GCC 9 but just a tiny bit faster (~2%).

On my Zen 2 CPU I used -march=native -O2 -ftree-vectorize -fdevirtualize-at-ltrans -flto=16 -pipe, but note that -O2 in GCC 10 is slightly different (it now includes -finline-functions).

May 30 '20 13:05 4re

@Boulder08 Here is v23 compiled with -O2 instead of -O3. That's the only difference. Please test again. vapoursynth-mvtools-v23-O2-win64.zip

Jun 06 '20 19:06 dubhater

2500 frames of a test script of analysis and degraining in 16 bits:

v23-normal: 12.62 fps
v23-O2: 12.07 fps
v23-clang build from doom9: 13.32 fps

So the -O2 build was definitely slower.

Jun 07 '20 11:06 Boulder08
