VapourSynth-BM3D

AVX optimization

Open MonoS opened this issue 8 years ago • 9 comments

Hi Mawen, I've started doing some AVX optimization in your project.

As with MVTools, I haven't done any profiling; I'm just converting what was already written in SSE2 and trying to work out where to put my effort.

You can find my commits here https://github.com/MonoS/VapourSynth-BM3D/commit

I've now reached the file Block.h and started to understand a bit more about the kind of processing BM3D does, so I have some questions.

In (V)BM3D_Basic/Final, what value is returned by srcGroup.size()? If I understood correctly it is group_size, am I right? In that case a simd_step of 8 is perfect.

In Block.h there are the two functions BlockMatching and BlockMatchingMulti. Do Height() and Width() refer to block_size? In that case a simd_step of 16 would be a bit too much, resulting in far worse performance than the SSE2 counterpart [because only the C path would be used]. I already have an idea on how to fix this.

Do you think there are any other functions where AVX [or any earlier instruction set] could help BM3D perform better? If so, let me know :)

MonoS avatar Jan 01 '16 20:01 MonoS

Hi MonoS, thanks for your effort. srcGroup.size() is the number of matched blocks; it varies from 1 to group_size. For blocks with many similar blocks nearby, it will be equal to group_size though. Height() and Width() in Block.h are block_size, so the main case to optimize is a Width() of 8.

By the way, I've tried converting to AVX before, but the improvement was very small. HolyWu has also tried AVX+FMA, with a similar outcome. I'm looking forward to seeing whether you can do better.

The biggest bottleneck of BM3D is DCT and IDCT right now. FFTW3 implements DCT by converting it to DFT, which is not that efficient.

mawen1250 avatar Jan 02 '16 17:01 mawen1250

Which generation of CPU did you test it on? IIRC, models older than Ivy Bridge still execute 128-bit operations on 256-bit registers, resulting in less uop cache fill but the same performance as SSE; AMD CPUs have a similar problem [not counting the bottleneck in the decoder].

I'll try my optimization on Ivy Bridge and Haswell, but HolyWu has probably already tried it on a Haswell.

I can also try to make an AVX version of the DCT/IDCT if I get really involved.

MonoS avatar Jan 02 '16 18:01 MonoS

In fact you were right: at least for hard thresholding there is no benefit from using AVX. It's probably all down to the frontend, because while the newer instructions may double the throughput, they also double the uop count. I'll study the kernel a bit more carefully [more for learning than anything else; in fact I don't even use the basic estimate, MVTools works a lot better], but I don't think it can be improved further.

Instead, I noticed a 3-6% improvement with my SSE2 version [a bitwise AND on the fp number to zero the sign bit; platform dependent, but who doesn't use IEEE 754?]. The C kernel is probably faster using the OR [due to short-circuiting], but maybe the C version can be improved with the same method. EDIT: It can, 5 cycles faster; doing a commit right now.

The only way to get a bit more performance out of this kernel is to use AVX2's _mm256_sub_epi32. That way we can remove two _mm256_extractf128_si256 and an _mm256_zeroupper, and replace two _mm_sub_epi32 with a single 256-bit one [BTW, your optimization made me go "wow" quite a bit :D ], but that is still only 5-6% more.

EDIT2: I almost forgot, here is the test code: http://pastebin.com/kknAEp1C Sorry if it is quite messy.

MonoS avatar Jan 03 '16 00:01 MonoS

My CPU is Ivy Bridge, HolyWu's should be Haswell. I've only tried AVX on block matching though, so I'm not sure about the filtering part.

As for the fp abs trick, SSE2 version is nice, but the C version violates strict-aliasing so probably I won't use it for safety.
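For reference, a minimal sketch of how the sign-bit trick can be written in C++ without the aliasing problem, by copying the bits through std::memcpy instead of casting pointers. The name abs_by_mask is hypothetical and this is not the code from either repository; it assumes float is a 32-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float is a 32-bit IEEE 754 value (sign bit is bit 31).
static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Hypothetical helper, not taken from VapourSynth-BM3D:
// returns |x| by clearing the sign bit.
inline float abs_by_mask(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bits
    bits &= 0x7FFFFFFFu;                  // zero the sign bit, keep the rest
    std::memcpy(&x, &bits, sizeof x);     // write the modified bits back
    return x;
}
```

Compilers typically turn the two memcpy calls into plain register moves, so the form that respects strict aliasing does not have to cost anything over the pointer cast.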

You mean subtracting 0xffffffff in cmp_sum? It's actually meant to add 1 to the accumulator. So there might be some trick to avoid using _mm256_sub_epi32, like adding a 0x00000001 to cmp_sum as a float?
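To illustrate the counting trick being discussed, here is a minimal SSE2 sketch. A true compare lane is 0xFFFFFFFF, which is -1 as a 32-bit integer, so subtracting the mask adds 1 to the counter. The function name count_above and the threshold test are hypothetical stand-ins, not the actual kernel; a scalar loop handles the tail and is also the fallback when SSE2 is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical function: counts the elements of v[0..n) that are > thr.
inline int count_above(const float *v, std::size_t n, float thr)
{
    int count = 0;
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128 t = _mm_set1_ps(thr);
    __m128i cmp_sum = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        // Each lane becomes 0xFFFFFFFF where v > thr, otherwise 0.
        const __m128 mask = _mm_cmpgt_ps(_mm_loadu_ps(v + i), t);
        // 0xFFFFFFFF is -1 as int32, so subtracting it adds 1 to that lane.
        cmp_sum = _mm_sub_epi32(cmp_sum, _mm_castps_si128(mask));
    }
    // Horizontal sum of the four per-lane counters.
    std::int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lanes), cmp_sum);
    count = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar tail (and full fallback without SSE2).
    for (; i < n; ++i)
        count += (v[i] > thr) ? 1 : 0;
    return count;
}
```

With 256-bit float compares, the same subtraction needs either AVX2's _mm256_sub_epi32 or a split into two 128-bit halves, which is the overhead mentioned above.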

And part of the test code's results:

batch: 15
1199993702 1199993702 1199993702
Cicli C : 15.103290
Cicli SSE: 9.820708
Cicli AVX: 20.361397
Guadagno SSE/AVX: 0.964640

batch: 16
1274995756 1274995756 1274995756
Cicli C : 15.071425
Cicli SSE: 9.808172
Cicli AVX: 19.776554
Guadagno SSE/AVX: 0.991899

batch: 17
1350001658 1350001658 1350001658
Cicli C : 15.101697
Cicli SSE: 9.710577
Cicli AVX: 20.474184
Guadagno SSE/AVX: 0.948568

batch: 18
1424995634 1424995634 1424995634
Cicli C : 15.126595
Cicli SSE: 9.733429
Cicli AVX: 20.105217
Guadagno SSE/AVX: 0.968249

batch: 19
1499993041 1499993041 1499993041
Cicli C : 15.062799
Cicli SSE: 9.741582
Cicli AVX: 19.677139
Guadagno SSE/AVX: 0.990142

mawen1250 avatar Jan 04 '16 12:01 mawen1250

I've redone the ffabs trick in the C version; now it doesn't break the aliasing rules and it also appears to be faster than the previous solution.

At least for hard thresholding, as you saw, AVX only decreases performance for now.

Subtracting 0xFFFFFFFF is awesome; it took me about 20 minutes to figure out, even though I had your code as an example. I think something can be done, but only if the number of coefficients never exceeds 2^24 [the fp mantissa], and then we need a way to turn a 0xFFFFFFFF into 0x1 that may be slower. I should check.

I hope you like my test code. Sorry if it's partly in Italian, but that's my native language; I hope it is understandable anyway :)
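A rough scalar sketch of both points, under the assumption of 32-bit IEEE 754 floats. ANDing the compare mask with the bit pattern of 1.0f gives 1.0f for a true lane and 0.0f for a false one (with SSE the equivalent would be _mm_and_ps(mask, _mm_set1_ps(1.0f))), and a float counter stops growing once it reaches 2^24. The names mask_to_one and float_counter_saturates are hypothetical, and whether this beats the integer subtraction is untested here.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float is a 32-bit IEEE 754 value.
static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Hypothetical helper: maps a compare mask (0xFFFFFFFF or 0) to 1.0f or 0.0f
// by ANDing it with the bit pattern of 1.0f (0x3F800000).
inline float mask_to_one(std::uint32_t mask)
{
    const float one = 1.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &one, sizeof bits);
    bits &= mask;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Hypothetical check: a float counter holding 2^24 no longer changes
// when 1.0f is added, because 2^24 + 1 is not representable.
inline bool float_counter_saturates()
{
    volatile float c = 16777216.0f;  // 2^24; volatile forces a real float store
    c = c + 1.0f;
    return c == 16777216.0f;
}
```

So a float accumulator is only exact while each lane's count stays below 2^24, which is the limit mentioned above.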

MonoS avatar Jan 04 '16 20:01 MonoS

http://blog.qt.io/blog/2011/06/10/type-punning-and-strict-aliasing/ If I'm not mistaken, this usage of a union still violates strict aliasing?

More specifically, *fi.i is modified but never referenced afterwards, so the compiler's optimizer may just skip this assignment.

mawen1250 avatar Jan 05 '16 07:01 mawen1250

Yes, you're right; even if GCC didn't delete it, some other compiler might (but I don't think so).

I'll try some thingzzz, and let you know

MonoS avatar Jan 23 '16 17:01 MonoS

I've tried it on VS2015, and the result is correct though. By the way, there were some problems in https://github.com/MonoS/VapourSynth-BM3D/commit/421fa782ee60b7a106597d2243fe90d42a247f90; I fixed them when merging it: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-BM3D/commit/126313969ee870e1ea41a6f60f04d2cc9243ae3c.

mawen1250 avatar Jan 25 '16 03:01 mawen1250

Hi,

Something that might assist you guys with the optimization:

https://github.com/fenbf/AwesomePerfCpp

RoyiAvital avatar Jun 18 '17 22:06 RoyiAvital