VapourSynth-BM3D
AVX optimization
Hi Mawen, I've started doing some AVX optimization in your project.
As with MVTools, I haven't done any profiling yet; I'm just converting what was already converted to SSE2 and trying to figure out where to focus my effort.
You can find my commits here: https://github.com/MonoS/VapourSynth-BM3D/commit
Now I've reached the file Block.h and have started to understand a bit better what kind of processing BM3D does, so I have some questions.
In (V)BM3D_Basic/Final, what does srcGroup.size() return? If I understood correctly it uses group_size, am I right? In that case a simd_step of 8 is perfect.
In Block.h there are the two functions BlockMatching and BlockMatchingMulti; do Height() and Width() refer to block_size? In that case a simd_step of 16 would be a bit too much, resulting in far less performance than the SSE2 counterpart [because of the use of only the C path]. I already have an idea of how to fix this.
Do you think there are any other functions where AVX [or any earlier instruction set] could help BM3D perform better? If so, let me know :)
Hi MonoS, thanks for your effort. srcGroup.size() is the number of matched blocks; it varies from 1 to group_size. For blocks with many similar blocks nearby, it should be exactly group_size, though. Height() and Width() in Block.h are block_size, so mainly a Width() of 8 is what needs to be optimized.
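For concreteness, here is the kind of per-row loop a simd_step of 8 maps onto, as a scalar sketch. This is hypothetical code, not the actual BlockMatching implementation; the distance metric here is a plain sum of squared differences:

```cpp
#include <cstddef>

// Hypothetical scalar sketch of a block-distance inner loop for an
// 8x8 block (block_size = 8), using a plain sum of squared differences.
// With a simd_step of 8, each row would map onto one 8-lane AVX op.
// Not the actual BlockMatching code from this repo.
float block_ssd8(const float *a, const float *b,
                 std::ptrdiff_t strideA, std::ptrdiff_t strideB)
{
    float sum = 0.0f;
    for (int y = 0; y < 8; ++y)
    {
        for (int x = 0; x < 8; ++x)
        {
            float d = a[x] - b[x];
            sum += d * d;
        }
        a += strideA;
        b += strideB;
    }
    return sum;
}
```

With Width() fixed at 8, the inner loop is exactly one 256-bit float vector wide, which is why a simd_step of 16 would leave the kernel on the scalar fallback.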
By the way, I've tried converting to AVX before, but the improvement was very small. HolyWu has also tried AVX+FMA, with a similar outcome. I'm looking forward to seeing if you can do better.
The biggest bottleneck of BM3D is DCT and IDCT right now. FFTW3 implements DCT by converting it to DFT, which is not that efficient.
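For reference, the DCT-II that has to be computed per block dimension is, written naively (this is just the textbook definition without normalization, as a reference point; it is not how FFTW3 or the plugin implements it):

```cpp
#include <array>
#include <cmath>

// Naive 8-point DCT-II, the transform applied per block dimension.
// Textbook definition only, with no normalization factors; FFTW3
// instead routes this computation through a DFT.
std::array<double, 8> dct8(const std::array<double, 8> &x)
{
    const double pi = 3.14159265358979323846;
    std::array<double, 8> X{};
    for (int k = 0; k < 8; ++k)
    {
        double s = 0.0;
        for (int n = 0; n < 8; ++n)
            s += x[n] * std::cos(pi * (n + 0.5) * k / 8.0);
        X[k] = s;
    }
    return X;
}
```

A hand-written 8-point transform with precomputed cosine constants avoids the DFT detour entirely, which is where a vectorized version could pay off.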
On what generation of CPU did you test it? IIRC, older models than Ivy Bridge still execute 256-bit operations as 128-bit ops on the 256-bit registers, resulting in less uop-cache pressure but the same throughput as SSE; AMD CPUs have a similar problem [not counting the bottleneck in the decoder].
I'll try my optimization on Ivy Bridge and Haswell, but HolyWu has probably already tried it on a Haswell.
I can also try to make an AVX version of the DCT/IDCT if I get really involved.
In fact you were right: at least for hard-thresholding there is no benefit to using AVX. It's probably all down to the frontend, because while the newer instructions may double the throughput, they also double the uop count. I'll study the kernel a bit more carefully [more for study than anything else; in fact I don't even use the basic estimate, mvtools works a lot better], but I don't think it can be improved further.
I did notice a 3-6% improvement using my SSE2 version [ANDing the fp number to zero the sign bit; platform dependent, but who doesn't use IEEE 754?]. The C kernel is probably faster using the or [due to short-circuiting], but maybe the C version can be sped up with the same method. EDIT: it can, it's 5 cycles faster; doing a commit right now.
The only way to get a bit more performance out of this kernel is to use AVX2's _mm256_sub_epi32: that way we can remove two _mm256_extractf128_si256 and an _mm256_zeroupper, and substitute the two _mm_sub_epi32 with a single 256-bit one [BTW, your optimizations made me go "wow" quite a bit :D ], but that's still only 5-6% more.
EDIT2: I almost forgot, here is the test code: http://pastebin.com/kknAEp1C (sorry if it is quite messy).
My CPU is Ivy Bridge, HolyWu's should be Haswell. I've only tried AVX on block matching though, so I'm not sure about the filtering part.
As for the fp abs trick, the SSE2 version is nice, but the C version violates strict aliasing, so for safety I probably won't use it.
You mean subtracting 0xFFFFFFFF in cmp_sum? It's actually meant to add 1 when accumulating. So there might be some trick to avoid _mm256_sub_epi32, like adding a 0x00000001 to cmp_sum as float?
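To spell the trick out: a packed compare produces all-ones (0xFFFFFFFF, which is -1 as a signed 32-bit integer) in each matching lane, so subtracting the mask from an accumulator adds 1 per match. A scalar sketch of what _mm_sub_epi32 / _mm256_sub_epi32 on a compare result does per lane (hypothetical function, not from the repo):

```cpp
#include <cstdint>

// Scalar sketch of counting via mask subtraction: a "true" compare
// yields all-ones (0xFFFFFFFF == -1 as a signed 32-bit int), so
// subtracting the mask from the accumulator adds 1 per hit. This is
// what _mm_sub_epi32 / _mm256_sub_epi32 does per lane on a compare
// result. Hypothetical helper, not from the repo.
int count_below(const float *data, int n, float thr)
{
    std::int32_t sum = 0;
    for (int i = 0; i < n; ++i)
    {
        std::int32_t mask = (data[i] < thr) ? -1 : 0;  // compare mask
        sum -= mask;                                   // +1 when true
    }
    return sum;
}
```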
And part of the test code's results:

batch: 15  1199993702 1199993702 1199993702
Cycles C  : 15.103290
Cycles SSE:  9.820708
Cycles AVX: 20.361397
Gain SSE/AVX: 0.964640

batch: 16  1274995756 1274995756 1274995756
Cycles C  : 15.071425
Cycles SSE:  9.808172
Cycles AVX: 19.776554
Gain SSE/AVX: 0.991899

batch: 17  1350001658 1350001658 1350001658
Cycles C  : 15.101697
Cycles SSE:  9.710577
Cycles AVX: 20.474184
Gain SSE/AVX: 0.948568

batch: 18  1424995634 1424995634 1424995634
Cycles C  : 15.126595
Cycles SSE:  9.733429
Cycles AVX: 20.105217
Gain SSE/AVX: 0.968249

batch: 19  1499993041 1499993041 1499993041
Cycles C  : 15.062799
Cycles SSE:  9.741582
Cycles AVX: 19.677139
Gain SSE/AVX: 0.990142
I've redone the ffabs trick in the C version; now it doesn't break aliasing rules and appears to also be faster than the previous solution.
At least for hard thresholding, as you saw, AVX only decreases performance for now.
The subtracting-0xFFFFFFFF trick is awesome; it took me about 20 minutes to wrap my head around it, and I had your code as an example. I think something can be done, but only if the number of coefficients never goes above 2^24 [the fp mantissa]; but then we'd need a method to transform 0xFFFFFFFF into 0x1, which may be slower. I should check.
I hope you like my test code; sorry that part of it is in Italian, my native language, but I hope it's understandable anyway :)
http://blog.qt.io/blog/2011/06/10/type-punning-and-strict-aliasing/ If I'm not mistaken, this usage of a union still violates strict aliasing?
More specifically, *fi.i is modified but never referenced afterwards, so the compiler's optimizer may simply skip that assignment.
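A well-defined way to keep the trick is to go through memcpy instead of a union or pointer cast; compilers still reduce the whole function to a single AND on the float's bits. A sketch (the function name is illustrative, not from the repo):

```cpp
#include <cstdint>
#include <cstring>

// Sign-bit fabs without aliasing violations (assumes IEEE 754 floats):
// std::memcpy performs a defined bit-copy, and compilers reduce this
// whole function to a single AND, just like the SSE2 version does with
// a 0x7FFFFFFF mask.
inline float fabs_bits(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);  // float bits -> integer
    u &= 0x7FFFFFFFu;               // clear the sign bit
    std::memcpy(&x, &u, sizeof x);  // integer bits -> float
    return x;
}
```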
Yes, you're right: even if GCC didn't delete it, some other compiler might (though I don't think so).
I'll try some things and let you know.
I've tried it on VS2015 and the result is correct, though. By the way, there are some problems in https://github.com/MonoS/VapourSynth-BM3D/commit/421fa782ee60b7a106597d2243fe90d42a247f90; I fixed them when merging it: https://github.com/HomeOfVapourSynthEvolution/VapourSynth-BM3D/commit/126313969ee870e1ea41a6f60f04d2cc9243ae3c.
Hi,
Something that might assist you guys with the optimization:
https://github.com/fenbf/AwesomePerfCpp