Optimize fft_small for Intel CPUs
According to Daniel, vroundpd is a bottleneck on Intel:
> I noticed that things go noticeably faster on my Intel slab if the round function is replaced by basic arithmetic. I thought for sure rounding couldn't be slower than add/sub, but, sure enough, Intel have done it.
The cycle latencies on recent AMD and Intel chips are:

|         | AMD | Intel |
|---------|-----|-------|
| round   | 3   | 8     |
| add/sub | 3   | 4     |
| mul     | 3   | 4     |
| fmadd   | 4   | 4     |
I looked into the generated code and it does not look that great (at least on Skylake). By unrolling and doing other work directly after the roundings, we can hide the latency penalties. Currently the compiler wants to emit a vroundpd directly followed by a vfnmadd132pd acting on the same register.
I couldn't guide the compiler to do what I wanted it to do, so I think we have to resort to inline assembly here.
Edit: GCC and Clang generate different sequences (the text above refers to GCC), but neither is optimal, and neither compiler seems amenable to being guided.
In relation to #1832, it would be nice to implement different subroutines in fft_small based on register width (128 bits for NEON, 256 bits for AVX2, etc.) and the number of such registers. This will probably be more efficient, as I haven't seen GCC optimize these kernels well enough IMO. For AVX-512, the instruction VCVTTPD2QQ converts packed doubles directly to 64-bit integers with truncation, as an alternative to vroundpd.