
Optimize fft_small for Intel CPUs

Open fredrik-johansson opened this issue 2 years ago • 2 comments

According to Daniel, vroundpd is a bottleneck on Intel:

I noticed that things go noticeably faster on my Intel slab if the round function is replaced by basic arithmetic. I thought for sure rounding couldn't be slower than add/sub, but, sure enough, Intel have done it.

The cycle latencies on recent amd and intel chips are:

          amd       intel
round:    3         8
add/sub:  3         4
mul:      3         4
fmadd:    4         4

fredrik-johansson avatar Dec 07 '23 04:12 fredrik-johansson

I looked into the generated code and it does not look that great (at least on Skylake). By unrolling and scheduling other work directly after each rounding, we can hide the latency penalties. Currently it issues a vroundpd immediately followed by a vfnmadd132pd acting on the same register, which puts the full rounding latency on the critical path.

I couldn't guide the compiler to do what I wanted it to do, so I think we have to resort to inline assembly here.

Edit: GCC and Clang generate different sequences (the text above refers to GCC), but neither is optimal, and neither compiler seems to be guidable toward the desired code.

albinahlback avatar Feb 21 '24 01:02 albinahlback

In relation to #1832, it would be nice to implement different subroutines in fft_small based on register width (128 bits for NEON, 256 bits for AVX2, etc.) and the number of such registers. This will probably be more efficient, as I haven't seen GCC optimize these well enough IMO. For AVX-512, there is the instruction VCVTTPD2QQ, which converts packed doubles to 64-bit integers with truncation, as an alternative to vroundpd.

albinahlback avatar Mar 14 '24 00:03 albinahlback