Results 308 comments of Travis Downs

@kobalicek - oops, good point, I forgot that `bt` is totally read-only. Here's another one I noticed: `blendvpd xmm, xmm, xmm0 : Lat: 0.50 Rcp: 0.50`. I also got...

> Yeah I'm also getting 0.2 reciprocal throughput on some instructions on Ryzen, but apparently Ryzen is capable of executing 5 instructions per cycle if they are in uop cache....

My first CNL results look all wrong:
```
add r8, r8 : Lat: 0.66 Rcp: 0.20
add r8, i8 : Lat: 0.66 Rcp: 0.20
add r16, r16 : Lat: 0.66...
```

Yes, but `rdtscp` measures wall-clock time, not cycles. So it will always be wrong (in cycles) if the chip has turbo.

The "fix" is either to force the user to turn off turbo (you can see how I do this programmatically here: https://github.com/travisdowns/uarch-bench/blob/master/uarch-bench.sh#L66), or to do a calibration that allows you...
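As a rough illustration of the calibration idea (a sketch under assumptions, not necessarily how uarch-bench or cult does it): time a loop whose true cycle count is assumed known, derive the ratio of real core cycles to TSC ticks, and scale later measurements by that ratio. The function names and all numbers below are hypothetical.

```python
def cycles_per_tick(known_cycles: float, measured_ticks: float) -> float:
    """Ratio of real core cycles to TSC ticks, taken from a calibration loop
    whose cycle count is assumed to be known in advance."""
    if measured_ticks <= 0:
        raise ValueError("measured_ticks must be positive")
    return known_cycles / measured_ticks


def ticks_to_cycles(measured_ticks: float, ratio: float) -> float:
    """Convert a TSC-based measurement into estimated core cycles."""
    return measured_ticks * ratio


# Hypothetical calibration run: a chain of 1,000,000 dependent adds
# (assumed to take 1 cycle each) is observed to take 660,000 TSC ticks
# because the core clock is running faster than the TSC.
ratio = cycles_per_tick(1_000_000, 660_000)

# A latency that reads as 0.66 TSC ticks then corrects to about 1 cycle.
corrected = ticks_to_cycles(0.66, ratio)
```

The correction is only as good as the assumption about the calibration loop, and it assumes the core frequency stays the same between calibration and measurement.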

Yeah, many moons ago there was no frequency scaling (neither turbo nor anti-turbo, i.e., scaling below the nominal frequency), so `rdtsc` and real cycles were always the same. Then there...

@kobalicek - my experience with uarch-bench indicates that the calibration approach is fairly robust. At most you sometimes get a wrong calibration due to a wrong assumption: e.g., when I...

BTW, running now in parallel on SKX, SKL and CNL; results should be available in a few more minutes. FWIW, here's the script I used, which might be useful for...

> BTW don't wanna waste more of your time on this. I would have to fix the timing issues if I want better numbers, as I really didn't know it...

[SKL_rounded.txt](https://github.com/asmjit/cult/files/3283286/SKL_rounded.txt)