dudect
mfence vs lfence vs cpuid
We've been a bit lazy about how we use RDTSC. The original code (written probably about 10 years ago) had this comment:
Intel actually recommends calling CPUID to serialize the execution flow
and reduce variance in measurement due to out-of-order execution.
We don't do that here yet.
see §3.2.1 http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
That link is gone, but the paper can be found in mirrors. It's a good resource and has the following advice. We should probably just follow it:
Resources:
- https://github.com/oreparaz/dudect/pull/30
- This is how @dgruss does it: https://github.com/IAIK/cache_template_attacks/blob/main/cacheutils.h#L24-L31
    uint64_t rdtsc() {
      uint64_t a, d;
      asm volatile ("mfence");
      asm volatile ("rdtsc" : "=a" (a), "=d" (d));
      a = (d<<32) | a;
      asm volatile ("mfence");
      return a;
    }
- This is how libcpucycles does it: https://cpucycles.cr.yp.to/libcpucycles-20230115/cpucycles/amd64-tscasm.c.html
    long long ticks(void)
    {
      unsigned long long result;
      asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
        : "=a"(result) :: "%rdx");
      return result;
    }
- And the motivated reader can go through Agner Fog's tools and see: https://www.agner.org/optimize/
The test programs use the serializing instruction CPUID before and after reading the time stamp counter in order to prevent out-of-order execution to interfere with the measurements.
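For comparison with the snippets above, here is a minimal sketch of the CPUID-based approach, assuming x86-64, GCC or Clang inline asm, and a CPU that supports RDTSCP. It follows the pattern usually attributed to the Intel benchmarking paper (CPUID then RDTSC at the start, RDTSCP then CPUID at the end). The names `tsc_begin` and `tsc_end` are hypothetical, not something dudect has today.

```c
#include <stdint.h>

/* Hypothetical helper: timestamp taken at the start of a measurement.
 * CPUID is a serializing instruction, so earlier instructions retire
 * before RDTSC reads the counter. */
static inline uint64_t tsc_begin(void) {
    uint32_t lo = 0, hi; /* lo doubles as the CPUID leaf (0) on input */
    __asm__ volatile(
        "cpuid\n\t"
        "rdtsc\n\t"
        : "+a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: timestamp taken at the end of a measurement.
 * RDTSCP waits for prior instructions to execute before reading the
 * counter; the trailing CPUID keeps later instructions from starting
 * before the read has happened. */
static inline uint64_t tsc_end(void) {
    uint32_t lo, hi;
    __asm__ volatile(
        "rdtscp\n\t"
        "mov %%eax, %0\n\t"
        "mov %%edx, %1\n\t"
        "xor %%eax, %%eax\n\t" /* select CPUID leaf 0 */
        "cpuid\n\t"
        : "=r"(lo), "=r"(hi)
        :
        : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

The code under test would go between a `tsc_begin()` and a `tsc_end()` call, with the difference taken as the cycle count. CPUID itself costs a noticeable and somewhat variable number of cycles, which is one reason the fence-based variants are worth comparing.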
More resources
- https://github.com/itzmeanjan/criterion-cycles-per-byte/blob/a270a49652eabf5be9366866613f905f604a18ba/src/lib.rs#L50-L59
- https://github.com/pornin/crrl/blob/9f80f859db9073b5725dc6671e4adbe6299a642c/benches/util.rs#L1-L17
- https://github.com/torvalds/linux/blob/052d534373b7ed33712a63d5e17b2b6cdbce84fd/arch/x86/include/asm/msr.h#L201-L214
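Since the title also mentions lfence, here is a minimal sketch of an LFENCE-fenced read, assuming x86-64 and the `<x86intrin.h>` intrinsics from GCC or Clang. The function name `rdtsc_lfence` is hypothetical. Intel documents LFENCE as not executing until prior instructions have completed locally; on AMD the ordering guarantee depends on how the CPU is configured, so treat this as something to measure rather than as a drop-in replacement.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc, _mm_lfence */

/* Hypothetical helper: read the TSC with an LFENCE on each side.
 * The first fence keeps RDTSC from being executed ahead of earlier
 * instructions; the second keeps later instructions from starting
 * before the counter has been read. */
static inline uint64_t rdtsc_lfence(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

Compared with the mfence version above, this orders instruction execution rather than draining pending stores, and it avoids the cost of CPUID.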