uarch-bench
`rep stos` appearing in benchmarked region
If you look at the core region of the innermost method of a benchmark in the libpfc case, you find a `rep stos` instruction inside the timed region, as follows:
```
40792a: shl rdx,0x20
40792e: or rdx,rax
407931: add QWORD PTR [rbp+0x28],rdx
407935: mov rcx,0x3
40793c: rdpmc
40793e: shl rdx,0x20
407942: or rdx,rax
407945: add QWORD PTR [rbp+0x30],rdx
407949: lfence
40794c: mov rdi,QWORD PTR [rsp]
407950: mov rsi,QWORD PTR [rsp+0x8]
407955: call 47f680 <dep_add_rax_rax>
40795a: mov rdi,rbx
40795d: mov rax,r12
407960: mov ecx,0x7
407965: rep stos QWORD PTR es:[rdi],rax <<< this guy
407968: lfence
40796b: mov rcx,0x40000000
407972: rdpmc
407974: shl rdx,0x20
407978: or rdx,rax
40797b: add QWORD PTR [rbx],rdx
40797e: mov rcx,0x40000001
407985: rdpmc
407987: shl rdx,0x20
40798b: or rdx,rax
40798e: add QWORD PTR [rbx+0x8],rdx
407992: mov rcx,0x40000002
407999: rdpmc
40799b: shl rdx,0x20
```
The code before and after issues `rdpmc` to read the performance counters, and the actual timed call is `dep_add_rax_rax`, but the presence of the `rep stos` is unfortunate, since it's slow, invokes microcode and so on. It's there because of:
```
struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
    ...
```
and
```
static now_t now() {
    LibpfcNow now = {};
```
which zero-initializes the counter array. The existing macros either `add` to (`PFC_END`, as shown above) or `sub` from the array location, so we require zero init since otherwise the garbage will be picked up. In principle, though, each array element simply ends up holding the current counter value, so the zeroing isn't necessary: we could have a new `PFC_` macro which just `mov`s in the absolute value.
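As a rough illustration of why the zeroing is redundant, here is a minimal C++ sketch comparing the two approaches. Everything in it is an assumption made for illustration: `read_counter` is a hypothetical stub standing in for the `rdpmc` sequence, `PFC_CNT` is taken to be a 64-bit integer, and `TOTAL_COUNTERS` is set to 7 only because the listing above zeroes 7 quadwords. It is not the libpfc macro code itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed stand-ins for the real types; the actual definitions may differ.
using PFC_CNT = std::uint64_t;
constexpr std::size_t TOTAL_COUNTERS = 7;  // assumption, matches `mov ecx,0x7`

struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
};

// Hypothetical stub for a counter read; the real code would execute rdpmc.
inline PFC_CNT read_counter(std::size_t i) {
    return 1000 + 10 * i;
}

// Current shape: zero the array (the source of the rep stos), then add.
inline LibpfcNow now_accumulate() {
    LibpfcNow now = {};
    for (std::size_t i = 0; i < TOTAL_COUNTERS; ++i) {
        now.cnt[i] += read_counter(i);
    }
    return now;
}

// Proposed shape: skip the zeroing and store the absolute value directly.
inline LibpfcNow now_store() {
    LibpfcNow now;  // deliberately not zero-initialized
    for (std::size_t i = 0; i < TOTAL_COUNTERS; ++i) {
        now.cnt[i] = read_counter(i);  // every slot is overwritten before use
    }
    return now;
}

// Sanity check: both shapes produce the same counter snapshot.
inline void check_equivalence() {
    const LibpfcNow a = now_accumulate();
    const LibpfcNow b = now_store();
    for (std::size_t i = 0; i < TOTAL_COUNTERS; ++i) {
        assert(a.cnt[i] == b.cnt[i]);
    }
}
```

The store variant is only safe if every element that is later read has been written, which is why the sketch overwrites all `TOTAL_COUNTERS` slots.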
In principle, the effect is cancelled out by the use of `dummy_bench` (or any other bench), but it would still be nice to eliminate all unnecessary code in the benchmarked region, especially `rep` instructions and those which modify memory.
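The cancellation argument amounts to a baseline subtraction, sketched below in C++. The function name `subtract_baseline` and the idea of clamping at zero are assumptions for illustration, not the actual uarch-bench code, and the sketch assumes the fixed overhead (including the `rep stos`) costs the same in both the real and the dummy measurement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: remove the fixed measurement overhead, estimated by
// timing an empty (dummy) benchmark, from a raw benchmark reading.
// Assumes raw = overhead + cost, with the same overhead in both readings.
constexpr std::uint64_t subtract_baseline(std::uint64_t bench_raw,
                                          std::uint64_t dummy_raw) {
    // Clamp at zero so measurement noise cannot produce a wrapped-around value.
    return bench_raw >= dummy_raw ? bench_raw - dummy_raw : 0;
}

// Self-check with made-up numbers.
inline void check_baseline() {
    assert(subtract_baseline(150, 50) == 100);
    assert(subtract_baseline(40, 50) == 0);
}
```

If the overhead varies from run to run, the subtraction no longer cancels it exactly, which is one more reason to keep the timed region as small as possible.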