uarch-bench

`rep stos` appearing in benchmarked region

Open travisdowns opened this issue 7 years ago • 0 comments

If you look at the core region of the innermost method of a benchmark in the libpfc case, you find a rep stos instruction inside the timed region:

  40792a:       shl    rdx,0x20
  40792e:       or     rdx,rax
  407931:       add    QWORD PTR [rbp+0x28],rdx
  407935:       mov    rcx,0x3
  40793c:       rdpmc  
  40793e:       shl    rdx,0x20
  407942:       or     rdx,rax
  407945:       add    QWORD PTR [rbp+0x30],rdx
  407949:       lfence 
  40794c:       mov    rdi,QWORD PTR [rsp]
  407950:       mov    rsi,QWORD PTR [rsp+0x8]
  407955:       call   47f680 <dep_add_rax_rax>
  40795a:       mov    rdi,rbx
  40795d:       mov    rax,r12
  407960:       mov    ecx,0x7
  407965:       rep stos QWORD PTR es:[rdi],rax    <<< this guy
  407968:       lfence 
  40796b:       mov    rcx,0x40000000
  407972:       rdpmc  
  407974:       shl    rdx,0x20
  407978:       or     rdx,rax
  40797b:       add    QWORD PTR [rbx],rdx
  40797e:       mov    rcx,0x40000001
  407985:       rdpmc  
  407987:       shl    rdx,0x20
  40798b:       or     rdx,rax
  40798e:       add    QWORD PTR [rbx+0x8],rdx
  407992:       mov    rcx,0x40000002
  407999:       rdpmc  
  40799b:       shl    rdx,0x20

The code before and after issues rdpmc to read the performance counters, and the actual timed call is dep_add_rax_rax, but the presence of the rep stos is unfortunate, since it's slow, invokes microcode and so on. It's there because of:

struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
    ...

and

static now_t now() {
        LibpfcNow now = {};

which zero-initializes the counter array. The existing macros either add to (PFC_END, as shown above) or sub from the array location, so we require zero init, since otherwise whatever garbage is in the array would be picked up. In principle, though, the array is just replaced with the current value, so this isn't necessary: we could have a new PFC_ macro which just movs in the absolute value.
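A minimal sketch of the difference, not the actual libpfc macros: `read_counter_stub` is a hypothetical stand-in for a single rdpmc read so the example runs without the libpfc kernel module, and the `TOTAL_COUNTERS` value and `PFC_CNT` type are assumptions (the count of 7 is taken from the `mov ecx,0x7` in the listing above). It shows why the add-style read needs a zeroed array while a store-style read does not.

```cpp
#include <cstddef>
#include <cstdint>

// Assumptions for illustration only: the real definitions live in libpfc / uarch-bench.
constexpr std::size_t TOTAL_COUNTERS = 7;  // matches the `mov ecx,0x7` count in the listing
using PFC_CNT = std::int64_t;              // assumed 64-bit counter type

struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
};

// Hypothetical stand-in for one rdpmc read, so the sketch runs anywhere.
inline PFC_CNT read_counter_stub(std::size_t idx) {
    return static_cast<PFC_CNT>(1000 + idx);
}

// Existing style: the read is added into the slot, so the slot must start at zero.
inline LibpfcNow now_accumulate() {
    LibpfcNow now = {};  // this zero-init is what can become `rep stos`
    for (std::size_t i = 0; i < TOTAL_COUNTERS; ++i) {
        now.cnt[i] += read_counter_stub(i);
    }
    return now;
}

// Proposed style: the read overwrites the slot, so no zero-init is needed.
inline LibpfcNow now_store() {
    LibpfcNow now;  // deliberately left uninitialized; every slot is written below
    for (std::size_t i = 0; i < TOTAL_COUNTERS; ++i) {
        now.cnt[i] = read_counter_stub(i);
    }
    return now;
}
```

Both functions return the same values; only the store-style version avoids touching the array before the counters are read.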

In principle, the effect is cancelled out by the use of dummy_bench (or any other bench), but it would still be nice to eliminate all unnecessary code in the benchmarked region, especially rep instructions and those which modify memory.
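To illustrate the cancellation, here is a rough sketch of baseline subtraction. How uarch-bench actually applies dummy_bench is not shown in this issue, so `CounterDelta`, `kCounters` and `subtract_baseline` are hypothetical names. The assumption is that the dummy run executes the same measurement code around an empty body, so fixed overhead such as the rep stos shows up in both sets of deltas.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration; not taken from uarch-bench.
constexpr std::size_t kCounters = 7;

struct CounterDelta {
    std::int64_t cnt[kCounters];  // end-minus-start value for each counter
};

// Subtract the deltas of an empty (dummy) benchmark from those of the real one.
inline CounterDelta subtract_baseline(const CounterDelta& measured,
                                      const CounterDelta& baseline) {
    CounterDelta out;
    for (std::size_t i = 0; i < kCounters; ++i) {
        out.cnt[i] = measured.cnt[i] - baseline.cnt[i];
    }
    return out;
}
```

The subtraction only removes the overhead to the extent that it costs the same in both runs, which is one more reason to keep the timed region as small as possible.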

travisdowns, Feb 19 '18 22:02