Using the library with much lighter overhead

Open gshanemiller opened this issue 2 years ago • 3 comments

Consider this example, which I added to a fork of this repository. It is basically a redo of the supplied c_example.c: https://github.com/rodgarrison/pcm/blob/master/examples1/example1.cpp

The nub of the code:

  PCM.pcm_c_init();
  PCM.pcm_c_start();

  // No memory accessed: no LLC refs or misses
  unsigned s=0;
  for (unsigned i=0; i<10000000; i++) {
    s+=1;
  }
 
 PCM.pcm_c_stop();

run with:

# the last two events count LLC references and LLC misses
./example1 umask=0x00,event=0x3c umask=0x00,event=0xc0 umask=0x4f,event=0x2e umask=0x41,event=0x2e

Giving this output:

test1: lcore: 0 cpu-cycles: 220075 instructions: 156010, instructions/cycle: 0.71, counter0: 220075, counter1: 156010, counter2: 16334, counter3: 720

While there will be sporadic LLC hits/misses, the counts counter2: 16334 and counter3: 720 are insanely high. My interpretation is that most of this, which probably also leaks into the instruction and cycle counts, is overhead of the API, which collects loads and loads of data. All that noise makes the reported stats hard to interpret.

Is there an example using this library that is way, way lighter in overhead? For this kind of micro-benchmarking with the PMU we usually want:

  • 1 pinned thread on 1 core only
  • 1-4 counters like cycles, retired instructions, and LLC hit/misses

gshanemiller avatar Apr 30 '22 19:04 gshanemiller

The instruction count (156,010) does not match the order of magnitude of the iteration count (10,000,000). Perhaps the thread was migrated? Could you try pinning the thread using https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
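
For reference, a minimal sketch of pinning the calling thread on Linux with pthread_setaffinity_np. The helper names `pin_current_thread_to_core` and `current_core` are hypothetical and not part of pcm. Both calls are GNU extensions, which g++ enables by default.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper (not part of pcm): pin the calling thread to one
// logical core. Returns true on success.
bool pin_current_thread_to_core(int core) {
    if (core < 0 || core >= CPU_SETSIZE) {
        return false;  // outside the range a cpu_set_t can describe
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Logical core the calling thread is currently running on (-1 on error).
int current_core() {
    return sched_getcpu();
}
```

Call `pin_current_thread_to_core` before starting the counters, so that setup, the measured region, and the final read all happen on the same core.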

opcm avatar May 04 '22 08:05 opcm

Thank you for taking time to respond. Intel PMUs are a great feature btw.

My error - pinning the thread is required as you rightly point out.

Corrected run:

# run with hyper-threading disabled

# vendor_id	: GenuineIntel
# cpu family	: 6
# model		: 158
# model name	: Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz [skylake/kaby/coffee]
# stepping	: 13
# microcode	: 0xea
# cpu MHz		: 3400.000
# cache size	: 16384 KB
# cpuid level	: 22

./example1 umask=0x00,event=0x3c umask=0x00,event=0xc0 umask=0x4f,event=0x2e umask=0x41,event=0x2e

=====  Processor information  =====
Linux arch_perfmon flag  : yes
Hybrid processor         : no
IBRS and IBPB supported  : yes
STIBP supported          : yes
Spec arch caps supported : yes
Max CPUID level          : 22
IBRS enabled in the kernel   : yes
STIBP enabled in the kernel  : no
The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
INFO: Linux perf interface to program uncore PMUs is present
building core event 'umask=0x00,event=0x3c' counter 0
building core event 'umask=0x00,event=0xc0' counter 1
building core event 'umask=0x4f,event=0x2e' counter 2
building core event 'umask=0x41,event=0x2e' counter 3
 Closed perf event handles
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
test1: lcore: 5 cpu-cycles: 508199958 instructions: 700409996, instructions/cycle: 1.38, counter0: 508199896, counter1: 700409996, counter2: 24300, counter3: 1003 s=100000000
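
A quick way to sanity-check a run like this is to relate the counters to the known iteration count. The sketch below is illustrative only: `CounterSample` and the helper names are hypothetical, and the plausibility rule assumes the compiler did not optimize the loop away.

```cpp
#include <cstdint>

// Hypothetical container for one measurement (not a pcm type).
struct CounterSample {
    std::uint64_t cycles;
    std::uint64_t instructions;
};

double instructions_per_cycle(const CounterSample& s) {
    return static_cast<double>(s.instructions) / static_cast<double>(s.cycles);
}

double instructions_per_iteration(const CounterSample& s, std::uint64_t iterations) {
    return static_cast<double>(s.instructions) / static_cast<double>(iterations);
}

// A loop that really executes N iterations retires at least N instructions,
// so a smaller count suggests that part of the work was not measured.
bool plausible(const CounterSample& s, std::uint64_t iterations) {
    return s.instructions >= iterations;
}
```

With the numbers from this thread, the first run (156010 instructions for 10000000 iterations) fails the check, while the pinned run (700409996 instructions for 100000000 iterations) passes at roughly 7 instructions per iteration.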

Now contrast counters 2 and 3 (LLC references and misses) with this alternative run on the same machine, also on a pinned lcore. The only difference from pcm is that the counters are read with rdpmc (not perf events), and I believe the setup overhead is lower:

test4: RAW VALUES
-------------------------------------------------------------------
test4:                : iterations run                : 100000000
test4: fixed counter 0: retired instructions          : 700000187
test4: fixed counter 1: no-halt CPU cycles            : 543990093
test4: fixed counter 2: ref no-halt CPU cycles        : 371684148
test4: prog  counter 0: LLC references                : 20
test4: prog  counter 1: LLC misses                    : 4
test4: prog  counter 2: brch instrct retired          : 100000069
test4: prog  counter 3: brch instrct not-taken retired: 3
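
For readers unfamiliar with rdpmc, here is a rough sketch of such a read. It is not the code that produced the output above, and the names `fixed_counter` and `read_pmc` are hypothetical. It assumes the counters are already programmed and that the kernel permits user-space rdpmc; otherwise the instruction faults.

```cpp
#include <cstdint>

// RDPMC takes the counter selector in ECX. Setting bit 30 selects the
// fixed-function counters (0: instructions retired, 1: core cycles,
// 2: reference cycles); without it the value indexes a programmable counter.
constexpr std::uint32_t kFixedCounterFlag = 1u << 30;

constexpr std::uint32_t fixed_counter(std::uint32_t n) {
    return kFixedCounterFlag | n;
}

#if defined(__x86_64__) || defined(__i386__)
// Read one counter. Only safe when user-space RDPMC is enabled.
inline std::uint64_t read_pmc(std::uint32_t selector) {
    std::uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif
```

A measurement then amounts to `read_pmc(...)` before and after the region of interest and taking the difference, which keeps very little code between the two reads.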

I'm guessing the guidance for PCM is to run a no-op to measure the overhead pcm introduces, then run the code under test a bunch of times and subtract out (or average out) that overhead?
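
The subtraction idea could look roughly like this. It is only a sketch of the approach described in the question, not guidance from the pcm maintainers; `median_of` and `subtract_baseline` are hypothetical names, and the baseline samples are assumed to come from repeated start/stop pairs around an empty region.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Median (upper median for even sizes) of the samples; 0 if there are none.
std::uint64_t median_of(std::vector<std::uint64_t> samples) {
    if (samples.empty()) {
        return 0;
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Subtract the typical overhead, measured on an empty region, from a raw
// counter value. The result is clamped at zero because noise can push a
// single measurement below the baseline.
std::uint64_t subtract_baseline(std::uint64_t measured,
                                std::vector<std::uint64_t> baseline_samples) {
    const std::uint64_t baseline = median_of(std::move(baseline_samples));
    return measured > baseline ? measured - baseline : 0;
}
```

The median is used rather than the mean so that a single outlier in the baseline runs, for example one caused by an interrupt, does not skew the correction.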

gshanemiller avatar May 07 '22 17:05 gshanemiller

I think the difference is noise-level here. E.g., for instructions retired it is 0.06%.
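
The 0.06% figure follows from the two instruction counts above (700409996 with pcm, 700000187 with rdpmc). A small sketch of the arithmetic, with `relative_difference_percent` as a hypothetical helper name:

```cpp
#include <cmath>

// Relative difference of `measured` against `reference`, in percent.
// Assumes `reference` is non-zero.
double relative_difference_percent(double measured, double reference) {
    return std::fabs(measured - reference) / reference * 100.0;
}
```

For the instruction counts this is 409809 / 700000187, or about 0.0585%.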

opcm avatar Jun 21 '22 07:06 opcm