
Curious if you've tried the msr kernel module

Open travisdowns opened this issue 8 years ago • 9 comments

The kernel already includes an msr module, which lets you read and write MSRs from userspace (permission dependent) at, for example, /dev/cpu/0/msr. I'm wondering if you've tried it and whether it is much slower than your kernel module. If not, it might make sense to use that widely used module, or perhaps to offer an option to fall back to it (since many people may not want to, or may not be allowed to, load a random third-party module, but may be able to load msr).
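For reference, the stock msr driver exposes each CPU's registers as a file in which the file offset is the MSR address, so one register read is one 8-byte pread. Below is a minimal sketch, assuming the msr module is loaded and the process is allowed to open the device. The helper names `msr_device_path` and `read_msr` are hypothetical and are not part of libpfc.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build "/dev/cpu/<cpu>/msr" into buf; returns snprintf's result. */
static int msr_device_path(int cpu, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Read one 64-bit MSR on the given CPU.
 * Returns 0 on success, -1 on any failure (module not loaded,
 * no permission, nonexistent CPU, short read). */
static int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    int n = msr_device_path(cpu, path, sizeof path);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The msr driver interprets the file offset as the MSR address. */
    ssize_t got = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return got == (ssize_t)sizeof *value ? 0 : -1;
}
```

For example, `read_msr(0, 0x38F, &v)` would attempt to read IA32_PERF_GLOBAL_CTRL on CPU 0. Every such read is a system call, which is worth keeping in mind when comparing it against a custom module.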

On the other hand, you might need more than MSR access to set up the counters? Not sure...

travisdowns avatar Jan 31 '17 22:01 travisdowns

@travisdowns I have not really tried to access the MSRs using the msr LKM. I suppose I could. I had written my own LKM in the expectation of setting everything up in the kernel myself, but as one can see, there are a few things I've not discovered how to disable from within the kernel using its APIs. For those, I print the commands to execute into the kernel log, accessible with dmesg:

modprobe -ar iTCO_wdt iTCO_vendor_support
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc

I'm not sure if those things can be done with the msr LKM. The first two have to do with disabling the watchdog timers, because they use one of the fixed-function performance counters as their precision interrupt generator. The last one is about enabling rdpmc in userspace (which my constant-time inline-assembler macros depend on); it might be doable, and perhaps is done, by accessing an MSR.

obilaniu avatar Jan 31 '17 22:01 obilaniu

Yes, I think only the last of those can be done with the MSR. What I was alluding to, though, was using the MSR to do the reads of the performance monitors, but of course I had forgotten that this doesn't happen through any MSR but rather through rdpmc.

travisdowns avatar Feb 02 '17 22:02 travisdowns

Following up on "similar ways to do this", I just ran across Andi Kleen's jevents. It uses the perf infrastructure to do self-monitoring, and I think it even allows user-mode monitoring (i.e., a direct rdpmc call), since perf_events has a "perf userpage" feature to allow this. I still think it has considerably more overhead than this library, though, because of the way the userpage stuff works.

travisdowns avatar Jul 03 '17 07:07 travisdowns

@travisdowns On my part I tried to see whether or not it is possible to enable RDPMC reads other than by the perf_events API or by a superuser-priv'ed echo 2 > /sys/bus/event_source/devices/cpu/rdpmc, and it appears that the answer is no. There is a static_key that I must somehow increment that the kernel does not export as a symbol, and thus no one has access to it.

While doing so I happened upon the implementation of the rdpmc userpage. Getting an RDPMC read through it involves a huge overhead. It's not competitive at all with userland RDPMC.

obilaniu avatar Jul 03 '17 07:07 obilaniu

To be clear, the perf_events userpage implementation mostly just involves a userland rdpmc. So that we are on the same page, I'm talking about the code in the cap_user_rdpmc section in this man page:


                  u32 seq, time_mult, time_shift, idx, width;
                  u64 count, enabled, running;
                  u64 cyc, time_offset;

                  do {
                      seq = pc->lock;
                      barrier();
                      enabled = pc->time_enabled;
                      running = pc->time_running;

                      if (pc->cap_usr_time && enabled != running) {
                          cyc = rdtsc();
                          time_offset = pc->time_offset;
                          time_mult   = pc->time_mult;
                          time_shift  = pc->time_shift;
                      }

                      idx = pc->index;
                      count = pc->offset;

                      if (pc->cap_usr_rdpmc && idx) {
                          width = pc->pmc_width;
                          count += rdpmc(idx - 1);
                      }

                      barrier();
                  } while (pc->lock != seq);

As far as I can see, the most expensive thing there is the rdpmc instruction itself. Much of the rest is even optional, and the barrier() is only a compiler barrier on x86 (I think - since we don't have load/load reordering on x86).
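The man page excerpt leaves `rdpmc()` and `barrier()` undefined, and it reads `width` without using it. Below is a sketch of those pieces, assuming GCC or Clang on x86. The comment on `pmc_width` in the kernel's perf_event.h describes sign-extending the raw rdpmc value to that width before adding it to `count`; `sign_extend_pmc` is a hypothetical name for that step, not something the man page or libpfc defines.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier: prevents the compiler from reordering
 * memory accesses across it, but emits no instruction. */
#define barrier() __asm__ volatile("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* RDPMC takes the counter index in ECX and returns EDX:EAX.
 * It faults in userspace if CR4.PCE is clear. */
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Sign-extend a raw counter value that is `width` bits wide. */
static inline int64_t sign_extend_pmc(uint64_t raw, unsigned width)
{
    assert(width >= 1 && width <= 64);
    if (width == 64)
        return (int64_t)raw;

    uint64_t mask = (UINT64_C(1) << width) - 1;
    uint64_t sign = UINT64_C(1) << (width - 1);
    raw &= mask;
    /* Flipping and then subtracting the sign bit propagates it upward. */
    return (int64_t)((raw ^ sign) - sign);
}
```

With these, the read inside the loop would become something like `count += sign_extend_pmc(rdpmc(idx - 1), width);`.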

travisdowns avatar Jul 03 '17 17:07 travisdowns

"On my part I tried to see whether or not it is possible to enable RDPMC reads other than by the perf_events API or by a superuser-priv'ed echo 2 > /sys/bus/event_source/devices/cpu/rdpmc, and it appears that the answer is no."

Yeah, I don't think there will be. We are even lucky we have the echo 2 > ... approach at all, I think. It seems like rdpmc access was locked down to prevent timing-based attacks, especially by "guest" processes in a VM or some type of security sandbox.

FWIW, I spent a long time debugging a crash in pfcdemo on the first rdpmc call, and it turned out to be because /sys/bus/event_source/devices/cpu/rdpmc was still set to 1. I had set it to 2 but had apparently rebooted in the interim, which reset it. It might make sense to have pfcInit() or whatever check the value (or perhaps check the CR4 bit directly, but that would require pfckmod support) and fail if rdpmc is not enabled.
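A check along those lines could be sketched as follows. This is only an illustration of the idea, not libpfc's actual code: the function names are hypothetical, and it assumes the sysfs file holds a single decimal number where 2 means rdpmc is always allowed from userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define RDPMC_SYSFS_PATH "/sys/bus/event_source/devices/cpu/rdpmc"

/* Parse the decimal mode from the file's text.
 * Returns 0 on success, -1 if the text is not a number. */
static int parse_rdpmc_mode(const char *text, long *mode)
{
    assert(text != NULL && mode != NULL);
    char *end;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || errno != 0)
        return -1;
    *mode = v;
    return 0;
}

/* Read and parse the mode from `path`. Returns 0 on success, -1 on failure. */
static int read_rdpmc_mode(const char *path, long *mode)
{
    char line[32];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? parse_rdpmc_mode(line, mode) : -1;
}

/* Returns 1 if unrestricted userspace rdpmc (mode 2) is configured,
 * 0 otherwise, including when the file cannot be read. */
static int rdpmc_always_allowed(const char *path)
{
    long mode;
    return read_rdpmc_mode(path, &mode) == 0 && mode == 2;
}
```

An init routine could call `rdpmc_always_allowed(RDPMC_SYSFS_PATH)` and return an error instead of letting the first rdpmc fault.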

travisdowns avatar Jul 03 '17 17:07 travisdowns

@travisdowns I just made a push to the repo that does two things...

  • First, it rejects the module load if CR4.PCE is not set (and complains that one should echo 2 > /sys/bus/event_source/devices/cpu/rdpmc). I can't do it myself from within the kernel because no exported symbol in the Linux kernel would allow me to do that.
  • Second, it exposes the counter masks, allowing libpfc to defend against counter wraparound should it happen. I wasn't exactly correct when I said that #9 had no purpose, but now it definitely doesn't because I mask out the bits.
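To illustrate why a counter mask helps with wraparound: if a counter is only `width` bits wide, the difference of two raw readings taken modulo 2^width is still correct after a single wrap. The thread does not show how libpfc applies its masks, so the sketch below is just that arithmetic, with hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set (e.g. width 48 -> 0xFFFFFFFFFFFF). */
static inline uint64_t counter_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/* Events elapsed between two raw readings of the same counter.
 * Unsigned subtraction wraps modulo 2^64; masking reduces the result
 * modulo 2^width, so one wrap between the readings is handled. */
static inline uint64_t counter_delta(uint64_t start, uint64_t end, uint64_t mask)
{
    return (end - start) & mask;
}
```

This only holds if the counter wrapped at most once between the two readings.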

obilaniu avatar Aug 02 '17 21:08 obilaniu

Great! About the first point, I have a change that exposes the CR4.PCE bit in /sys/module/pfc, so userspace libpfc can check it on init and bail out if it isn't set (rather than getting a SIGSEGV later). I think this is complementary to your load-time check, since it helps in the case where the bit gets unset later.
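On the userspace side, the test itself is a single bit: PCE is bit 8 of CR4. The exact attribute name and format under sysfs are not given in this thread, so the sketch below assumes, purely for illustration, that the attribute's text is the CR4 value in hexadecimal. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CR4_PCE_BIT 8 /* Performance-Monitoring Counter Enable */

/* Returns 1 if CR4.PCE is set in the given CR4 value, else 0. */
static inline int cr4_pce_enabled(uint64_t cr4)
{
    return (int)((cr4 >> CR4_PCE_BIT) & 1);
}

/* Assumed format: hexadecimal text, with or without a "0x" prefix.
 * Returns 1 or 0 for the PCE bit, or -1 if the text is not a hex number. */
static int cr4_text_pce_enabled(const char *text)
{
    assert(text != NULL);
    char *end;
    unsigned long long v = strtoull(text, &end, 16);
    if (end == text)
        return -1;
    return cr4_pce_enabled(v);
}
```

A library init function could read the attribute, pass its contents to `cr4_text_pce_enabled`, and refuse to continue on anything other than 1.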

travisdowns avatar Aug 02 '17 23:08 travisdowns

@travisdowns I'll merge it if you offer it. Make sure to add yourself to AUTHORS.md.

obilaniu avatar Aug 02 '17 23:08 obilaniu