likwid icon indicating copy to clipboard operation
likwid copied to clipboard

Unreasonably high values (not deterministic)

Open TomTheBear opened this issue 4 years ago • 28 comments

See https://groups.google.com/forum/#!topic/likwid-users/m1ElsBTerfk

Thanks @rileyweber13 for reporting

TomTheBear avatar May 28 '20 15:05 TomTheBear

Hey Thomas!

In case we have more problems with the google group and for the convenience of people reading this for the first time, I've included links to the code I'm using to demonstrate this issue. I use the c code likwid_minimal.c and the bash script likwid_minimal_repeated.sh to count the number of failures in a given number of tests.

Anyways, I realized while working on this issue that I had some leftover files in /usr/local that I think were getting picked up at runtime, and I think that was muddying the water. I removed those, ran make clean and make distclean, and then installed likwid to /usr/local/likwid-master. I did the same with version 4.3.4, but installed it to /usr/local/likwid-v4.3.4/. I then tested each version with a single group and with multiple groups. Each case was tested both with hyperthreading enabled and with hyperthreading disabled. Since my machine does not provide the option to disable hyperthreading in the bios, I disabled it by using LIKWID_THREADS to select cores 0,1 (which represent full cores and not hyperthreads in my case) and then set the number of openMP threads to be 2. Here are my results:

  • Version 4.3.4. Compiled with g++ likwid_minimal.c -L/usr/local/likwid-v4.3.4/lib -I/usr/local/likwid-v4.3.4/include -llikwid -mtune=native -fopenmp -o likwid_minimal and ran with LD_LIBRARY_PATH=/usr/local/likwid-v4.3.4/lib PATH=/usr/local/likwid-v4.3.4/sbin:$PATH ./likwid_minimal_repeated.sh
    • No hyperthreading, specifying one group (L2): 0 failures out of 100 tests
    • With hyperthreading, specifying one group (L2): 0 failures out of 200 tests
      • there was one case where I received the error "WARN: Skipping region (null) for evaluation"
    • No hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 3 failures out of 200 tests
    • With hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 12 failures out of 100 tests
      • full output available here: https://pastebin.com/qcM34Rv6
  • next, likwid master branch commit 99b0d23927f5e65cfa4eb5aeac1c57504395694b , compiled with g++ likwid_minimal.c -L/usr/local/likwid-master/lib -I/usr/local/likwid-master/include -llikwid -mtune=native -fopenmp -o likwid_minimal and ran with LD_LIBRARY_PATH=/usr/local/likwid-master/lib PATH=/usr/local/likwid-master/sbin:$PATH ./likwid_minimal_repeated.sh. Results:
    • No hyperthreading, specifying one group (L2): 0 failures out of 100 tests
    • With hyperthreading, specifying one group (L2): 0 failures out of 100 tests
    • No hyperthreading, specifying multiple group (L2|L3|FLOPS_SP|FLOPS_DP): 1 failure out of 100 tests
    • With hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP):5 failures out of 100 tests
      • full output: https://pastebin.com/X5MwUVUq

Hope this helps anyone who looks at this issue. It seems to me that the problem manifests when specifying multiple groups but is exacerbated by hyperthreading.

I've started investigating the likwid code, and with some guidance I might be able to put together a pull request. I'm currently looking at perfmon_setupCounters(), perfmon_startCounters(), and perfmon_stopCounters(). Do you think this is a good place to start? Or is there somewhere else I should look first?

paigeweber13 avatar Jun 03 '20 20:06 paigeweber13

Hi Riley,

thanks for switching to Github as the Google group deletes our posts (marked as spam).

Looking at your results, I would say that there are two problems:

  • The first minor problem is the WARN: Skipping region (null) for evaluation: It missed storing the string, the region was not executed or similar. No clue about the exact reason because I havn't seen this a long time or never but I realized that both of your loops (c = a * b + c and copy()) don't have a don't-optimize-me-away guard and neither c nor the result of copy() is used afterwards.
  • The second problems are the high values. For debugging, I would recommend to use the "developer" verbosity perfmon_setVerbosity(3) and check for the string OVERFLOW in the output if high values occur. This would give us an hint whether overflows are actually involved. For investigation, you should look at the architecture-specific code (src/includes/perfmon_<arch>.h, mainly the readCounters<arch>() function which is used in likwid_marker{Start/Stop}Region) and at calculateMarkerResult() in src/libperfctr.c (for non-MarkerAPI runs, it uses calculateResult() in src/perfmon.c which is comparable). My guess would be that in __perfmon_switchActiveGroupThread() does not reset some variables or similar that cause the problems after the first switch.

Thanks for your interest to investigate this issue.

TomTheBear avatar Jun 05 '20 11:06 TomTheBear

Thanks for the direction, I'll work on this and let you know if I have questions or progress

paigeweber13 avatar Jun 08 '20 18:06 paigeweber13

Hey Thomas, I feel like I'm in over my head trying to resolve this issue. I've looked at the code for several hours this week, but I think fixing this bug requires a greater familiarity with the codebase than I have time to develop. Moving forward, I'm happy to help provide any information I can, but I won't work on a pull request any further. Following are some discoveries I made, I hope it's helpful!

Something odd I noticed while working on this issue is that when I redirect the output of likwid_minimal to a local file (e.g. ./likwid_minimal > output.txt I consistently get many "stopping non-started region" warnings. Even more strangely, these errors go away if I redirect output to /tmp/tmp.txt which is on the same disk but a separate partition. Not sure if that's helpful, but I thought I'd mention it.

What's the "don't optimize me away" guard you mentioned? I tried a couple tricks to prevent optimization: first I tried printing the final value of c and printing an element of copy_array. Next, I tried making both volatile. Finally, I compiled the code with the flag -O0 to prevent all optimizations. None of these attempts seemed to have an affect on the appearance of the WARN: Skipping region (null) for evaluation.

Let me know if you'd like any more information about my system or usage

paigeweber13 avatar Jun 12 '20 19:06 paigeweber13

Hi Riley, that's unfortunate but understandable. For me, the whole code looks familiar and I know where to look if anything fails. Nevertheless, as soon as I worked on the topic, I would like to have a short telco/videoconf with you to see whether I have missed anything from your comments (if that's ok for you).

I cannot see how file redirect could influence the runtime but I'll check with your script.

Compilers like to optimize not required stuff away. With printing c or and an element of copy_array you are on the safe side because the code regions need to be executed to know the results at printing. I was just coming up with this because your example code didn't contain the "guards" anymore and thus the compiler could decide to optimize them and maybe even a call to LIKWID_MARKER* away.

Best, Thomas

TomTheBear avatar Jun 15 '20 06:06 TomTheBear

I'm happy to help with a tele/video conference. Perhaps it'd be best to email me to organize a time? My email is [email protected]

paigeweber13 avatar Jun 16 '20 15:06 paigeweber13

I noticed something today that might be a workaround. I added #pragma omp barrier immediately before the call to likwid_markerNextGroup() and after doing so I haven't seen a single case of the unreasonably high values bug over 200 runs of likwid_minimal.

This change was made at this commit. The new file that seems to be working fine can be viewed here.

paigeweber13 avatar Jun 18 '20 19:06 paigeweber13

Now I realizes what the problem was/is. LIKWID_MARKER_SWITCH = likwid_markerNextGroup() = perfmon_switchActiveGroup(new_gid) should be called in a serial region and not in parallel. Therefore, it is probably more suitable to use #pragma omp single or #pragma omp master.

TomTheBear avatar Jun 19 '20 07:06 TomTheBear

Hey Thomas! Sorry for the tardiness, seems like my email notifications aren't working.

I will begin to use likwid_markerNextGroup() in sequential regions only. However, it seems that placing #pragma omp barrier immediately before switching groups is still necessary, as removing it caused an extreme number of problems. Every time I ran the minimal example without the barrier, I had both the "stopping non-started region" error and unreasonably high values.

In any case, it works perfectly when the barrier is in place and the group is switched in a serial region. I've updated my minimal example again: see this commit.

Also, I wanted to let you know that the Doxygen documentation for the marker API says that likwid_markerNextGroup should be called in a parallel region. I will make a pull request tomorrow to update that.

paigeweber13 avatar Jun 29 '20 23:06 paigeweber13

I think I always used OpenMP with its implicit barriers, so putting a barrier in front seems like a good hint. Please add the hint to the Doxygen input in your PR.

TomTheBear avatar Jun 30 '20 16:06 TomTheBear

I'm not 100% sure whether we should close this issue already. I think there is still an issue in some situations.

TomTheBear avatar Jul 01 '20 12:07 TomTheBear

Oh, okay! My mistake. I'll let you know if I encounter this issue again.

paigeweber13 avatar Jul 01 '20 16:07 paigeweber13

Well Thomas, it didn't take me long before I found this issue again :joy: I have been working on an updated example for C-internalMarkerAPI.c. You can see it here. Just like the old example, there are three regions, and two of them are nested within the third. The region that contains the other two is called "Total". When I compile and run this program (using the makefile in ./examples) the region "Total" only measures the first group and every counter reports 18446740000000000000.000000 for its value.

  • Are nested regions supported?
  • Is the usage otherwise wrong in the file I linked above?
  • Or, is this another instance of this bug?

I'm sure you're busy but I appreciate any help you can provide.

paigeweber13 avatar Jul 02 '20 22:07 paigeweber13

If I put all three regions inside a single for loop the problem goes away. Here's updated code. However, this is significantly different from the original example code; is this still the intended usage?

paigeweber13 avatar Jul 02 '20 23:07 paigeweber13

Hi Riley,

the problem is that during the Total region the event set changes due to LIKWID_MARKER_SWITCH near the inner regions. Neither START nor STOP actually starts/stops the hardware counters. The hardware counters are started in INIT, START and STOP only read the counters and store/process the counts. So, the START(Total) stores a value and STOP(Total) reads another value. Normally the stop value should be greater than the stop value, the difference is stored. With SWITCH the counters are actually stopped, reprogrammed, zeroed and started again. This could lead to problematic effect like the unreasonably high values as the stop count might not be greater than start. There is some overflow detection but there might be non-working edge cases.

In your updated version, the event set is not switched inside any region and therefore works. The example should show the following points:

  • Where to put the macros in general (INIT and CLOSE in serial, START and STOP in parallel region)
  • Regions can be nested (that's why three regions, one normal and two nested regions)
  • Regions can be registered beforehand (in a parallel region, possibly with a comment that a barrier is required between REGISTER(FOO) the the corresponding START(FOO). Separate OpenMP parallel region or #pragma omp barrier is suitable)
  • Region data can be retrieved with GET
  • Region data can be reset with RESET
  • Event set can be switched with SWITCH outside of any region in serial execution (and an implicit or explicit barrier should be used)
  • Comment that LIKWID_MARKER_THREADINIT is commonly not required (and deprecated but that's not documented yet). It is only required if the LIKWID pinning library failed and therefore the threads run not pinned to single hardware threads and could be migrated by the OS scheduler. I don't know of any threading runtime system that does not work at the moment.

Best, Thomas

TomTheBear avatar Jul 04 '20 00:07 TomTheBear

Hello,

i encountered a similar problem of unreasonably high values and wanted to share my findings. I use the C/C++-API and switch event sets with perfmon_switchActiveGroup() continuously. I was able to identify the problem. On rare occasions, due to a bug in my application, the time (sleep) between switchting event sets was too small (0 <= x <= 0.005 seconds) for LIKWID to handle, leading to those extreme values. I guess this would be one of those edge cases...

ntippman avatar Jul 30 '20 07:07 ntippman

Thanks for the valuable input.

So you do something like this?

#parallel
{
    #barrier
    if (thread_id == 0)
        LIKWID_MARKER_SWITCH
    else
        sleep(0 <= x <= 0.005 seconds)
}

or more like this (highly exaggerated):

i++
LIKWID_MARKER_SWITCH
j++
LIKWID_MARKER_SWITCH

I'm trying to come up with some example code for the MarkerAPI Pitfalls page.

TomTheBear avatar Jul 30 '20 09:07 TomTheBear

My code would roughly translate to

loop {
    #parallel
    {
       #pragma omp barrier
       #pragma omp master
       LIKWID_MARKER_SWITCH;
       #pragma omp barrier
       sleep(0 <= x <= 0.005 seconds) // or a very small workload
    }
}

giving me the following results (note the maximum DP [MFLOP/s] per core should be around 30):

FLOPS_DP,Runtime (RDTSC) [s],6.33566e-05,6.33566e-05,6.33566e-05,6.33566e-05
FLOPS_DP,Runtime unhalted [s],8.82689e-05,0.000733393,0.000796717,0.000389245
FLOPS_DP,DP [MFLOP/s],17.6146,0.0157837,263.398,11.3642

ntippman avatar Jul 30 '20 09:07 ntippman

It's interesting that your code doesn't return correct results. It's basically what I recommended on the Pitfalls page when you have to use LIKWID_MARKER_SWITCH in a parallel region. The barriers surrounding it should avoid other threads reading the counters during one thread is doing the switch. Do you also have region markers inside your code? Are they around the loop? Or do you run without MarkerAPI but use the switching feature to switch at desired code locations and not in a time-based manner?

TomTheBear avatar Jul 30 '20 11:07 TomTheBear

Sorry for the confusion. As mentioned i'm using the C/C++ API but converted my example in the last answer to what i believe would be the MarkerAPI/OMP equivalent. The following lines are the actual code producing the wrong results. I'm observing the unreasonably high values when sleeping less than 0.005s. The lines below are running in the background (single threaded, one instance, with LIKWID_FORCE=1) while i'm executing my real application.

perfmon_setupCounters(gids[0]);
double setTime = 5000000; //or less
for (size_t i = 0; i < gids.size(); i++) {
    if (i == 0) 
        perfmon_startCounters();
    else 
        perfmon_switchActiveGroup(gids[i]);
    
    usleep(setTime);
}
perfmon_stopCounters();

I know that such small intervals are not reasonable and precision will suffer. I accidentally used such a low setTime and was wondering about the high results.

ntippman avatar Jul 30 '20 12:07 ntippman

Thanks for the update. I overread that you use the C/C++ API and not the MarkerAPI. Sorry.

It's interesting that it has problems although everything is single-threaded. This is basically what the timeline mode does (but in Lua) but the switching time is commonly not that short (as you noted).

I just checked the code of perfmon_switchActiveGroup and there is a problem in there (it calls __perfmon_switchActiveGroupThread for each thread but this function uses calls that process all threads perfmon_stopCounters, perfmon_setupCounters and perfmon_startCounters). Consequently, we are switching for each thread all threads again and again (high overhead) but this shouldn't cause any problems like yours, the registers are just reset to zero multiple times.

The real application you are talking about is not accessing the hardware counters, right? I have to write some test code to investigate this problem further.

TomTheBear avatar Jul 30 '20 13:07 TomTheBear

The problem occurs independently from the real application being executed. LIKWID is the only one accessing the hardware counters.

You can use this Reproducer with ./likwid <Set1,Set2> <Interval> <Iterations>. The output that i posted earlier was measured with ./likwid FLOPS_DP 0.005 1. The most dramatic results occur with an Interval of 0 seconds.

ntippman avatar Jul 31 '20 07:07 ntippman

Thanks for supplying a test code. Unfortunately, it segfaults directly in perfmon_init. I'll sort it out.

TomTheBear avatar Jul 31 '20 10:07 TomTheBear

I'm sorry to hear that. I tested it on Centos7 and Ubuntu 20.04. With the included Makefile and the latest LIKWID release i did not get any errors.

ntippman avatar Jul 31 '20 11:07 ntippman

I'm not sure whether I introduced the bug in the master branch but the problem was that the thread_id -> (core_id|socket_id|numa_id|llc_id) maps were not initialized. This is done in affinity_init() in the current master (somehow a bad location, I know). So, I was able to fix the other issue with your Reprocer, now I'm looking at the high values.

TomTheBear avatar Jul 31 '20 16:07 TomTheBear

Hey Tom! I saw this issue again today and thought my findings may be useful. Running the following command causes one or both of DRAM_READS and DRAM_WRITES to report 1.845e19.

likwid-perfctr -C S0:0-3 -g MEM -m -M 1 likwid-bench -t copy_avx -i 100 -w S0:2GB:4

I wasn't sure if this was fixed, so after finding the issue I pulled the latest master and reinstalled likwid, but this did not solve the issue. I also played around with the size and number of iterations, but couldn't seem to stop this issue from occurring. I tried 10 iterations and 200 iterations with no change. I also tested a size of 200MB with no change.

Hope this case provides some insight! Thank you for your continued support.

paigeweber13 avatar Aug 25 '20 17:08 paigeweber13

Hi Riley,

that's some Intel desktop, isn't it? On Intel desktops, the memory interface is accessible only by mmap'ing /dev/mem and reading specific offsets. The counters are free-running (no start/stop), no overflow indication and only 32 bit in width. They could overflow frequently. I'll check that.

TomTheBear avatar Aug 26 '20 10:08 TomTheBear

It is an intel machine, but it's a laptop (not sure if that matters...) Is there anything I can do on my end to help avoid this issue?

paigeweber13 avatar Aug 26 '20 19:08 paigeweber13