Unreasonably high values (not deterministic)
See https://groups.google.com/forum/#!topic/likwid-users/m1ElsBTerfk
Thanks @rileyweber13 for reporting
Hey Thomas!
In case we have more problems with the Google group, and for the convenience of people reading this for the first time, I've included links to the code I'm using to demonstrate this issue. I use the C code `likwid_minimal.c` and the bash script `likwid_minimal_repeated.sh` to count the number of failures in a given number of tests.
Anyways, I realized while working on this issue that I had some leftover files in `/usr/local` that I think were getting picked up at runtime, muddying the water. I removed those, ran `make clean` and `make distclean`, and then installed likwid to `/usr/local/likwid-master`. I did the same with version 4.3.4, but installed it to `/usr/local/likwid-v4.3.4/`. I then tested each version with a single group and with multiple groups. Each case was tested both with hyperthreading enabled and with hyperthreading disabled. Since my machine does not provide the option to disable hyperthreading in the BIOS, I disabled it by using `LIKWID_THREADS` to select cores 0,1 (which represent full cores and not hyperthreads in my case) and then setting the number of OpenMP threads to 2. Here are my results:
- Version 4.3.4, compiled with `g++ likwid_minimal.c -L/usr/local/likwid-v4.3.4/lib -I/usr/local/likwid-v4.3.4/include -llikwid -mtune=native -fopenmp -o likwid_minimal` and run with `LD_LIBRARY_PATH=/usr/local/likwid-v4.3.4/lib PATH=/usr/local/likwid-v4.3.4/sbin:$PATH ./likwid_minimal_repeated.sh`. Results:
- No hyperthreading, specifying one group (L2): 0 failures out of 100 tests
- With hyperthreading, specifying one group (L2): 0 failures out of 200 tests
- there was one case where I received the error `WARN: Skipping region (null) for evaluation`
- No hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 3 failures out of 200 tests
- With hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 12 failures out of 100 tests
- full output available here: https://pastebin.com/qcM34Rv6
- Next, the likwid master branch at commit 99b0d23927f5e65cfa4eb5aeac1c57504395694b, compiled with `g++ likwid_minimal.c -L/usr/local/likwid-master/lib -I/usr/local/likwid-master/include -llikwid -mtune=native -fopenmp -o likwid_minimal` and run with `LD_LIBRARY_PATH=/usr/local/likwid-master/lib PATH=/usr/local/likwid-master/sbin:$PATH ./likwid_minimal_repeated.sh`. Results:
- No hyperthreading, specifying one group (L2): 0 failures out of 100 tests
- With hyperthreading, specifying one group (L2): 0 failures out of 100 tests
- No hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 1 failure out of 100 tests
- With hyperthreading, specifying multiple groups (L2|L3|FLOPS_SP|FLOPS_DP): 5 failures out of 100 tests
- full output: https://pastebin.com/X5MwUVUq
Hope this helps anyone who looks at this issue. It seems to me that the problem manifests when specifying multiple groups but is exacerbated by hyperthreading.
I've started investigating the likwid code, and with some guidance I might be able to put together a pull request. I'm currently looking at `perfmon_setupCounters()`, `perfmon_startCounters()`, and `perfmon_stopCounters()`. Do you think this is a good place to start? Or is there somewhere else I should look first?
Hi Riley,
thanks for switching to Github as the Google group deletes our posts (marked as spam).
Looking at your results, I would say that there are two problems:
- The first, minor problem is the `WARN: Skipping region (null) for evaluation`: it missed storing the string, the region was not executed, or similar. No clue about the exact reason, because I haven't seen this in a long time (or ever), but I noticed that both of your loops (`c = a * b + c` and `copy()`) don't have a don't-optimize-me-away guard, and neither `c` nor the result of `copy()` is used afterwards.
- The second problem is the high values. For debugging, I would recommend using the "developer" verbosity `perfmon_setVerbosity(3)` and checking for the string `OVERFLOW` in the output when high values occur. This would give us a hint whether overflows are actually involved. For investigation, you should look at the architecture-specific code (`src/includes/perfmon_<arch>.h`, mainly the `readCounters<arch>()` function, which is used in `likwid_marker{Start/Stop}Region`) and at `calculateMarkerResult()` in `src/libperfctr.c` (for non-MarkerAPI runs, it uses `calculateResult()` in `src/perfmon.c`, which is comparable). My guess would be that `__perfmon_switchActiveGroupThread()` does not reset some variables or similar, causing the problems after the first switch.
Thanks for your interest in investigating this issue.
Thanks for the direction. I'll work on this and let you know if I have questions or progress.
Hey Thomas, I feel like I'm in over my head trying to resolve this issue. I've looked at the code for several hours this week, but I think fixing this bug requires a greater familiarity with the codebase than I have time to develop. Moving forward, I'm happy to help provide any information I can, but I won't work on a pull request any further. Following are some discoveries I made, I hope it's helpful!
Something odd I noticed while working on this issue is that when I redirect the output of `likwid_minimal` to a local file (e.g. `./likwid_minimal > output.txt`), I consistently get many "stopping non-started region" warnings. Even more strangely, these errors go away if I redirect output to `/tmp/tmp.txt`, which is on the same disk but a separate partition. Not sure if that's helpful, but I thought I'd mention it.
What's the "don't optimize me away" guard you mentioned? I tried a couple of tricks to prevent optimization: first I tried printing the final value of `c` and an element of `copy_array`. Next, I tried making both `volatile`. Finally, I compiled the code with the flag `-O0` to prevent all optimizations. None of these attempts seemed to have an effect on the appearance of the `WARN: Skipping region (null) for evaluation`.
Let me know if you'd like any more information about my system or usage
Hi Riley, that's unfortunate but understandable. For me, the whole code looks familiar and I know where to look if anything fails. Nevertheless, once I start working on the topic, I'd like to have a short telco/videoconf with you to see whether I have missed anything from your comments (if that's OK with you).
I cannot see how a file redirect could influence the runtime, but I'll check with your script.
Compilers like to optimize away code whose results aren't required. With printing `c` or an element of `copy_array` you are on the safe side, because the code regions need to be executed to know the results at printing time. I only brought this up because your example code didn't contain the "guards" anymore, so the compiler could decide to optimize them, and maybe even a call to `LIKWID_MARKER_*`, away.
Best, Thomas
I'm happy to help with a tele/video conference. Perhaps it'd be best to email me to organize a time? My email is [email protected]
I noticed something today that might be a workaround. I added `#pragma omp barrier` immediately before the call to `likwid_markerNextGroup()`, and after doing so I haven't seen a single case of the unreasonably-high-values bug over 200 runs of `likwid_minimal`.
This change was made at this commit. The new file that seems to be working fine can be viewed here.
Now I realize what the problem was/is. `LIKWID_MARKER_SWITCH` = `likwid_markerNextGroup()` = `perfmon_switchActiveGroup(new_gid)` should be called in a serial region, not in parallel. Therefore, it is probably more suitable to use `#pragma omp single` or `#pragma omp master`.
Hey Thomas! Sorry for the tardiness; it seems my email notifications aren't working.
I will begin to use `likwid_markerNextGroup()` in sequential regions only. However, it seems that placing `#pragma omp barrier` immediately before switching groups is still necessary, as removing it caused an extreme number of problems. Every time I ran the minimal example without the barrier, I had both the "stopping non-started region" error and unreasonably high values.
In any case, it works perfectly when the barrier is in place and the group is switched in a serial region. I've updated my minimal example again: see this commit.
Also, I wanted to let you know that the Doxygen documentation for the marker API says that `likwid_markerNextGroup` should be called in a parallel region. I will make a pull request tomorrow to update that.
I think I always used OpenMP with its implicit barriers, so putting a barrier in front seems like a good hint. Please add the hint to the Doxygen input in your PR.
I'm not 100% sure whether we should close this issue already. I think there is still an issue in some situations.
Oh, okay! My mistake. I'll let you know if I encounter this issue again.
Well Thomas, it didn't take me long before I found this issue again :joy: I have been working on an updated example for `C-internalMarkerAPI.c`. You can see it here. Just like the old example, there are three regions, and two of them are nested within the third. The region that contains the other two is called "Total". When I compile and run this program (using the makefile in `./examples`), the region "Total" only measures the first group, and every counter reports `18446740000000000000.000000` for its value.
- Are nested regions supported?
- Is the usage otherwise wrong in the file I linked above?
- Or, is this another instance of this bug?
I'm sure you're busy but I appreciate any help you can provide.
If I put all three regions inside a single for loop the problem goes away. Here's updated code. However, this is significantly different from the original example code; is this still the intended usage?
Hi Riley,
the problem is that during the Total region the event set changes due to `LIKWID_MARKER_SWITCH` near the inner regions. Neither `START` nor `STOP` actually starts/stops the hardware counters. The hardware counters are started in `INIT`; `START` and `STOP` only read the counters and store/process the counts. So `START(Total)` stores a value and `STOP(Total)` reads another value. Normally the stop value should be greater than the start value, and the difference is stored. With `SWITCH`, the counters are actually stopped, reprogrammed, zeroed, and started again. This can lead to problematic effects like the unreasonably high values, as the stop count might not be greater than the start count. There is some overflow detection, but there might be non-working edge cases.
In your updated version, the event set is not switched inside any region, and therefore it works. The example should show the following points:
- Where to put the macros in general (`INIT` and `CLOSE` in serial, `START` and `STOP` in parallel regions)
- Regions can be nested (that's why there are three regions, one normal and two nested)
- Regions can be registered beforehand (in a parallel region, possibly with a comment that a barrier is required between `REGISTER(FOO)` and the corresponding `START(FOO)`; a separate OpenMP parallel region or `#pragma omp barrier` is suitable)
- Region data can be retrieved with `GET`
- Region data can be reset with `RESET`
- The event set can be switched with `SWITCH` outside of any region in serial execution (and an implicit or explicit barrier should be used)
- A comment that `LIKWID_MARKER_THREADINIT` is commonly not required (and deprecated, but that's not documented yet). It is only required if the LIKWID pinning library failed and the threads therefore run unpinned and could be migrated by the OS scheduler. I don't know of any threading runtime system where that currently happens.
Best, Thomas
Hello,
I encountered a similar problem of unreasonably high values and wanted to share my findings. I use the C/C++ API and switch event sets with `perfmon_switchActiveGroup()` continuously. I was able to identify the problem: on rare occasions, due to a bug in my application, the time (sleep) between switching event sets was too small (0 <= x <= 0.005 seconds) for LIKWID to handle, leading to those extreme values. I guess this would be one of those edge cases...
Thanks for the valuable input.
So you do something like this?

```
#parallel
{
    #barrier
    if (thread_id == 0)
        LIKWID_MARKER_SWITCH
    else
        sleep(0 <= x <= 0.005 seconds)
}
```
or more like this (highly exaggerated)?

```
i++
LIKWID_MARKER_SWITCH
j++
LIKWID_MARKER_SWITCH
```
I'm trying to come up with some example code for the MarkerAPI Pitfalls page.
My code would roughly translate to

```
loop {
    #parallel
    {
        #pragma omp barrier
        #pragma omp master
        LIKWID_MARKER_SWITCH;
        #pragma omp barrier
        sleep(0 <= x <= 0.005 seconds) // or a very small workload
    }
}
```
giving me the following results (note the maximum DP [MFLOP/s] per core should be around 30):

```
FLOPS_DP,Runtime (RDTSC) [s],6.33566e-05,6.33566e-05,6.33566e-05,6.33566e-05
FLOPS_DP,Runtime unhalted [s],8.82689e-05,0.000733393,0.000796717,0.000389245
FLOPS_DP,DP [MFLOP/s],17.6146,0.0157837,263.398,11.3642
```
It's interesting that your code doesn't return correct results. It's basically what I recommended on the Pitfalls page for cases where you have to use `LIKWID_MARKER_SWITCH` in a parallel region. The barriers surrounding it should prevent other threads from reading the counters while one thread is doing the switch. Do you also have region markers inside your code? Are they around the loop? Or do you run without the MarkerAPI but use the switching feature to switch at desired code locations rather than in a time-based manner?
Sorry for the confusion. As mentioned, I'm using the C/C++ API, but I converted my example in the last answer to what I believe would be the MarkerAPI/OpenMP equivalent. The following lines are the actual code producing the wrong results. I'm observing the unreasonably high values when sleeping less than 0.005 s. The lines below run in the background (single-threaded, one instance, with `LIKWID_FORCE=1`) while I'm executing my real application.
```cpp
perfmon_setupCounters(gids[0]);
double setTime = 5000000; // or less
for (size_t i = 0; i < gids.size(); i++) {
    if (i == 0)
        perfmon_startCounters();
    else
        perfmon_switchActiveGroup(gids[i]);
    usleep(setTime);
}
perfmon_stopCounters();
```
I know that such small intervals are not reasonable and precision will suffer. I accidentally used such a low `setTime` and was wondering about the high results.
Thanks for the update. I missed that you use the C/C++ API and not the MarkerAPI. Sorry.
It's interesting that it has problems although everything is single-threaded. This is basically what the timeline mode does (but in Lua) but the switching time is commonly not that short (as you noted).
I just checked the code of `perfmon_switchActiveGroup` and there is a problem in there: it calls `__perfmon_switchActiveGroupThread` for each thread, but this function uses calls that process all threads (`perfmon_stopCounters`, `perfmon_setupCounters`, and `perfmon_startCounters`). Consequently, we are switching all threads again and again for each thread (high overhead), but this shouldn't cause problems like yours; the registers are just reset to zero multiple times.
The real application you are talking about is not accessing the hardware counters, right? I have to write some test code to investigate this problem further.
The problem occurs independently from the real application being executed. LIKWID is the only one accessing the hardware counters.
You can use this reproducer with `./likwid <Set1,Set2> <Interval> <Iterations>`. The output that I posted earlier was measured with `./likwid FLOPS_DP 0.005 1`. The most dramatic results occur with an interval of `0` seconds.
Thanks for supplying a test code. Unfortunately, it segfaults directly in `perfmon_init`. I'll sort it out.
I'm sorry to hear that. I tested it on CentOS 7 and Ubuntu 20.04. With the included Makefile and the latest LIKWID release I did not get any errors.
I'm not sure whether I introduced the bug in the master branch, but the problem was that the `thread_id -> (core_id|socket_id|numa_id|llc_id)` maps were not initialized. This is done in `affinity_init()` in the current master (somewhat of a bad location, I know).
So, I was able to fix the other issue with your reproducer; now I'm looking at the high values.
Hey Tom! I saw this issue again today and thought my findings may be useful. Running the following command causes one or both of `DRAM_READS` and `DRAM_WRITES` to report 1.845e19.

```
likwid-perfctr -C S0:0-3 -g MEM -m -M 1 likwid-bench -t copy_avx -i 100 -w S0:2GB:4
```
I wasn't sure if this was fixed, so after finding the issue I pulled the latest master and reinstalled likwid, but this did not solve the issue. I also played around with the size and number of iterations, but couldn't seem to stop this issue from occurring. I tried 10 iterations and 200 iterations with no change. I also tested a size of 200MB with no change.
Hope this case provides some insight! Thank you for your continued support.
Hi Riley,
that's some Intel desktop chip, isn't it? On Intel desktops, the memory interface counters are accessible only by mmap'ing `/dev/mem` and reading specific offsets. The counters are free-running (no start/stop), provide no overflow indication, and are only 32 bits wide, so they can overflow frequently. I'll check that.
It is an Intel machine, but it's a laptop (not sure if that matters...). Is there anything I can do on my end to help avoid this issue?