Performance hygiene

Open axch opened this issue 8 years ago • 11 comments

Process

  • [ ] Run a representative example under the profiler at low resolution
  • [ ] Fiddle with the parameters enough to characterize the scaling behavior of various parts of the computation
  • [ ] Predict where the performance cost will be at realistic scale
  • [ ] Investigate any surprises
  • [ ] Repeat for a few more examples, if more information about code coverage is needed

Specific local tweaks that Internet rumor suggests may help

  • [ ] Adjust iteration order to stride arrays in line with their cache layout (see the sketch after this list)
  • [ ] Make sure there is no implicit copying of allocatable arrays
  • [ ] There may be syntactic constructs (FORALL?) that help the compiler get instruction-level parallelism
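
For reference, here is the sort of reordering the first bullet is about. Fortran arrays are column-major, so the first (leftmost) index should vary fastest, i.e. in the innermost loop. This is only a sketch: the names h, dhdt, etc. are placeholders, not the model's actual variables.

    ! Illustrative only: h and dhdt are placeholder names, not model fields.
    subroutine euler_step_sketch(h, dhdt, dt, nx, ny)
      implicit none
      integer, intent(in) :: nx, ny
      real, intent(in)    :: dhdt(nx, ny), dt
      real, intent(inout) :: h(nx, ny)
      integer :: i, j

      ! Fortran arrays are column-major: h(1,j), h(2,j), ... are adjacent
      ! in memory, so the first index should vary fastest.
      do j = 1, ny            ! slow (second) index outermost
        do i = 1, nx          ! fast (first) index innermost: unit stride
          h(i, j) = h(i, j) + dt * dhdt(i, j)
        end do
      end do
      ! The reverse nesting (do i outermost, do j innermost) strides through
      ! memory in steps of nx elements and works against the L1 data cache.
    end subroutine euler_step_sketch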

Not sure how this activity relates to possible rewrites. On the one hand, a rewrite in not-Fortran would obviate the direct benefits of this tweaking; on the other hand, doing this tweaking provides a more complete baseline to compare any rewrite against, and much of the infrastructure developed for measuring performance can be reused between tweaks like this and larger algorithmic or platform changes.

axch avatar Mar 10 '17 22:03 axch

If we slightly reframe this issue as "Develop a benchmarking system and profile model performance", then it would give us very useful information as we embark on any rewriting efforts. But it may be better to split that into two issues.

Having a robust set of benchmarks would give us valuable information about how worthwhile our tweaks turn out to be.

edoddridge avatar Mar 10 '17 22:03 edoddridge

Yes. That overlaps with #36, which calls for a suite of realistic examples to measure; one of the things to measure on it is performance.

axch avatar Mar 11 '17 00:03 axch

As of PR #68, I am somewhat surprised by how much time the top-level program takes: on the beta plane bump example in the reduced gravity configuration (at a resolution of 500x500x1), it accounts for 50% of the runtime. @edoddridge is this what you would expect as well, or are you as surprised as I am? If we seek to investigate this, one avenue would be to break the top-level program up into more subroutines. That is beneficial in its own right, and now comes with the added bonus of making the profiler information more precise.
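
Just to illustrate the kind of split I have in mind (the stage names below are made up for the illustration, not the actual structure of the code):

    ! Hypothetical shape of a split-up main program: each stage of the
    ! time-step loop becomes its own subroutine, so the profiler can
    ! attribute time to named units instead of one big top-level block.
    program main_loop_sketch
      implicit none
      integer :: n, n_time_steps

      n_time_steps = 502
      do n = 1, n_time_steps
        call step_momentum()
        call step_continuity()
        call maybe_write_output(n)
      end do

    contains

      subroutine step_momentum()
        ! momentum update would live here
      end subroutine step_momentum

      subroutine step_continuity()
        ! layer thickness / continuity update would live here
      end subroutine step_continuity

      subroutine maybe_write_output(n)
        integer, intent(in) :: n
        ! periodic output would live here
      end subroutine maybe_write_output

    end program main_loop_sketch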

axch avatar Mar 18 '17 15:03 axch

I'm a little surprised, but my intuition is that most of the time is spent writing outputs. Those setups are dumping output every 100 time steps, which is not something we'd be likely to do in a production run.

I'm fully behind a plan to split the main program up into smaller logical units - it's a good plan for a number of reasons.

edoddridge avatar Mar 20 '17 11:03 edoddridge

I know you're currently working towards implementing a LAPACK algorithm for #1, but I did a quick test swapping the loop nesting order in SOR_solver. As might be expected from the results you found in #68, it speeds up that subroutine by ~20%, which in turn means the entire simulation is ~20% faster. Not worth postponing the LAPACK work for, but interesting nonetheless.

edoddridge avatar Mar 20 '17 14:03 edoddridge

Interesting. I assume you mean nesting do i = 1, nx inside do j = 1, ny, in contrast to the status quo? For some reason, compiling with gfortran -g -Ofast, I see a 10-15% slowdown from doing that (on my machine, testing a 100x100x2 n-layer configuration with 502 time steps). I have confirmed that the loop order has no effect on either the end-to-end answers (within the tolerance in the test suite) or the number of iterations each solve costs on that example. This is pretty bizarre.

axch avatar Mar 23 '17 14:03 axch

I am very confused by my attempts to find performance tweaks that work on my machine. There are two improvements to SOR_solve that should just work (sketched after this list):

  • Replacing the res array with a scalar, and
  • Reversing the nesting order of the two loops, so that they traverse all the relevant arrays in memory-layout order.
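
To be concrete, here is roughly what I mean with both tweaks applied, sketched on a generic SOR-style interior sweep rather than the actual SOR_solve code (a, b, rhs, and relax are placeholder names, and boundary handling is omitted):

    ! Schematic SOR-style sweep with both tweaks applied: a scalar residual,
    ! and loops nested so the inner loop walks memory contiguously.
    subroutine sor_sweep_sketch(b, a, rhs, relax, nx, ny)
      implicit none
      integer, intent(in) :: nx, ny
      real, intent(in)    :: a(nx, ny), rhs(nx, ny), relax
      real, intent(inout) :: b(nx, ny)
      integer :: i, j
      real :: res                  ! scalar residual instead of a whole array

      do j = 2, ny - 1             ! second (slow) index outermost
        do i = 2, nx - 1           ! first (fast) index innermost: unit stride
          res = a(i, j) * (b(i+1, j) + b(i-1, j) + b(i, j+1) + b(i, j-1)) &
                - rhs(i, j)
          b(i, j) = b(i, j) + relax * res
        end do
      end do
    end subroutine sor_sweep_sketch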

Here's what happens when I try them. I am testing the MIM core on a 100x100x2 n-layer run, for 502 time-steps, in the beta plane bump configuration. Without modifications, perf stat gives me this:

 Performance counter stats for '/home/axch/work/MIM/MIM':

    25,855,032,212 r530010                                                      [20.00%]
    20,426,858,444 L1-dcache-loads                                              [20.02%]
     4,693,351,487 L1-dcache-load-misses     #   22.98% of all L1-dcache hits   [20.04%]
     3,620,578,275 L1-dcache-stores                                             [20.02%]
     1,039,369,168 L1-dcache-store-misses                                       [20.00%]
    97,899,046,284 L1-icache-loads                                              [19.98%]
         2,343,119 L1-icache-misses          #    0.00% of all L1-icache hits   [19.98%]
     2,104,465,477 L1-dcache-prefetches                                         [19.99%]
     2,211,661,182 branch-instructions                                          [20.01%]
        19,784,450 branch-misses             #    0.89% of all branches         [20.01%]

      40.949341088 seconds time elapsed

(The first line is "floating point operations"). As somewhat expected, there is a high rate of L1 data cache misses, consistent with jumping around in memory. Now, I reverse the nesting order of the loops, and I get this:

 Performance counter stats for '/home/axch/work/MIM/MIM':

    25,867,273,484 r530010                                                      [20.00%]
    20,328,333,578 L1-dcache-loads                                              [20.01%]
       476,555,521 L1-dcache-load-misses     #    2.34% of all L1-dcache hits   [20.00%]
     2,099,826,585 L1-dcache-stores                                             [19.99%]
       255,471,777 L1-dcache-store-misses                                       [20.00%]
   106,093,124,630 L1-icache-loads                                              [20.01%]
         2,986,165 L1-icache-misses          #    0.00% of all L1-icache hits   [20.01%]
     1,415,678,273 L1-dcache-prefetches                                         [20.00%]
     2,169,203,636 branch-instructions                                          [20.01%]
        19,661,285 branch-misses             #    0.91% of all branches         [20.01%]

      44.381219694 seconds time elapsed

The data cache is being used much more effectively, as anticipated, both on load and on store. However, the run takes 3.4 seconds longer! And that's apparently because I'm getting 8.2 billion more instruction cache loads than before!?

Double-checked: the number of SOR iterations at each time-step is the same in both runs. This effect also happens if I change the optimization settings from -Ofast to -O2. It also happens, albeit somewhat smaller in magnitude, if I leave the loop nesting order alone and just change the residual array to a scalar.

Quite confused; not sure what to do about this now.

axch avatar Mar 25 '17 21:03 axch

My laptop doesn't seem to support measuring L1-icache-loads, but otherwise replicates this result.

axch avatar Mar 25 '17 21:03 axch

Bizarre! I'm afraid I don't have any useful insights to offer.

To confirm - this also shows that we're using 20% of the flops available, yes? In which case we could get at most a 5x speed up?

edoddridge avatar Mar 26 '17 15:03 edoddridge

Yes, for single-core performance, I think that's right.

axch avatar Mar 26 '17 16:03 axch

Ok, that's good to know, and I think it further justifies spending our efforts on implementing an external solver.

edoddridge avatar Mar 26 '17 16:03 edoddridge