OpenMP performance with atomic updates
I performed some scaling tests (OpenMP, no MPI) with up to 50 cores and found a performance degradation when using all cores; the runtime is even longer than with the optimal core count. I tracked the root cause down to the atomic updates of some variables. I used OpenMC version 0.12.2 for my tests, the compiler is gcc 9.3.0 (Ubuntu 20.04 LTS), and my problem uses no tallies (see below). I patched the code to be able to compare three different implementations (patch attached):
a) atomic update (omp atomic)
b) std::vector with one entry per thread
c) omp threadprivate
I benchmarked an eigenvalue and a fixed source calculation on a common, simple geometry. Results from the three implementations are:
| Implementation | Eigenvalue | Speedup | Fixed source | Speedup |
| --- | --- | --- | --- | --- |
| Atomic | 71.61 s | 1.00 | 287.96 s | 1.00 |
| std::vector | | | | |
| omp threadprivate | 43.75 s | 1.64 | 35.21 s | 8.13 |
- I guess the threadprivate solution is the best: on NUMA architectures the variables might be allocated by OpenMP in node-local memory, whereas the std::vector is allocated in one small memory block, which might even cause cache issues.
- The speedup for eigenvalue calculations is not as high as for fixed source because the storage of source points in the shared bank array also involves atomic operations, which serializes the threads. A solution could be a per-thread (temporary) fission_bank (see the first sketch after this list).
- Atomic operations also occur in the tally updates, which were not relevant in the tests here. A per-thread accumulator might consume too much memory (I already have a preliminary implementation of this, and it works very well for "small" tallies). Another approach might be a per-thread tally event cache that is only applied to the tally accumulator every N tally scores (see the second sketch after this list).
- Also note that the large speedup in my fixed source test is only achieved if I declare the static variable in IndependentSource::sample as threadprivate. Otherwise the speedup is only 1.70 / 1.14. This in fact changes the behavior of the code (i.e. a bug!). My guess is that OpenMP introduces an implicit atomic operation for the static variable, at least in the gcc implementation; I did not find that behavior described in the OpenMP specs. Other compilers (icc, clang) might do it differently, and if somebody has information on that, I would like to hear about it.
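To illustrate the per-thread fission bank idea from the second bullet, here is a minimal sketch of what I have in mind (the `Site` type and function names are invented for illustration, not OpenMC's actual data structures): each thread appends to its own buffer during transport, and the buffers are concatenated once per generation, so the hot path needs no atomics.

```cpp
#include <omp.h>
#include <vector>

struct Site { double xyz[3]; double E; };   // stand-in for a fission site record

// Each thread fills its own bank; the merge happens once, outside the hot loop.
std::vector<Site> run_generation(int n_particles) {
  std::vector<std::vector<Site>> thread_banks(omp_get_max_threads());

  #pragma omp parallel
  {
    auto& my_bank = thread_banks[omp_get_thread_num()];
    #pragma omp for
    for (int i = 0; i < n_particles; ++i) {
      // ... transport particle i; on fission, append sites without any atomic:
      my_bank.push_back(Site{{0.0, 0.0, 0.0}, 1.0e6});
    }
  }

  // Serial concatenation into the global fission bank.
  std::vector<Site> fission_bank;
  for (auto& bank : thread_banks) {
    fission_bank.insert(fission_bank.end(), bank.begin(), bank.end());
  }
  return fission_bank;
}
```

The per-thread buffers cost some extra memory and change the ordering of the sites compared to a single shared bank, so reproducibility of the site order would need extra care.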
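And for the per-thread tally event cache from the third bullet, a sketch of the kind of buffering I mean (class and member names are invented; the real tally data structures in OpenMC are more involved): scores are collected per thread and flushed into the shared accumulator with atomics only every N events.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical buffered scorer; 'global' is the shared tally results array.
class TallyCache {
public:
  TallyCache(std::vector<double>& global, std::size_t flush_every = 64)
    : global_(global), flush_every_(flush_every) {}

  ~TallyCache() { flush(); }

  // Called on every tally event; only buffers, no synchronization.
  void score(std::size_t bin, double value) {
    events_.push_back({bin, value});
    if (events_.size() >= flush_every_) flush();
  }

  // One atomic per buffered event, but only every 'flush_every_' scores.
  void flush() {
    for (const auto& e : events_) {
      #pragma omp atomic
      global_[e.bin] += e.value;
    }
    events_.clear();
  }

private:
  struct Event { std::size_t bin; double value; };
  std::vector<double>& global_;
  std::size_t flush_every_;
  std::vector<Event> events_;   // each thread owns its own TallyCache
};
```

Each thread would hold one TallyCache as a local object inside the parallel region; the destructor flushes whatever is left at the end of the batch.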
Attached you will find the patch to OpenMC 0.12.2 that I used for experimenting (set THREAD_LOCAL_MODE in openmc/include/threadprivate.h and recompile to change modes) as well as my benchmarking script.
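For reference, the threadprivate mode essentially boils down to declarations like the following sketch (counter name and limit are invented here, not the literal patch contents): a block-scope static plus a threadprivate directive right after it, so every thread keeps its own copy and no implicit synchronization can occur.

```cpp
// Sketch only: invented names, not the code from the patch. Compile with -fopenmp.
bool note_rejection(int limit) {
  static int n_reject = 0;             // formerly one counter shared by all threads
  #pragma omp threadprivate(n_reject)  // now every thread keeps its own copy

  return ++n_reject >= limit;          // true once this thread hits the limit
}
```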
The std::vector approach is giving you false sharing. If you create a custom data type that is guaranteed to take up a full cache line for each entry, it should be more effective. I'm curious to see if this works better! For instance, your container might look like:
```cpp
#include <omp.h>
#include <vector>

template<typename T>
class ThreadData {
public:
  // One padded slot per thread that could ever run.
  ThreadData() : data(omp_get_max_threads()) {}

  // Inside a parallel region, write only this thread's slot; outside,
  // reset every slot (e.g. when clearing the value between batches).
  ThreadData& operator=(T v) {
    if (omp_in_parallel()) {
      data[omp_get_thread_num()].value = v;
    } else {
      for (auto& d : data) d.value = v;
    }
    return *this;
  }

private:
  // Takes up a whole cache line MOST of the time (64 bytes on common
  // hardware; assumes T is trivially constructible and no larger than that).
  union TakesCacheLine {
    double bufferspace[8];
    T value;
  };
  std::vector<TakesCacheLine> data;
};
```
As for the static variable in IndependentSource::sample, we should consider simply removing it. The desired behavior is to quit the program after a large number of potential source sites are rejected. If we simply defined the error to happen on a per-thread basis rather than on the total number of samples across all threads, this error likely wouldn't happen.
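A rough sketch of that direction (generic, with invented names and threshold, not actual OpenMC code): the rejection counter becomes a local variable of the sampling call, so it is automatically per thread and the static disappears entirely.

```cpp
#include <stdexcept>

// Generic rejection loop; the counter lives on the calling thread's stack.
template<typename SampleFn, typename AcceptFn>
auto sample_until_accepted(SampleFn sample, AcceptFn accept, int max_rejects = 100000) {
  int n_reject = 0;   // per call, hence per thread -- no shared state
  while (true) {
    auto site = sample();
    if (accept(site)) return site;
    if (++n_reject >= max_rejects) {
      throw std::runtime_error("too many rejected source sites on this thread");
    }
  }
}
```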
That being said, you're likely seeing lackluster scaling here because this problem is incredibly simple. For larger problems, substantially more work is done between the OMP atomics, and you should see better scaling. Even just adding tallies should show you nicer scaling.
Hi, I tried your approach on the fixed source case, but with very limited success. The problem I see is a large jitter in the runtimes, which I attribute to my setup. (Unfortunately my compute node resides on a virtual machine, so cores can be rescheduled by VMware, which makes the cache-line optimization pointless. In fact, I can see a performance degradation if I start some computations on the login VM residing on the same server. So it is not the best setup for such performance comparisons.)
So I ran the test cases 12 times in a row, discarded the first two runs, and report the mean and standard deviation of the runtimes in seconds:
- Atomic version: 321.15 +/- 2.68
- Memory version: 33.66 +/- 0.58
- Cacheline version: 33.28 +/- 0.66
- threadprivate version: 32.75 +/- 0.44
So the currently implemented atomic version is definitely worse than the other three. The Memory, Cacheline, and threadprivate versions are the same within the uncertainties.
I would definitely agree that a static variable within a function is not good coding style and that it should be removed.
Concerning the scaling, you are right: with larger problems and tallies, the scaling should get better. But with more threads it will get worse again; the next compute nodes we are considering might have 80 cores (160 with hyperthreading). Replacing the current omp atomic with one of the alternative implementations gives a clear improvement for my simple problem and has no performance drawback for any other problem, so I would love to see one of these implementations merged into OpenMC. And the problem I posted is not that contrived: I'm working on a transient problem where I call OpenMC a few hundred times with very little actual work in the main loop.
Concerning the semantics, my alternative versions differ a bit.
The Memory / Cacheline version is quite "invisible" to the programmer, as only the declarations of the corresponding variables change; there is no need to update the code in the main loop. On the other hand, the accumulation in the conversion operator is tailored to the use case in OpenMC and can be a source of errors if it is used within a parallel section.
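To make that concrete, a simplified sketch of the kind of container I mean (invented names; the real patch has a few more members): writes go into a padded per-thread slot, and reading the value through the conversion operator sums all slots, which is only meaningful outside the parallel section.

```cpp
#include <omp.h>
#include <vector>

// Simplified per-thread accumulator with the read-out in a conversion operator.
class ThreadAccum {
public:
  ThreadAccum() : slots_(omp_get_max_threads()) {}

  // Accumulate into this thread's slot only -- no atomics, no false sharing.
  ThreadAccum& operator+=(double v) {
    slots_[omp_get_thread_num()].value += v;
    return *this;
  }

  // Reading the variable sums all slots. Convenient in serial code, but a
  // source of errors if the conversion happens inside a parallel section.
  operator double() const {
    double total = 0.0;
    for (const auto& s : slots_) total += s.value;
    return total;
  }

private:
  struct alignas(64) Slot { double value = 0.0; };  // padded to a cache line
  std::vector<Slot> slots_;
};
```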
The threadprivate version needs additional copyin/copyout logic, and operations within the parallel sections have to be performed on a different variable. This is clearly a source of errors!
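A sketch of the extra boilerplate I mean (invented variable names): the value has to be broadcast into the per-thread copies with copyin on entry and folded back into the shared global on exit.

```cpp
#include <omp.h>

double k_abs_total = 0.0;           // the "real" shared accumulator

static double k_abs_local = 0.0;    // per-thread working copy
#pragma omp threadprivate(k_abs_local)

void run_batch(int n_particles) {
  k_abs_local = 0.0;                // master value, copied into every thread below

  #pragma omp parallel copyin(k_abs_local)
  {
    #pragma omp for
    for (int i = 0; i < n_particles; ++i) {
      // ... transport; score into the thread's own copy, no atomics here:
      k_abs_local += 1.0e-3;
    }

    // Copy-out: one atomic per thread per batch instead of one per event.
    #pragma omp atomic
    k_abs_total += k_abs_local;
  }
}
```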
Happy to help out on this too. I have access to a 768-core SGI UV machine, so I could scan from 1 to 768 cores; it also has 4 TB of memory.