covid-sim Checksum fails on Cori. (Haswell/KNL DOE supercomputer)

Open kngott opened this issue 4 years ago • 2 comments

During our initial testing, we compiled successfully with both the Intel and GCC compilers. However, the test case failed on Cori but passed on our workstations. We tracked the problem down to the AVX2 and AVX-512 flags passed by the Cray compiler wrappers:

-march=core-avx2 & -march=core-avx512

While tracking this down, we noticed that the first divergence happened in P.SpatialBoundingBox[3]. It seems to come from roundoff error in SetupModel.cpp, around line 128:

P.nch = 4 * ((int)ceil(P.height / P.cheight / 4));

P.height / P.cheight is very close to 7, and P.nch ends up being 7 with AVX2 and 8 without.
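To illustrate how sensitive this kind of expression is, here is a minimal sketch. The function names and the input values are made up for the example and are not taken from covid-sim; only the shape of the expression matches the line quoted above. A ratio that is one rounding step above an exact multiple of 4 pushes the ceiling to the next multiple, and a hypothetical tolerant variant snaps near-integer ratios before rounding up.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper with the same shape as the quoted line:
// round a cell count up to the next multiple of 4.
int cells_rounded_up_to_4(double extent, double cell_size) {
    assert(cell_size > 0.0);
    return 4 * static_cast<int>(std::ceil(extent / cell_size / 4.0));
}

// Hypothetical tolerant variant (an assumption, not covid-sim code):
// if the ratio is within rel_eps of an integer, use that integer,
// so a last-bit rounding difference cannot change the result.
int cells_rounded_up_to_4_tolerant(double extent, double cell_size,
                                   double rel_eps = 1e-9) {
    assert(cell_size > 0.0);
    const double q = extent / cell_size / 4.0;
    const double nearest = std::round(q);
    const bool near_integer =
        std::fabs(q - nearest) <= rel_eps * std::fmax(1.0, std::fabs(nearest));
    return 4 * static_cast<int>(std::ceil(near_integer ? nearest : q));
}
```

With an extent of exactly 8.0 and a cell size of 1.0, both functions return 8. With an extent of std::nextafter(8.0, 9.0), the first returns 12 while the tolerant one still returns 8.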

Note that if one prints the numbers (or inserts a std::atomic_thread_fence call there), the numbers agree: both are 8. However, the code then diverges again elsewhere and the checksum still fails.

A general note: the checksum regression test will likely not work across different compilers and hardware. Even a*x+b could give different answers depending on what the compiler chooses to do: a fused multiply-add (FMA), or a multiplication followed by an addition.
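The a*x+b point can be reproduced with a small sketch. The function names and operand values are chosen for illustration only. std::fma rounds once, after the exact product and sum, while the separate version rounds the product first; the volatile temporary is there only to stop the compiler from fusing the two operations itself.

```cpp
#include <cmath>

// Multiply, round, then add. The volatile temporary forces the product
// to be rounded to double before the addition, so the compiler cannot
// contract the expression into a single FMA instruction.
double mul_then_add(double a, double x, double b) {
    volatile double product = a * x;
    return product + b;
}

// Fused multiply-add: a * x + b computed with a single rounding.
double fused(double a, double x, double b) {
    return std::fma(a, x, b);
}

// Illustrative operands: a = x = 1 + 2^-27, b = -(1 + 2^-26).
// The exact product is 1 + 2^-26 + 2^-54; the 2^-54 term is lost
// when the product is rounded to double on its own.
const double kA = 1.0 + std::ldexp(1.0, -27);
const double kB = -(1.0 + std::ldexp(1.0, -26));
```

For these operands, mul_then_add(kA, kA, kB) gives exactly 0.0, while fused(kA, kA, kB) gives 2^-54 (about 5.55e-17). Both are legitimate floating-point results, which is why a bit-exact checksum can differ between builds.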

Apr 03 '20 17:04 kngott

So I agree the checksum regression test is not a perfect solution. As another example of the issue you highlight, we deliberately run it single-threaded so that differences in thread timing don't produce different results between runs. Fused multiply-accumulate and vectorization will just add to the issues.

This isn't a problem running the model in full as it is stochastic anyway.

However, we haven't had the time to work out a scalable and maintainable way of running the regression test in a way that allows a small amount of variation, but doesn't let the figures drift over time.
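One possible shape for such a check, as a rough sketch only: compare each output value against a fixed, committed reference within a tolerance instead of hashing the output. The function name, parameters, and tolerance scheme below are assumptions for illustration, not an existing covid-sim facility. Comparing against a fixed reference rather than the previous run's output is what would keep small differences from accumulating over time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tolerance check: every value in `actual` must lie within
// max(abs_tol, rel_tol * |reference|) of the stored reference value.
bool within_tolerance(const std::vector<double>& reference,
                      const std::vector<double>& actual,
                      double rel_tol, double abs_tol) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (reference.size() != actual.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = std::fabs(actual[i] - reference[i]);
        const double limit =
            std::fmax(abs_tol, rel_tol * std::fabs(reference[i]));
        if (!(diff <= limit)) return false;  // written this way so NaN fails
    }
    return true;
}
```

A check like this only helps while rounding differences stay small. It would not absorb a case where a rounding difference changes an integer quantity such as a cell count, because the runs then diverge outright.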

Apr 04 '20 17:04 matt-gretton-dann

c=1 in the regression testing did catch my eye. :)

Until a more general solution to the regression testing is worked out, I guess the best thing to do is just be aware the AVX2 and AVX512 instruction sets are known to cause variations.

Apr 04 '20 23:04 kngott
