E3SM
ERS.ne4_oQU240.F2010.MACHINE_COMPILER.eam-hommexx is failing COMPARE
This test has failed compare on every machine where I've tried it with GNU. The DEBUG version passes. I've tested with master of May 6th on cori-haswell (with GNU v10.3):
/global/cscratch1/sd/ndk/e3sm_scratch/cori-haswell/ERS.ne4_oQU240.F2010.cori-haswell_gnu.eam-hommexx.20220517_122134_hut7b6
The same failure occurs on the cpu-only nodes of PM, which use gnu v11. And I see this test fails on ascent.
Actually, I already had an open issue https://github.com/E3SM-Project/E3SM/issues/4815 about this failure which I will close in favor of this one, but note it was there in Feb.
On anvil_gnu compare passed.
I believe the state is that it fails with gnu on PM, ascent, and cori HSW. It passes with debug on PM and HSW. The restart PR (Q init) does not seem to help.
Yes I also tried adding in the changes from oksanaguba/homme/print-diagn
but did not see different behavior.
I see anvil uses gnu version 8.2
The test also fails compare with the nvidia compiler on pm-cpu, so this is not necessarily specific to GNU.
ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx
Is this the only test that is failing on PM or cori HSW with gnu?
Also, to make it a little simpler, would you be able to reproduce this failure on anvil or chrysalis, maybe by changing the version of gnu? Debugging on ANL is easiest.
The test passes on chrysalis ERS.ne4_oQU240.F2010.chrysalis_gnu.eam-hommexx
which uses gnu v9.3
@ndkeen does cori have GNU version 8 or 9 available? If so, would you run the test with the lower GNU version to see if we can figure out if this is a GNU version issue or an arch-related issue? Thanks.
Yes I had assumed that the version of GNU was the issue. Cori does have 8.3, but curiously not 9.
This test still fails on pm-cpu with July 5th master.
This summary is from the test suites unless indicated. It's not just gnu.

Fails with:
pgi 21.11 (ascent)
gnu 9.1 (ascent)
ibm 16.1.1 (ascent)
gnu 11.2 (crusher)
gnu 11.2 (perlmutter cpu)
pgi 19.10 (compy)
gnu 10.3 (cori-haswell, manual testing)

Passes:
intel 19.0.3 (cori-knl, cori-haswell)
intel 20.0.4 (chrysalis)
gnu 8.2 (anvil)
gnu 8.1 (mappy)
gnu 9.3 (chrysalis, manual testing)

Gnu only (with architecture):
8.1 PASS (Intel)
8.2 PASS (Intel)
9.1 FAIL (Power9)
9.3 PASS (AMD)
10.3 FAIL (Intel)
11.2 FAIL (AMD)
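To make the pattern in the GNU-only results easier to see, here is a minimal Python sketch that groups the outcomes listed above by architecture and by GNU major version. The data is copied from the summary; the function and variable names are only illustrative and are not part of any E3SM or CIME tooling.

```python
from collections import defaultdict

# GNU-only results copied from the summary: (version, architecture, outcome).
GNU_RESULTS = [
    ("8.1", "Intel", "PASS"),
    ("8.2", "Intel", "PASS"),
    ("9.1", "Power9", "FAIL"),
    ("9.3", "AMD", "PASS"),
    ("10.3", "Intel", "FAIL"),
    ("11.2", "AMD", "FAIL"),
]


def group_outcomes(key_func):
    """Map each key (e.g. architecture) to the set of outcomes seen for it."""
    grouped = defaultdict(set)
    for version, arch, outcome in GNU_RESULTS:
        grouped[key_func(version, arch)].add(outcome)
    return dict(grouped)


by_arch = group_outcomes(lambda version, arch: arch)
by_major = group_outcomes(lambda version, arch: int(version.split(".")[0]))

# A key is "mixed" if it has both PASS and FAIL, i.e. it does not
# determine the outcome on its own.
mixed_archs = sorted(k for k, v in by_arch.items() if len(v) > 1)
mixed_majors = sorted(k for k, v in by_major.items() if len(v) > 1)

print("architectures with both PASS and FAIL:", mixed_archs)
print("GNU major versions with both PASS and FAIL:", mixed_majors)
```

With this data, both Intel and AMD show mixed outcomes, and GNU 9 both passes (9.3) and fails (9.1), so neither the architecture nor the major version alone separates passing from failing runs.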
ERS.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx
still fails with Oct 25 repo.
Also a fail with 1 thread: ERS_PMx1.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx
@ambrad or @oksanaguba any thoughts on this?
This is Oksana's task. In the past I have suggested switching this to SMS if this won't be debugged right now.
The test passes on chrysalis for intel and for gnu. I see that Rob put "potential bug" label; my best guess is that if init code pieces are arranged slightly differently, the code would pass on all versions of compilers. We can either 1) remove the test completely, like Andrew suggested, or 2) keep it as is, or 3) debug it.
I would vote for 1); a passing test on some machines is still valuable. As for option 3), I do not see this as a priority at the moment since we have many other scream tasks.
Another option is to convert this test to D, if debug passes on all other machines.
Yes, converting it to D would be better than removing it or making it SMS. Are you sure this isn't a real problem? It used to pass everywhere.
I must say I do not remember it passing everywhere. I remember PGI was a problem from the beginning, but I do not remember GNU behavior. It is hard to track tests that do not run every day and do not have baselines. Also, some of the machines did not run stably at the time the corresponding PR went in.
Maybe it is a bug, but this works on some machines and in debug. There was a related issue: when I was debugging forcing logic in hommexx, I tried to make eam+homme and eam+hommexx runs bfb and failed (the verification had to be done via climo runs instead). I assume this is because there is some arrangement in init code that causes non-bfb results (standalone homme presumably tests all functional, non-init, code).
Considering that this is not needed for scream, I would leave this as is. Moving it to D on all machines will remove the passing opt. build on chrysalis.
What about adding a D test, but keeping the opt. build test too?
But our goal is all-green. Having a test be red forever is not an option.
Actually I just realized this is passing on gcp with gnu 12 so maybe a compiler upgrade on pm-cpu is what we need. In that case, this could stay red while waiting for that.
I just tried the test on alvarez, which is just like pm-cpu and has GNU 12.1. The test still fails compare. All other tests are OK in e3sm_developer.
Noting same compare fail with next of July 18th
Currently, it looks like this test is only failing on pm-cpu/nvidia. It fails compare.
Perhaps we can make a bless request? I would just have no idea what changed to allow passes on other machines.
ERS_D.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx
I think the problem is that it fails restart-compare, not just baseline-compare, so blessing the diffs won't help.
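The distinction between the two comparisons can be illustrated with a toy Python sketch. This is not the CIME implementation of ERS and is not a diagnosis of the actual hommexx problem: the model, the field names `t` and `q`, and the idea that a field is missing from the restart file are all assumptions made purely for illustration.

```python
def step(state):
    """Advance a toy model one step; 'q' feeds back into 't'."""
    return {"t": state["t"] + 0.5 * state["q"], "q": state["q"] * 0.9}


def run(state, nsteps):
    """Run the toy model for nsteps and return the final state."""
    for _ in range(nsteps):
        state = step(state)
    return state


def write_restart(state, fields):
    """Save only the listed fields (hypothetical restart file)."""
    return {name: state[name] for name in fields}


def read_restart(restart, defaults):
    """Fields missing from the restart fall back to their initial values."""
    state = dict(defaults)
    state.update(restart)
    return state


initial = {"t": 1.0, "q": 2.0}

# Continuous run of 10 steps.
continuous = run(initial, 10)

# Run 5 steps, write a restart that omits 'q', then resume for 5 more steps.
halfway = run(initial, 5)
resumed = run(read_restart(write_restart(halfway, ["t"]), initial), 5)

# Same procedure with a complete restart file.
resumed_full = run(read_restart(write_restart(halfway, ["t", "q"]), initial), 5)

# "Blessing" replaces the stored baseline with the current output, so the
# baseline comparison passes afterwards by construction.
baseline = dict(continuous)
baseline_compare_passes = continuous == baseline

# The restart comparison never looks at the baseline: it compares two
# runs produced within the same test.
restart_compare_passes = continuous == resumed
full_restart_passes = continuous == resumed_full

print("baseline compare passes:", baseline_compare_passes)
print("restart compare passes (incomplete restart):", restart_compare_passes)
print("restart compare passes (complete restart):", full_restart_passes)
```

In this sketch, blessing makes the baseline comparison pass, but the restart comparison still fails until the restarted run reproduces the continuous run exactly.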
Ah. Thanks Andrew. It's been a while since I've looked at this and forgot the issue.
With an Oct 27th checkout, I still see compare fails:
ERS.ne4_oQU240.F2010.pm-cpu_gnu.eam-hommexx
ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx