E3SM ERS.ne4_oQU240.F2010.MACHINE_COMPILER.eam-hommexx is failing COMPARE

Every machine on which I've tried this test with GNU has failed compare. The DEBUG version passes. I tested with master of May 6th on cori-haswell (with GNU v10.3):

/global/cscratch1/sd/ndk/e3sm_scratch/cori-haswell/ERS.ne4_oQU240.F2010.cori-haswell_gnu.eam-hommexx.20220517_122134_hut7b6

The same failure occurs on the CPU-only nodes of PM, which use GNU v11. I also see this test fail on ascent.

Actually, I already had an open issue https://github.com/E3SM-Project/E3SM/issues/4815 about this failure which I will close in favor of this one, but note it was there in Feb.

May 17 '22 23:05 ndkeen

On anvil_gnu compare passed.

May 17 '22 23:05 oksanaguba

I believe the state is that it fails with gnu on PM, ascent, and cori HSW. It passes with debug on PM and HSW. The restart PR (Q init) does not seem to help.

May 17 '22 23:05 oksanaguba

Yes I also tried adding in the changes from oksanaguba/homme/print-diagn but did not see different behavior.

I see anvil uses gnu version 8.2

May 17 '22 23:05 ndkeen

With the nvidia compiler on pm-cpu, this test also fails compare, so the failure is not necessarily specific to GNU: ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

May 18 '22 16:05 ndkeen

Is this the only test that is failing on PM or cori HSW with gnu?

Also, to make it a little simpler, would you be able to reproduce this failure on anvil or chrysalis, maybe by changing the version of gnu? Debugging on ANL is easiest.

May 18 '22 20:05 oksanaguba

The test passes on chrysalis ERS.ne4_oQU240.F2010.chrysalis_gnu.eam-hommexx which uses gnu v9.3

May 19 '22 00:05 ndkeen

@ndkeen does cori have GNU version 8 or 9 available? If so, would you run the test with the lower GNU version to see if we can figure out if this is a GNU version issue or an arch-related issue? Thanks.

May 19 '22 01:05 ambrad

Yes I had assumed that the version of GNU was the issue. Cori does have 8.3, but curiously not 9.

May 19 '22 02:05 ndkeen

This test still fails on pm-cpu with July 5th master.

Jul 05 '22 22:07 ndkeen

This summary is from the test suites unless indicated. It's not just gnu.

Fails with:
pgi 21.11 (ascent)
gnu 9.1 (ascent)
ibm 16.1.1 (ascent)
gnu 11.2 (crusher)
gnu 11.2 (perlmutter cpu)
pgi 19.10 (compy)
gnu 10.3 (cori-haswell, manual testing)

Passes:
intel 19.0.3 (cori-knl, cori-haswell)
intel 20.0.4 (chrysalis)
gnu 8.2 (anvil)
gnu 8.1 (mappy)
gnu 9.3 (chrysalis, manual testing)

Gnu only (with architecture):
8.1 PASS (Intel)
8.2 PASS (Intel)
9.1 FAIL (Power9)
9.3 PASS (AMD)
10.3 FAIL (Intel)
11.2 FAIL (AMD)
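To make the GNU-only results easier to reason about, here is a small Python sketch that checks two simple hypotheses against them: that a single version cutoff separates passes from failures, and that architecture alone does. The data are copied from the summary above; the function names are made up for illustration and are not part of any E3SM or CIME tooling.

```python
# GNU results copied from the summary: (version, architecture, passed).
GNU_RESULTS = [
    ("8.1", "Intel", True),
    ("8.2", "Intel", True),
    ("9.1", "Power9", False),
    ("9.3", "AMD", True),
    ("10.3", "Intel", False),
    ("11.2", "AMD", False),
]


def version_key(version):
    """Turn '10.3' into (10, 3) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def version_cutoff_explains(results):
    """True if every passing version is older than every failing version."""
    passing = [version_key(v) for v, _, ok in results if ok]
    failing = [version_key(v) for v, _, ok in results if not ok]
    return max(passing) < min(failing)


def mixed_architectures(results):
    """Architectures that show both a PASS and a FAIL."""
    outcomes = {}
    for _, arch, ok in results:
        outcomes.setdefault(arch, set()).add(ok)
    return sorted(arch for arch, seen in outcomes.items() if len(seen) == 2)


print("single version cutoff explains results:", version_cutoff_explains(GNU_RESULTS))
print("architectures with both PASS and FAIL:", mixed_architectures(GNU_RESULTS))
```

On these six data points, no single version cutoff separates passes from failures (9.1 fails while 9.3 passes), and both Intel and AMD show a pass and a fail, so neither version nor architecture alone accounts for the pattern.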

Sep 21 '22 17:09 rljacob

ERS.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx still fails with Oct 25 repo.

Also a fail with 1 thread: ERS_PMx1.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx

Oct 25 '22 18:10 ndkeen

@ambrad or @oksanaguba any thoughts on this?

Dec 06 '22 16:12 rljacob

This is Oksana's task. In the past I have suggested switching this to SMS if this won't be debugged right now.

Dec 06 '22 17:12 ambrad

The test passes on chrysalis for intel and for gnu. I see that Rob put a "potential bug" label on this; my best guess is that if the init code pieces were arranged slightly differently, the code would pass with all compiler versions. We can either 1) remove the test completely, like Andrew suggested, or 2) keep it as is, or 3) debug it.

I would vote for 1), a passing test on some machines is still valuable. As for option 3), I do not see this as a priority at the moment since we have many other scream tasks.

Dec 06 '22 18:12 oksanaguba

Another option is to convert this test to D, if debug passes on all other machines.

Dec 06 '22 18:12 oksanaguba

Yes, converting it to D would be better than removing it or making it SMS. Are you sure this isn't a real problem? It used to pass everywhere.

Dec 06 '22 19:12 rljacob

I must say I do not remember it passing everywhere. I remember PGI was a problem from the beginning, but I do not remember the GNU behavior. It is hard to track tests that do not run every day and do not have baselines. Also, some of the machines were not running stably at the time the corresponding PR went in.

Maybe it is a bug, but this works on some machines and in debug. There was a related issue: when I was debugging the forcing logic in hommexx, I tried to make eam+homme and eam+hommexx runs bfb and failed (the verification had to be done via climo runs instead). I assume this is because some arrangement in the init code causes non-bfb results (standalone homme presumably tests all functional, non-init, code).

Considering that this is not needed for scream, I would leave this as is. Moving it to D on all machines would remove the passing opt. build on chrysalis.

Dec 06 '22 19:12 oksanaguba

What about adding a D test, but keeping the opt. build test too?

Dec 06 '22 19:12 oksanaguba

But our goal is all-green. Having a test be red forever is not an option.

Dec 06 '22 19:12 rljacob

Actually I just realized this is passing on gcp with gnu 12 so maybe a compiler upgrade on pm-cpu is what we need. In that case, this could stay red while waiting for that.

Dec 06 '22 19:12 rljacob

I just tried the test on alvarez, which is just like pm-cpu and has GNU 12.1. The test still fails compare. All other tests are OK in e3sm_developer.

Dec 06 '22 20:12 ndkeen

Noting the same compare failure with next of July 18th.

Jul 18 '23 18:07 ndkeen

I think that currently this test is only failing on pm-cpu/nvidia. It fails compare. Perhaps we can make a bless request? I would just have no idea what changed to allow it to pass on other machines. ERS_D.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

Jul 27 '23 17:07 ndkeen

I think the problem is that it fails restart-compare, not just baseline-compare, so blessing the diffs won't help.
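To illustrate the distinction, here is a toy Python sketch of what an exact-restart comparison checks: a continuous run is compared bit-for-bit against a run that stops halfway, writes a restart, and resumes. Everything in it (the model, the `aux` variable, the restart helpers) is hypothetical and only shows one way such a check can fail; it is not E3SM or CIME code and not a diagnosis of this issue.

```python
# Toy illustration of an exact-restart check; not E3SM or CIME code.

def step(state):
    """Advance a toy model one step; `aux` is carried between steps."""
    aux = 0.5 * state["aux"] + state["x"]
    x = 0.9 * state["x"] + 0.1 * aux
    return {"x": x, "aux": aux}


def run(state, nsteps):
    """Run the toy model for nsteps steps and return the final state."""
    for _ in range(nsteps):
        state = step(state)
    return state


def write_restart(state, include_aux):
    """Hypothetical restart writer that can omit `aux` from the file."""
    return dict(state) if include_aux else {"x": state["x"]}


def read_restart(restart):
    """Rebuild the state; anything missing is re-initialized to zero."""
    return {"x": restart["x"], "aux": restart.get("aux", 0.0)}


def restart_compare(nsteps, include_aux):
    """Return True if the restarted run matches the continuous run exactly."""
    init = {"x": 1.0, "aux": 0.0}
    continuous = run(dict(init), nsteps)
    first_half = run(dict(init), nsteps // 2)
    resumed = read_restart(write_restart(first_half, include_aux))
    restarted = run(resumed, nsteps - nsteps // 2)
    return continuous["x"] == restarted["x"]


print("restart with full state matches:", restart_compare(4, include_aux=True))
print("restart missing aux matches:", restart_compare(4, include_aux=False))
```

Both runs being compared are produced inside the same test, so no stored baseline is involved in the comparison; that is why updating baselines would not change the outcome.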

Jul 27 '23 17:07 ambrad

Ah. Thanks Andrew. It's been a while since I've looked at this and forgot the issue.

Jul 27 '23 17:07 ndkeen

With an Oct 27th checkout, I still see compare failures:

ERS.ne4_oQU240.F2010.pm-cpu_gnu.eam-hommexx
ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

Oct 27 '23 22:10 ndkeen
