ERS.ne4_oQU240.F2010.MACHINE_COMPILER.eam-hommexx is failing COMPARE

Open ndkeen opened this issue 2 years ago • 10 comments

Every machine on which I've tried this test with GNU has failed compare. The DEBUG version passes. I've tested with master of May 6th on cori-haswell (with GNU v10.3):

/global/cscratch1/sd/ndk/e3sm_scratch/cori-haswell/ERS.ne4_oQU240.F2010.cori-haswell_gnu.eam-hommexx.20220517_122134_hut7b6

The same failure occurs on the CPU-only nodes of PM, which use GNU v11. I also see this test fail on ascent.

Actually, I already had an open issue about this failure, https://github.com/E3SM-Project/E3SM/issues/4815, which I will close in favor of this one. Note that the failure was already present in February.

ndkeen avatar May 17 '22 23:05 ndkeen

On anvil_gnu compare passed.

oksanaguba avatar May 17 '22 23:05 oksanaguba

I believe the state is that it fails with gnu on PM, ascent, and cori HSW. It passes with debug on PM and HSW. The restart PR (Q init) does not seem to help.

oksanaguba avatar May 17 '22 23:05 oksanaguba

Yes, I also tried adding in the changes from oksanaguba/homme/print-diagn but did not see different behavior.

I see that anvil uses GNU version 8.2.

ndkeen avatar May 17 '22 23:05 ndkeen

With the nvidia compiler on pm-cpu, this test also fails compare, so the problem is not necessarily specific to GNU: ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

ndkeen avatar May 18 '22 16:05 ndkeen

Is this the only test that is failing on PM or cori HSW with gnu?

Also, to make it a little simpler, would you be able to reproduce this failure on anvil or chrysalis, perhaps by changing the GNU version? Debugging on the ANL machines is easiest.

oksanaguba avatar May 18 '22 20:05 oksanaguba

The test passes on chrysalis, which uses GNU v9.3: ERS.ne4_oQU240.F2010.chrysalis_gnu.eam-hommexx

ndkeen avatar May 19 '22 00:05 ndkeen

@ndkeen does cori have GNU version 8 or 9 available? If so, would you run the test with the lower GNU version to see if we can figure out if this is a GNU version issue or an arch-related issue? Thanks.

ambrad avatar May 19 '22 01:05 ambrad

Yes, I had assumed that the version of GNU was the issue. Cori does have 8.3 but, curiously, not 9.

ndkeen avatar May 19 '22 02:05 ndkeen

This test still fails on pm-cpu with July 5th master.

ndkeen avatar Jul 05 '22 22:07 ndkeen

This summary is from the test suites unless indicated. It's not just GNU.

Fails with:
pgi 21.11 (ascent)
gnu 9.1 (ascent)
ibm 16.1.1 (ascent)
gnu 11.2 (crusher)
gnu 11.2 (perlmutter cpu)
pgi 19.10 (compy)
gnu 10.3 (cori-haswell, manual testing)

Passes:
intel 19.0.3 (cori-knl, cori-haswell)
intel 20.0.4 (chrysalis)
gnu 8.2 (anvil)
gnu 8.1 (mappy)
gnu 9.3 (chrysalis, manual testing)

Gnu only (with architecture):
8.1 PASS (Intel)
8.2 PASS (Intel)
9.1 FAIL (Power9)
9.3 PASS (AMD)
10.3 FAIL (Intel)
11.2 FAIL (AMD)

rljacob avatar Sep 21 '22 17:09 rljacob
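
A minimal sketch, not part of the original thread, that re-reads the GNU-only results from the summary above. It checks two simple hypotheses: that a single GNU version cutoff separates PASS from FAIL, and that architecture alone decides the outcome. The function names are illustrative; only the data rows come from the summary.

```python
# GNU-only results copied from the summary: (version, outcome, architecture).
GNU_RESULTS = [
    ("8.1", "PASS", "Intel"),
    ("8.2", "PASS", "Intel"),
    ("9.1", "FAIL", "Power9"),
    ("9.3", "PASS", "AMD"),
    ("10.3", "FAIL", "Intel"),
    ("11.2", "FAIL", "AMD"),
]


def version_key(version):
    """Turn '10.3' into (10, 3) so versions sort numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def has_clean_version_cutoff(results):
    """True if, sorted by version, every PASS comes before every FAIL."""
    ordered = sorted(results, key=lambda row: version_key(row[0]))
    outcomes = [row[1] for row in ordered]
    first_fail = outcomes.index("FAIL") if "FAIL" in outcomes else len(outcomes)
    return "PASS" not in outcomes[first_fail:]


def mixed_architectures(results):
    """Architectures on which both PASS and FAIL were observed."""
    seen = {}
    for _, outcome, arch in results:
        seen.setdefault(arch, set()).add(outcome)
    return sorted(arch for arch, outcomes in seen.items() if len(outcomes) > 1)


if __name__ == "__main__":
    print("clean version cutoff:", has_clean_version_cutoff(GNU_RESULTS))
    print("architectures with both outcomes:", mixed_architectures(GNU_RESULTS))
```

On this data neither hypothesis holds by itself: 9.3 passes after 9.1 fails, and both Intel and AMD show passes and failures.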

ERS.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx still fails with Oct 25 repo.

It also fails with 1 thread: ERS_PMx1.ne4_oQU240.F2010.cori-knl_gnu.eam-hommexx

ndkeen avatar Oct 25 '22 18:10 ndkeen

@ambrad or @oksanaguba any thoughts on this?

rljacob avatar Dec 06 '22 16:12 rljacob

This is Oksana's task. In the past I have suggested switching this to SMS if this won't be debugged right now.

ambrad avatar Dec 06 '22 17:12 ambrad

The test passes on chrysalis for intel and for gnu. I see that Rob added the "potential bug" label; my best guess is that if the init code pieces were arranged slightly differently, the test would pass with all compiler versions. We can either 1) remove the test completely, like Andrew suggested, or 2) keep it as is, or 3) debug it.

I would vote for 2): a passing test on some machines is still valuable. As for option 3), I do not see this as a priority at the moment since we have many other scream tasks.

oksanaguba avatar Dec 06 '22 18:12 oksanaguba

Another option is to convert this test to D, if debug passes on all other machines.

oksanaguba avatar Dec 06 '22 18:12 oksanaguba

Yes, converting it to D would be better than removing it or making it SMS. Are you sure this isn't a real problem? It used to pass everywhere.

rljacob avatar Dec 06 '22 19:12 rljacob

I must say I do not remember it passing everywhere. I remember PGI was a problem from the beginning, but I do not remember the GNU behavior. It is hard to track tests that do not run every day and do not have baselines. Also, some of the machines did not run stably at the time the corresponding PR went in.

Maybe it is a bug, but this works on some machines and in debug. There was a related issue: when I was debugging the forcing logic in hommexx, I tried to make the eam+homme and eam+hommexx runs bfb (bit-for-bit) and failed (the verification had to be done via climo runs instead). I assume this is because some arrangement in the init code causes non-bfb results (standalone homme presumably tests all functional, non-init code).

Considering that this is not needed for scream, I would leave this as is. Moving it to D on all machines would remove the passing opt. build test on chrysalis.

oksanaguba avatar Dec 06 '22 19:12 oksanaguba

What about adding a D test, but keeping the opt. build test too?

oksanaguba avatar Dec 06 '22 19:12 oksanaguba

But our goal is all-green. Having a test be red forever is not an option.

rljacob avatar Dec 06 '22 19:12 rljacob

Actually I just realized this is passing on gcp with gnu 12 so maybe a compiler upgrade on pm-cpu is what we need. In that case, this could stay red while waiting for that.

rljacob avatar Dec 06 '22 19:12 rljacob

I just tried the test on alvarez, which is just like pm-cpu and has GNU 12.1. The test still fails compare. All other tests in e3sm_developer are OK.

ndkeen avatar Dec 06 '22 20:12 ndkeen

Noting the same compare failure with next of July 18th.

ndkeen avatar Jul 18 '23 18:07 ndkeen

I think this test is currently only failing on pm-cpu/nvidia, where it fails compare. Perhaps we can make a bless request? I just have no idea what changed to allow it to pass on other machines. ERS_D.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

ndkeen avatar Jul 27 '23 17:07 ndkeen

I think the problem is that it fails restart-compare, not just baseline-compare, so blessing the diffs won't help.

ambrad avatar Jul 27 '23 17:07 ambrad
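
A toy sketch, not part of the original thread, of the distinction drawn in the comment above. It assumes the usual reading of an ERS test: the restart-compare checks a restarted run against the uninterrupted run within the same test, while the baseline-compare checks the run against stored output from an earlier code version. All function names and numbers here are made up for illustration and are not CIME APIs or real model output.

```python
def restart_compare(full_run, restarted_run):
    """Within-test check: the restarted run must match the uninterrupted run exactly."""
    return full_run == restarted_run


def baseline_compare(full_run, baseline):
    """Cross-version check: the current run must match the stored baseline exactly."""
    return full_run == baseline


def bless(full_run):
    """Blessing accepts the current output as the new stored baseline."""
    return list(full_run)


# Made-up values standing in for history-file fields.
full_run = [1.0, 2.0, 3.0]
restarted_run = [1.0, 2.0, 3.0000001]  # restart is not bit-for-bit
baseline = [1.0, 2.0, 2.9]             # older baseline also differs

if __name__ == "__main__":
    print("before bless:", restart_compare(full_run, restarted_run),
          baseline_compare(full_run, baseline))
    baseline = bless(full_run)
    print("after bless: ", restart_compare(full_run, restarted_run),
          baseline_compare(full_run, baseline))
```

Blessing only changes what the baseline-compare sees. The restart-compare never consults the baseline, so it keeps failing until the restarted run itself reproduces the uninterrupted run.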

Ah. Thanks Andrew. It's been a while since I've looked at this and forgot the issue.

ndkeen avatar Jul 27 '23 17:07 ndkeen

With an Oct 27th checkout, I still see compare failures for:

ERS.ne4_oQU240.F2010.pm-cpu_gnu.eam-hommexx
ERS.ne4_oQU240.F2010.pm-cpu_nvidia.eam-hommexx

ndkeen avatar Oct 27 '23 22:10 ndkeen