MVK test failing

Open rljacob opened this issue 3 years ago • 2 comments

The MVK non-bfb test has been failing for a while on Chrysalis.

https://my.cdash.org/test/60574720

Seems to be timing out?

rljacob avatar Aug 18 '22 20:08 rljacob

Possibly the same problem as https://github.com/E3SM-Project/E3SM/issues/5122, so it should have the same solution: https://github.com/E3SM-Project/E3SM/pull/5125

rljacob avatar Aug 18 '22 20:08 rljacob

@rljacob In my testing, it's almost certainly the same issue. I've been able to successfully run the MVK test with the ~~#5125~~ #5123 fix.

mkstratos avatar Aug 18 '22 20:08 mkstratos

Update: at some point MVK started passing sometimes on Chrysalis. I don't think the #5125 (edit) fix was ever applied, so I'm not sure why. But it still fails, I think because it's asking for too many nodes and not running.

[Screenshot 2023-03-29 at 12 08 27 PM]

rljacob avatar Mar 29 '23 17:03 rljacob

@rljacob -- we did merge #5123 last December, so it's not that

jonbob avatar Mar 29 '23 17:03 jonbob

Looking at the latest run, https://my.cdash.org/test/76691852: is this relevant?

Model elm no file specified for finidat

Perhaps not. Scrolling back, I see a test that failed after running (the one above didn't run) and posted a similar warning(?): https://my.cdash.org/test/76461137

sarats avatar Mar 29 '23 17:03 sarats

But it did take some effort to get the results blessed, because #5123 changed output file names and blessing it required some special commands from @jgfouca (SES-2269, 09/Feb/23). So that's when it started to pass, at least sometimes.

jonbob avatar Mar 29 '23 17:03 jonbob

And at least some of the FAILs look like:

2023-03-01 05:37:33: BASELINE FAIL for test 'JNextAtm_nbfb20230301_003435'.
    Test status: fail; Variables analyzed: 121; Rejecting: 24; Critical value: 13; Ensembles: statistically different
    EVV results can be viewed at:
        https://web.lcrc.anl.gov/public/e3sm/e3smtest/evv/MVK_PS.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20230301_003435/index.html

jonbob avatar Mar 29 '23 17:03 jonbob
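The FAIL summary quoted above can be read mechanically. This is a minimal sketch, assuming the status line always uses the `Key: value; Key: value` layout shown in that log, and assuming (this thread does not confirm it) that the ensembles are reported as statistically different when the number of rejecting variables exceeds the critical value. The function names are hypothetical and are not part of EVV or CIME.

```python
# Status line copied from the BASELINE FAIL message quoted in this thread.
STATUS_LINE = (
    "Test status: fail; Variables analyzed: 121; Rejecting: 24; "
    "Critical value: 13; Ensembles: statistically different"
)


def parse_mvk_status(line):
    """Split a 'Key: value; Key: value' summary line into a dict of strings."""
    fields = {}
    for part in line.split(";"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def exceeds_critical(fields):
    """Assumed rule: more rejecting variables than the critical value."""
    return int(fields["Rejecting"]) > int(fields["Critical value"])


fields = parse_mvk_status(STATUS_LINE)
print(fields["Test status"], exceeds_critical(fields))
```

For the quoted run, 24 rejecting variables against a critical value of 13 is consistent with the reported "statistically different" outcome under that assumed rule.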

The current run of that test is still in the queue: 306069 compute test.MVK e3smtest PD 0:00 90 (Priority)

The number of nodes needs to be reduced. Chrysalis is too busy to get 90 nodes every night.

rljacob avatar Mar 29 '23 17:03 rljacob

@jonbob I meant #5125 has not been applied to MVK. I edited the comment.

rljacob avatar Mar 29 '23 17:03 rljacob

I think test failures are okay; they mean that something on next changed. I guess lack of compute nodes is the reason for "not running" then.

sarats avatar Mar 29 '23 17:03 sarats

Adding @mkstratos to this thread. These tests were tuned for v1, and we are currently working to check their sensitivity with v2, to ensure that when they fail (when it's not timing out), the failure is more likely due to a change in climate statistics.

salilmahajan avatar Mar 29 '23 18:03 salilmahajan

I think there may also be failures due to memleak, like this one which passes statistically: https://my.cdash.org/test/76012860

mkstratos avatar Mar 29 '23 18:03 mkstratos

I thought memleak need not necessarily result in test failure.

sarats avatar Mar 29 '23 19:03 sarats

Right, a memleak is not a fail. https://my.cdash.org/test/76012860 should be reported as PASS, but I think the string "OLD FAIL" on the last line is being interpreted as a test fail. That's a bug.

rljacob avatar Mar 29 '23 23:03 rljacob
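The suspected misclassification can be illustrated with a small sketch. Everything in it is hypothetical: the status lines are made up for illustration and are not real test output, and the two functions are not the actual reporting code. It only shows how a substring search for "FAIL" would flag a line that contains "OLD FAIL" even though every phase passed, while comparing the leading status token would not.

```python
# Hypothetical status lines, invented for illustration (not real test output).
LINES = [
    "PASS example_test RUN",
    "PASS example_test MEMLEAK previous result: OLD FAIL",
]


def naive_failed(lines):
    """Buggy check: any occurrence of 'FAIL' in a line counts as a failure."""
    return any("FAIL" in line for line in lines)


def token_failed(lines):
    """Stricter check: only the first whitespace-separated token is the status."""
    return any(line.split()[0] == "FAIL" for line in lines if line.strip())


print(naive_failed(LINES))
print(token_failed(LINES))
```

If the real parser behaves like the naive check, matching on the status token alone would be one way to avoid counting "OLD FAIL" as a failure.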