E3SM
MVK test failing
The MVK non-bfb test has been failing for a while on Chrysalis.
https://my.cdash.org/test/60574720
Seems to be timing out?
Possibly same problem as https://github.com/E3SM-Project/E3SM/issues/5122 and so should have same solution https://github.com/E3SM-Project/E3SM/pull/5125
@rljacob In my testing, it's almost certainly the same issue. I've been able to successfully run the MVK test with the ~~#5125~~ #5123 fix.
Update: at some point MVK started passing sometimes on Chrysalis. I don't think the #5125 (edit) fix was ever applied, so I'm not sure why. But it still fails, I think because it's asking for too many nodes and not running.

@rljacob -- we did merge #5123 last December, so it's not that
Looking at the latest run, https://my.cdash.org/test/76691852, is this relevant?
Model elm no file specified for finidat
Perhaps not. Scrolling back, I see a test that failed after actually running (the one above didn't run) and that posted a similar warning(?): https://my.cdash.org/test/76461137
But it did take some effort to get the results blessed, because #5123 changed output file names and blessing it required some special commands from @jgfouca: SES-2269 09/Feb/23. So that's when it started to pass, at least sometimes.
And at least some of the FAILs look like:
2023-03-01 05:37:33: BASELINE FAIL for test 'JNextAtm_nbfb20230301_003435'.
Test status: fail; Variables analyzed: 121; Rejecting: 24; Critical value: 13; Ensembles: statistically different
EVV results can be viewed at:
https://web.lcrc.anl.gov/public/e3sm/e3smtest/evv/MVK_PS.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20230301_003435/index.html
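For anyone reading the status line above, here is a minimal sketch of how those counts relate to the verdict. It assumes the rule is "the ensembles are declared different when the number of rejecting variables exceeds the critical value"; that rule and the helper names `parse_mvk_status` and `ensembles_differ` are my own illustration, not code taken from EVV or CIME.

```python
import re

# Matches the count fields in the status line quoted in this thread.
STATUS_RE = re.compile(
    r"Variables analyzed: (?P<analyzed>\d+); "
    r"Rejecting: (?P<rejecting>\d+); "
    r"Critical value: (?P<critical>\d+)"
)


def parse_mvk_status(line):
    """Pull the three integer counts out of an MVK status line."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("not an MVK status line: %r" % line)
    return {name: int(value) for name, value in match.groupdict().items()}


def ensembles_differ(counts):
    """Assumed decision rule: too many per-variable rejections means a fail."""
    return counts["rejecting"] > counts["critical"]


line = ("Test status: fail; Variables analyzed: 121; Rejecting: 24; "
        "Critical value: 13; Ensembles: statistically different")
counts = parse_mvk_status(line)
print(counts, ensembles_differ(counts))
```

With the line from the 2023-03-01 run, 24 rejecting variables against a critical value of 13 is consistent with the reported "statistically different" result.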
The current run of that test is still in the queue:
306069 compute test.MVK e3smtest PD 0:00 90 (Priority)
The number of nodes needs to be reduced. Chrysalis is too busy to get 90 nodes every night.
@jonbob I meant #5125 has not been applied to MVK. I edited the comment.
I think test failures are okay; they mean that something on next changed. I guess lack of compute nodes is the reason for "not running" then.
Adding @mkstratos to this thread. These tests were tuned for v1 and we are currently working to check their sensitivity with v2 - to ensure that when they fail (when it's not timing out), the failure is more likely due to a change in climate statistics.
I think there may also be failures due to memleak, like this one which passes statistically: https://my.cdash.org/test/76012860
I thought memleak need not necessarily result in test failure.
Right, memleak is not a fail. https://my.cdash.org/test/76012860 should be reported as PASS, but I think the string "OLD FAIL" on the last line is being interpreted as a test fail. That's a bug.
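To illustrate the kind of bug suspected here, a rough sketch contrasting a substring check with a check on the leading status token. The sample line and its layout are invented for illustration and are not the actual CIME or CDash output format; both function names are hypothetical as well.

```python
def naive_is_fail(line):
    """Substring check: trips on any occurrence of 'FAIL', including 'OLD FAIL'."""
    return "FAIL" in line


def status_token_is_fail(line):
    """Only treat the line as a failure when its first token is exactly 'FAIL'."""
    tokens = line.split()
    return bool(tokens) and tokens[0] == "FAIL"


# Invented example line: a passing status that merely mentions an old failure.
sample = "PASS example_test MEMLEAK note: OLD FAIL"

print(naive_is_fail(sample))         # substring match flags it
print(status_token_is_fail(sample))  # token match does not
```

If the reporting side does something like the first function, a passing run whose last line mentions "OLD FAIL" would be flagged, which would match what the dashboard shows.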