ufs-weather-model

gnv1_nested_intel intermittent inability to match against WM baselines on wcoss/hercules/orion

Open zach1221 opened this issue 1 year ago • 13 comments

Description

gnv1_nested_intel frequently fails to match its baselines on WCOSS2, Orion, and Hercules. The test runs to completion, but the results do not match the baseline, even when new baselines are created to ensure that any changes to the test are captured.

To Reproduce:

  1. Log into WCOSS2.
  2. Clone the develop branch of ufs-weather-model.
  3. Adjust rt.conf in ufs-weather-model/tests/ so that only gnv1_nested runs.
  4. Run the test: ./rt.sh -a nems -e -l rt.conf
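
Step 3 can be done by hand, but a small filter is one way to sketch it. The example below assumes rt.conf uses pipe-delimited lines whose first field is COMPILE or RUN, and that each RUN line uses the most recent COMPILE line above it. The function name `keep_only_test` and the sample lines are hypothetical and only illustrate the idea; check them against the real rt.conf before relying on this.

```python
def keep_only_test(conf_text: str, test_name: str) -> str:
    """Keep RUN lines for test_name plus the COMPILE line each depends on.

    Assumption: a RUN line belongs to the closest COMPILE line above it.
    """
    kept = []
    pending_compile = None  # latest COMPILE line not yet written out
    for line in conf_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if fields[0] == "COMPILE":
            pending_compile = line
        elif fields[0] == "RUN" and len(fields) > 1 and fields[1] == test_name:
            if pending_compile is not None:
                kept.append(pending_compile)
                pending_compile = None
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up sample in the assumed format, not copied from the real rt.conf.
SAMPLE_CONF = "\n".join([
    "COMPILE | atm | intel | -DAPP=ATM | | fv3 |",
    "RUN | control_c48 | | baseline |",
    "COMPILE | hafsw | intel | -DAPP=HAFSW | | fv3 |",
    "RUN | hafs_regional_atm | | baseline |",
    "RUN | gnv1_nested | | baseline |",
])

if __name__ == "__main__":
    print(keep_only_test(SAMPLE_CONF, "gnv1_nested"), end="")
```

The filter drops comment lines and unrelated tests, so the output is meant for a scratch copy of rt.conf rather than the tracked file.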

Additional context

The test has failed consistently on WCOSS. It passes occasionally on Orion and Hercules if run repeatedly. Run directory: /work2/noaa/stmp/jongkim/stmp/jongkim/FV3_RT/rt_2634567/gnv1_nested_intel
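
Because the test sometimes passes, a single passing run says little. As a rough guide, and assuming each run fails independently with the same probability (an assumption, not something measured in this issue), the number of repeats needed before trusting a "pass" can be estimated as below. The function name `runs_needed` and the example rates are illustrative only.

```python
import math


def runs_needed(fail_rate: float, miss_prob: float = 0.01) -> int:
    """Smallest n such that (1 - fail_rate) ** n <= miss_prob.

    fail_rate: assumed per-run chance that a broken build fails to match.
    miss_prob: accepted chance that a broken build passes all n runs.
    """
    if not 0.0 < fail_rate <= 1.0:
        raise ValueError("fail_rate must be in (0, 1]")
    if not 0.0 < miss_prob < 1.0:
        raise ValueError("miss_prob must be in (0, 1)")
    if fail_rate == 1.0:
        return 1  # a build that always fails is caught by one run
    return max(1, math.ceil(math.log(miss_prob) / math.log(1.0 - fail_rate)))


if __name__ == "__main__":
    for rate in (0.5, 0.8):
        print(rate, runs_needed(rate))
```

The lower the failure rate on a platform, the more repeats are needed before a clean result is meaningful.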

zach1221, Feb 07 '24 23:02

@BrianCurtis-NOAA would you be able to add your recent WCOSS experiment path to the additional context section?

zach1221, Feb 07 '24 23:02

Hi @SamuelTrahanNOAA. When you have time, could you help us look into this issue?

zach1221, Feb 09 '24 14:02

When did this start happening?

SamuelTrahanNOAA, Feb 09 '24 17:02

For me on WCOSS2, a good few weeks, but it's been intermittent. The baselines are still there, but the test is disabled for WCOSS2 for now. You can re-enable it and compare against those baselines, if that helps.

BrianCurtis-NOAA, Feb 09 '24 17:02

I need to know the specific PR that broke it.

SamuelTrahanNOAA, Feb 09 '24 17:02

I can dig through and find out for you.

zach1221, Feb 09 '24 18:02

WM PR #2098 is the earliest PR I can find where the test failed to match on the first attempt.
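
Locating the first bad commit for an intermittent failure usually means repeating the test at each candidate, since one passing run does not prove a commit is good. A minimal sketch of that idea follows. It assumes that good commits always match, that every commit after the defect carries it, and that `matches_baseline` is a hypothetical callback that builds a commit, runs gnv1_nested once, and reports whether it matched. None of these names come from rt.sh.

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    matches_baseline: Callable[[str], bool],
    repeats: int = 5,
) -> Optional[str]:
    """Binary-search commits (oldest first) for the first one that fails.

    A commit counts as bad if any of `repeats` runs fails to match.
    Returns None if even the newest commit never fails.
    """

    def is_bad(commit: str) -> bool:
        # any() stops at the first mismatch, so bad commits are cheap to flag.
        return any(not matches_baseline(commit) for _ in range(repeats))

    if not commits or not is_bad(commits[-1]):
        return None
    low, high = 0, len(commits) - 1  # invariant: commits[high] is bad
    while low < high:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid + 1
    return commits[low]
```

If the search returns the oldest commit in the list, the defect may predate the range that was searched. With too few repeats, a bad commit can be misjudged as good and the search will land too late in the history.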

zach1221, Feb 13 '24 15:02

PR 2098 changes some NSSL microphysics code. The regression test never uses that code. It is likely that either:

  1. A prior PR broke it, or
  2. This problem has always been there, but we didn't notice it until recently.

Debugging a problem like this is difficult when one cannot run in debug mode. UFS nesting does not work in debug mode. It can't even compile with the GNU compiler due to syntax errors. (For example, using . instead of % to access derived type members.)

SamuelTrahanNOAA, Feb 13 '24 22:02

@SamuelTrahanNOAA Are you OK with me closing this issue? It would mean gnv1_nested remains disabled on WCOSS, Hercules, and Orion.

zach1221, Feb 26 '24 20:02

No. That regression test must run on all platforms. We must find out why it is failing.

SamuelTrahanNOAA, Feb 26 '24 21:02

Ok, no problem. What do you think is the best way to investigate this further without being able to run in debug?

zach1221, Feb 26 '24 21:02

The only way I can think of is to get debug mode working with UFS FV3 nesting.

SamuelTrahanNOAA, Feb 26 '24 21:02

I found two bugs in the nesting:

  • https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/issues/328
  • https://github.com/NOAA-EMC/fv3atm/issues/797

I've got fixes for both of them which I'll PR soon. It's unlikely those will fix this issue since they're both "it crashes or it runs" sorts of bugs.

SamuelTrahanNOAA, Mar 11 '24 18:03