feature test issues for rrfs_smoke_conus13km_hrrr_warm

Open junwang-noaa opened this issue 2 years ago • 21 comments

Description

PR #1195 added a feature test, rrfs_smoke_conus13km_hrrr_warm, using suite file FV3_HRRR_smoke. The test owner needs to confirm that the feature test reproduces results with different threads, decompositions, and MPI task counts, and in restart mode. It should also run in debug mode. Currently the test fails in the decomposition and debug variants.

To Reproduce:

Check out the branch in PR #1195 and run rrfs_smoke_conus13km_hrrr_warm with different threading, decomposition, and MPI task counts, as well as in restart mode and debug mode.

Additional context

  • a related issue #953 on RAP test.

junwang-noaa avatar May 17 '22 15:05 junwang-noaa

The issue was fixed in PR#1257. The issue will be closed.

junwang-noaa avatar Jul 13 '22 13:07 junwang-noaa

@junwang-noaa This was NOT fixed in #1257. Please re-open this issue so I don't have to make a new one.

SamuelTrahanNOAA avatar Jul 13 '22 14:07 SamuelTrahanNOAA

Sorry, I see the PR #1257 fixed the reproducibility for hrrr_control, not rrfs_smoke_conus13km_hrrr_warm.

junwang-noaa avatar Jul 13 '22 14:07 junwang-noaa

Actually, the hrrr_control variants already worked, they just weren't enabled. The reproducibility fix in that PR was for the rap_decomp.

SamuelTrahanNOAA avatar Jul 13 '22 14:07 SamuelTrahanNOAA

Can this issue be closed @junwang-noaa @SamuelTrahanNOAA ?

DeniseWorthen avatar Aug 14 '22 19:08 DeniseWorthen

No. This problem is not resolved.

SamuelTrahanNOAA avatar Aug 15 '22 10:08 SamuelTrahanNOAA

I can fix the debug and 2threads variants in this PR: https://github.com/ufs-community/ufs-weather-model/pull/1437 Sadly, as yet, I have no fix for the restart or decomp variants.

However, I suspect this bug may be breaking decomp: https://github.com/ufs-community/ufs-weather-model/issues/1436 if it is using data from halo regions. I have no way to fix that bug, nor even confirm my suspicions, since that code goes well beyond my understanding of the boundary generation.

SamuelTrahanNOAA avatar Sep 22 '22 18:09 SamuelTrahanNOAA

I tested rrfs_smoke_conus13km_hrrr_warm with the decomposition, restart, and MPI features (debug and 2threads should already pass now that #1437 is merged), and everything appeared to pass. @SamuelTrahanNOAA, have you had the opportunity to test again recently?

zach1221 avatar Jun 02 '23 17:06 zach1221

They fail for me. How did you test?

You need to use the tests/tests files, not just change environment variables. The RRFS tests ignore several environment variables, and they're always warm starts.

SamuelTrahanNOAA avatar Jun 02 '23 17:06 SamuelTrahanNOAA

The RRFS has hard-coded values for some variables. If you're using an automated tool that tweaks variables, it won't test anything.

These values are hard-coded:

export INPES=12
export JNPES=12
export WARM_START=.true.

All RRFS runs are warm starts.

To do a restart test, you need to set RRFS_RESTART=YES. For a decomposition test, you need a different tests/tests file with different values for INPES and JNPES.
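As a minimal sketch of why tweaking environment variables has no effect, the snippet below mimics a test file that hard-codes its settings. The function name `load_rrfs_test_settings` is a hypothetical stand-in for sourcing a tests/tests file; only the three `export` lines come from this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for sourcing a tests/tests file that hard-codes its layout.
load_rrfs_test_settings() {
    export INPES=12
    export JNPES=12
    export WARM_START=.true.
}

# A tool (or a user) requests a different decomposition and a cold start...
export INPES=6
export JNPES=24
export WARM_START=.false.

# ...but the hard-coded exports run afterwards and overwrite those values.
load_rrfs_test_settings
echo "INPES=${INPES} JNPES=${JNPES} WARM_START=${WARM_START}"
```

Under these assumptions the run still uses the 12x12 layout and a warm start, which is why a decomposition test needs its own tests/tests file.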

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

I just retested hera.gnu and I can confirm the situation is unchanged. I'd like to know how @zach1221 ran the tests. This is not the first time someone has configured the RRFS tests incorrectly and falsely reported that the restart and decomp variants work. Is the tool "opnReqTest"? If so, I'll add an "if" statement to rrfs_warm_run.IN to abort the test if that tool is enabled.
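A rough sketch of what such a guard could look like follows. The flag name `OPNREQ_TEST` and the function name are assumptions for illustration; this thread does not say how the tool announces itself, so the real check may differ.

```shell
#!/bin/sh
# Hypothetical guard: OPNREQ_TEST is an assumed flag name, not confirmed in this thread.
abort_if_variable_tweaking_tool() {
    if [ "${OPNREQ_TEST:-false}" = "true" ]; then
        echo "ERROR: RRFS tests hard-code INPES, JNPES and WARM_START;" >&2
        echo "       use a dedicated tests/tests file instead of tweaking variables." >&2
        return 1
    fi
    return 0
}

# Example: only proceed when the guard passes.
if abort_if_variable_tweaking_tool; then
    echo "guard passed, test may proceed"
fi
```

Failing loudly here would keep a misconfigured run from being reported as a passing restart or decomp test.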

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

@SamuelTrahanNOAA I see. Well I guess I tested incorrectly. I was just running the tests sequentially out of rt.conf in tests/. Like, ./rt.sh -a nems -n rrfs_smoke_conus13km_hrrr_warm_debug_decomp intel or ./rt.sh -a nems rrfs_smoke_conus13km_hrrr_warm_restart, etc.

I'll try again with the steps you provided to reproduce. Thank you!

zach1221 avatar Jun 02 '23 18:06 zach1221

I haven't tried that before.

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

Use this:

COMPILE | 13 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_flake,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON | | fv3 |

RUN | rrfs_smoke_conus13km_hrrr_warm                    |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_2threads           |                            |          |
RUN | rrfs_conus13km_hrrr_warm                          |                            | baseline |
RUN | rrfs_smoke_conus13km_radar_tten_warm              |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_decomp             |                            |          |
RUN | rrfs_smoke_conus13km_hrrr_warm_restart            |                            |          | rrfs_smoke_conus13km_hrrr_warm
RUN | rrfs_conus13km_hrrr_warm_restart_mismatch         |                            | baseline | rrfs_conus13km_hrrr_warm

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

@SamuelTrahanNOAA thanks, again. Let me try that now.

zach1221 avatar Jun 02 '23 18:06 zach1221

My branch was not up-to-date with develop, so that test didn't check if the latest version works. It seems the regression test system has changed substantially. I'll have to check if it's even running those tests correctly.

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

The 2threads test doesn't use 2 threads anymore, but the decomp test still changes the decomposition.

SamuelTrahanNOAA avatar Jun 02 '23 18:06 SamuelTrahanNOAA

The restart and decomp do not match the control, but they are executed correctly.

It looks like the 2threads is using ESMF to turn on threading, without providing the mandatory OMP_NUM_THREADS variable that sets the maximum number of threads available to ESMF. I will try correcting this and see if it still passes.
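One way to make the thread count explicit is sketched below. The helper name `set_omp_threads` is hypothetical and not part of the regression test scripts; it only shows OMP_NUM_THREADS being validated and exported before a threaded run.

```shell
#!/bin/sh
# Hypothetical helper: validate a requested thread count and export OMP_NUM_THREADS.
set_omp_threads() {
    thrd="${1:-}"
    case "$thrd" in
        ''|0|*[!0-9]*)
            echo "ERROR: thread count must be a positive integer, got '${thrd}'" >&2
            return 1
            ;;
    esac
    export OMP_NUM_THREADS="$thrd"
}

# Example: request two threads, as the 2threads variant is meant to do.
set_omp_threads 2 && echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```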

SamuelTrahanNOAA avatar Jun 02 '23 19:06 SamuelTrahanNOAA

The 2threads test still passes if I set OMP_NUM_THREADS (THRD) to 2

SamuelTrahanNOAA avatar Jun 02 '23 19:06 SamuelTrahanNOAA

The debug_decomp test (rrfs_smoke_conus13km_hrrr_warm_debug_decomp_intel) also fails.

SamuelTrahanNOAA avatar Jun 02 '23 20:06 SamuelTrahanNOAA