E3SM icon indicating copy to clipboard operation
E3SM copied to clipboard

Use adjust_ps=false in Cxx theta-l_kokkos code

Open tcclevenger opened this issue 2 years ago • 45 comments

For theta-l_kokkos's ForcingFunctor class always use m_adjust_ps=false (and actually remove m_adjust_ps variable). Currently m_adjust_ps is hard-coded true, so any Cxx unit tests using ForcingFunctor will not be BFB.

This exposed an error where the Cxx code was not using the correct dp_adj for computing hydrostatic pressure (code was never tested since it was hard-coded m_adjust_ps=true).

In F90, change definition of adjust_ps to match the Cxx version in unit tests. This makes F90 unit tests non BFB (like Cxx mentioned above), but cxx_vs_f90 is BFB. Also hard-code adjust_ps=true for #ifdef CAM so this PR is not answer changing for EAM runs.

Requested by @mt5555.

[non-BFB] for HOMME unit tests.

tcclevenger avatar Nov 10 '23 18:11 tcclevenger

Maybe we want adjust_ps to be false in v3? @golaz or @wlin7 ?

rljacob avatar Nov 10 '23 18:11 rljacob

This adjust_ps logical has come up in other PRs (https://github.com/E3SM-Project/E3SM/pull/4717) and the comment is always "we're not going to change it for EAM because we don't want to change v2 answers". We're assembling v3 now so time to reconsider that?

Also this issue https://github.com/E3SM-Project/scream/issues/1343 says "E3SM should also migrate to adjust_ps=.false., but this is on hold in order to preserve buggy V2 behavior."

rljacob avatar Nov 10 '23 19:11 rljacob

To Rob's question: Yes, EAM also has this bug and EAM should adopt adjust_ps=.false. But I didn't want to suggest it since it's a small effect and we are so close to the V3 deadlines.

mt5555 avatar Nov 10 '23 23:11 mt5555

Will this be answer changing for SCREAM simulations?

ambrad avatar Nov 13 '23 17:11 ambrad

Will this be answer changing for SCREAM simulations?

When we merge for EAMxx it will be answer changing for C++, not otherwise (#ifdef SCREAM here still has same behavior).

tcclevenger avatar Nov 13 '23 17:11 tcclevenger

Ok, and everyone has agreed to that? I ask because it looks like the old method is being removed from C++, so we won't be able to switch back if there are issues.

ambrad avatar Nov 13 '23 17:11 ambrad

I was about to ask whether we want climo for this change, to document it, even if we can do climo only in eam. I ran climo for this long time ago, so the best would be to rerun https://acme-climate.atlassian.net/wiki/spaces/COM/pages/1632864081/Climo+6-year+runs+for+ps+adjustment+dp3d+adjustment+T+adjustment+choices .

oksanaguba avatar Nov 13 '23 18:11 oksanaguba

Correct - (removing old code). This has been tested in SCREAM v0.1, but then the bug was accidentally turned back on in v1. so it hasn't had extensive testing in SCREAM v1, but it is low risk.

mt5555 avatar Nov 13 '23 18:11 mt5555

Also, can we change the description a little? Something like

This PR addresses the issue that the moist pressure forcing was applied to ps instead of dp3d (design inherited from CAM).
The change affects all runs in standalone homme with forcing, but adjusting ps is hardcoded for EAM runs with F executable for now.
When this is pulled into scream, it will use dp3d adjustment in EAMXX, not ps adjustment.

nonbfb for homme runs with forcings against baselines, should be bfb for cxx-vs-f tests.

@tcclevenger did you run homme suites for this? if so, would you please post the output from chrysalis and weaver here? thanks!

oksanaguba avatar Nov 13 '23 18:11 oksanaguba

Though my comment above is suggestive, i actually think we should rerun climo for this change. If @tcclevenger does not want to, i can do it.

oksanaguba avatar Nov 13 '23 18:11 oksanaguba

@oksanaguba I have run all theta-f* tests on my workstation. I am struggling with weaver at the moment. Have you successfully run there since the drivers and modules were updated? I will continue to try, or move to summit.

And I do not have access to chrysalis.

tcclevenger avatar Nov 13 '23 18:11 tcclevenger

@tcclevenger i haven't used weaver for a while.

oksanaguba avatar Nov 13 '23 19:11 oksanaguba

Waiting on @tcclevenger to test on GPU.

rljacob avatar Dec 07 '23 18:12 rljacob

An update: I now have homme building on weaver, but the F90 theta-f* tests I'm trying to run appear to be hanging, so I'm not able to produce baselines or run the cxx_vs_f90 tests. I'm working on figuring this out.

As for summit, I'm running into an internal compiler error whose solution doesn't seem trivial. I'll most likely continue on the weaver path.

tcclevenger avatar Dec 11 '23 16:12 tcclevenger

Is the Summit ICE for the standalone-Homme build? Does it occur with this configuration?

machinefile=$e3sm/components/homme/cmake/machineFiles/summit-bfb.cmake
cmake -C $machinefile $e3sm/components/homme

This configuration worked for me with the master branch of a few weeks ago; you might check the master branch in addition to your branch.

You should also merge this PR into the SCREAM repo and run the CIME SCREAMv1 test suite, e.g., the specific test ERS_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.

Another point: The ICE might depend on modules. You can source the .env_mach_specific.sh file from a CIME test to get an environment to use when building standalone Homme.

ambrad avatar Dec 11 '23 16:12 ambrad

@oksanaguba @ambrad Looks like my issue on summit was the modules! Loading them through CIME has it building without error. Thanks for the help!

tcclevenger avatar Dec 11 '23 17:12 tcclevenger

Ran the theta-f* tests on summit, all cxx_vs_f90 tests pass, the following tests fail with diffs in 5 of 21 fields (kokkos and f90 are identical).

  • theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-ne2-nu3.4e18-ndays1
    • Diff range (normalized): [4e-15, 2e-11]
  • theta-fhs1-ne2-nu3.4e18-ndays1
    • Diff range (normalized): [4e-10, 2e-6]
  • theta-fhs2-ne2-nu3.4e18-ndays1
    • Diff range (normalized): [1e-9, 2e-4]
  • theta-fhs3-ne2-nu3.4e18-ndays1
    • Diff range (normalized): [1e-10, 2e-6] So all tests give small diffs in same fields.

I also merged into EAMxx and ran AT test suite on weaver and mappy. Weaver passed, Mappy failed with diffs in CIME cases (expected) with normalized diffs in range [2e-5, 1]. I also ran same CIME cases on summit and diffs existed in the same fields as CPU and the same values.

@mt5555 Do these numbers seem reasonable (if that is possible to know), or are there any simulations I need to run to test the output?

@oksanaguba can you run the Chrysalis tests. I don't have access yet.

tcclevenger avatar Dec 13 '23 22:12 tcclevenger

Ran climo for model-vs-model for EAM with the default setup (adjust_ps = true, master branch) and with adjust_ps = false. The climo is here https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/onguba/theta/eam-ne30-dpadj.443759.0002-0006/def/viewer/ , not sure if these diffs are considered small, but they are documented.

oksanaguba avatar Dec 20 '23 21:12 oksanaguba

@oksanaguba Do you know if there are cxx vs. F90 tests using adjust_ps=false in nightly tests?

tcclevenger avatar Dec 21 '23 18:12 tcclevenger

@tcclevenger looking at the code, homme standalone tests with forcing were tested only with adjust_ps = true in F and xx. Related to this and to Mark's comment, when you test this PR, only baseline comparisons should fail, but cxx-vs-F should not fail (both now will be using adjust_ps = false). There is a comment in the PR description that "unit tests are now nonbfb" -- not sure what you mean. Unit tests run smaller chuncks of code and test cxx kernels against F code, often only in the setup that does not cover all possibilities. If you see nonbfb in forcing unit test, that means that 1) these is a bug in cxx code, or 2) params were not set up exactly . Either way it has to be addressed. The same is for cxx-vs-f tests, they shgould always pass. The only fails should be here from comparisons with baselines.

oksanaguba avatar Dec 21 '23 20:12 oksanaguba

@oksanaguba Yes, I tested all the theta-f*ne2 tests. The fails were in

theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-ne2-nu3.4e18-ndays1
theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-kokkos-ne2-nu3.4e18-ndays1
theta-fhs1-ne2-nu3.4e18-ndays1
theta-fhs1-kokkos-ne2-nu3.4e18-ndays1
theta-fhs2-ne2-nu3.4e18-ndays1
theta-fhs2-kokkos-ne2-nu3.4e18-ndays1
theta-fhs3-ne2-nu3.4e18-ndays1
theta-fhs3-kokkos-ne2-nu3.4e18-ndays1

which were all baseline tests. All the cxx_vs_f90 tests pass. I'm confident the F90 matches the C++, but neither match their baselines on master (which is expected).

tcclevenger avatar Dec 21 '23 20:12 tcclevenger

adding some notes from conversation with @oksanaguba

baseline DIFFs expected for all standalone HOMME tests that include forcing and have rsplit>0. Thus all the diffs in the theta-fhs* tests above are expected.

For HOMME tests without forcing (such as the one below) we would expect no differences. However, with this PR the forcing code does some calculations with a different code path, applying a forcing of strength 0. Thus it is possible that it introduces roundoff changes due to the different code path. if the baselines agree on Chrysalis, I think it is safe to conclude this baseline diff on summit is roundoff

theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-ne2-nu3.4e18-ndays1 theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-kokkos-ne2-nu3.4e18-ndays1

mt5555 avatar Dec 21 '23 21:12 mt5555

revised PR looks great.

@oksanaguba will run the test suite on Chrysalis and then I think this is ready to go!

mt5555 avatar Dec 21 '23 23:12 mt5555

An update: There are issues with testing on chrysalis, i am trying to fix forcing ut for now.

oksanaguba avatar Dec 22 '23 21:12 oksanaguba

This PR fails on chrysalis in forcing_ut and in some sl tests. For the forcing test, the issue is not that F and cxx are not bfb (i believe they are), but that answers become nans. Nans appear because hydrostatic pressure is computed in applycam_forcing twice, but in different ways. To compute pprime, first phydro is computed via

  do k=1,nlev
      phydro(:,:,k)=hvcoord%ps0*hvcoord%hyam(k) + ps(:,:)*hvcoord%hybm(k)
   enddo

which uses hybrid coefs. Then phydro is computed via

         ! recompute hydrostatic pressure from dp3d
         call get_hydro_pressure(phydro,elem%state%dp3d(:,:,:,np1),hvcoord)

to add pprime back. The issue is that in unit tests, dp3d, ps, and hybrid coefs are not in sync. In unit tests, pprime is way off if phydro is computed via A, B (because in unit tests random init does not use A and B). When pprime is added back after moisture adjustment, and pprime is sometimes negative, it can lead to negative pressure (because phydro via get_hydro_pressure is very different from the one from A, B).

  1. One obvious solution is to sync random init in unit tests -- that is, use random ps value and compute dp from A and B (right now, dp is randomized instead).

  2. Strategically I think a better solution is to rewrite

phydro(:,:,k)=hvcoord%ps0*hvcoord%hyam(k) + ps(:,:)*hvcoord%hybm(k)

to use dp, not hybrid. This is not a nonbfb change, but we should move away from using hybrid there. Note that overall it is not a bug (yet), because the code in EAM/scream is called when dp is in sync with hybrid coefficients.

I can see that solution #1 will be picked up, so I would ask @tcclevenger to implement it.

I will look at the other fails next.

oksanaguba avatar Jan 02 '24 00:01 oksanaguba

Thanks @oksanaguba! I'll work on implementing solution 1

tcclevenger avatar Jan 02 '24 16:01 tcclevenger

To add why this was not triggered before -- it is because both codes, F and cxx were using only adjust_ps=true option. After this PR forcing_ut will only check adjust_ps=false option.

oksanaguba avatar Jan 02 '24 17:01 oksanaguba

@tcclevenger i forgot to mention one detail: we need this fix in forcing_ut.cpp file, too (this is because in F remap factor by default is -1 and in cxx it is actually 0)

  // Init everything through singleton, which is what happens in normal runs
  auto& c = Context::singleton();
  auto& p = c.create<SimulationParams>();

  p.dt_remap_factor = 1;       <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< THIS

Can you push it into your branch? thanks.

oksanaguba avatar Jan 02 '24 17:01 oksanaguba

Failing tests are theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-sl-ne?-nu3.4e18-ndays1 and fails are against baselines, not cxx-vs-F. These test have zero forcing and they are NH. First i was alarmed that only SL tests fail, but turns out tests theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-ne?-nu3.4e18-ndays1 (not SL), and other no-forcing, NH or HY, not-SL tests do not set use_moisture=true. Turns out use_moisture=true leads to differences in runs with adjust_ps=true and adjust_ps=false in test theta-f1-tt10-hvs1-hvst0-r2-qz10-nutopoff-GB-ne2-nu3.4e18-ndays1 (not SL). use_moisture in the code affects how homme forcings will be computed, because all depend on dp, and if use_moisture=false, then dp/pressures are not recomputed.

Both options, adjust_ps=true and adjust_ps=false, should lead to near zero forcing in homme, but because of how homme uses forcing (converting T tend into homme state tendencies), homme tendencies from zero FQ and FT won't necessarily be identically zero. One issue is that there are nonlinear relationships between FT, FQ and FVtheta and FPhi. I actually could not trace the differences between adjust_ps=true and adjust_ps=false at the very first step (gave up before trying to figure out whether fortran prints all significant digits), but the differences show up at the 2nd call to applycamforcing_tracers.

i am more confident now that the PR behaves as expected, but we should switch to the "wet" option in non-SL tests as well,

diff --git a/components/homme/test/reg_test/namelists/theta.nl b/components/homme/test/reg_test/namelists/theta.nl
index c67923edeb..90876d48e2 100644
--- a/components/homme/test/reg_test/namelists/theta.nl
+++ b/components/homme/test/reg_test/namelists/theta.nl
@@ -35,6 +35,7 @@ hypervis_subcycle_tom  = ${HOMME_TEST_HVS_TOM}
 theta_hydrostatic_mode = ${HOMME_THETA_HY_MODE}
 theta_advect_form = ${HOMME_THETA_FORM}
 tstep_type        = ${HOMME_TTYPE}
+moisture = 'wet'
 /
 &solver_nl
 precon_method = "identity"

oksanaguba avatar Jan 02 '24 23:01 oksanaguba

@oksanaguba What is an appropriate range for randomizing ps values?

I'm backing working on this today so hopefully we will wrap it up soon. Thanks for all your help!

tcclevenger avatar Jan 05 '24 16:01 tcclevenger