energy fixer impl for comments
the last commit is bfb on chryalis, no threads, 64,32, 17, and 3 ranks
[ac.onguba@chrlogin2 apr21]$ zgrep hash ~/ess/apr21b/run/e3sm.log.* | grep "738 "
/home/ac.onguba/ess/apr21b/run/e3sm.log.731963.250421-153625.gz: 0: bfbhash> 738 b1fcc15e61c1c6d8 (Hommexx)
/home/ac.onguba/ess/apr21b/run/e3sm.log.731969.250421-160131.gz: 0: bfbhash> 738 b1fcc15e61c1c6d8 (Hommexx)
/home/ac.onguba/ess/apr21b/run/e3sm.log.732649.250422-131834.gz: 0: bfbhash> 738 b1fcc15e61c1c6d8 (Hommexx)
/home/ac.onguba/ess/apr21b/run/e3sm.log.732655.250422-140013.gz: 0: bfbhash> 738 b1fcc15e61c1c6d8 (Hommexx)
here is example of log entries for the fixer
EAMxx:: energy fixer, PB T tendency -0.000845
EAMxx:: energy fixer, total energy before fix 32484931544.607445
EAMxx:: energy fixer, rel energy error after fix 4.32541348833857e-17
PB T tendency is the peanut-buttered value of T tend applied everywhere in the domain.
@bartgol @ambrad @mahf708 -- here is a draft for the fixer (but without 4-columns summation). do you have any suggestions so far?
Here are WIP items so far:
-- Put line 'atm_proc->add_column_conservation_check(conservation_check, fail_handling_type);' in the same or similar if-statement as for column checks (i haven't figure out whether the same if-statements are enough).
-- Decide what to do about 4-columns summation
-- Added 2 new views, could i get away without them? m_new_energy_for_fixer = view_1d<Real> ("new energy fixer", m_num_cols); m_energy_change = view_1d<Real> ("energy change fixer", m_num_cols);
-- Field Layouts/views code is rather cumbersome, but maybe that is common in AD.
-- I haven't tried standalone scream, there is a question what to do about components/eamxx/src/share/util/eamxx_repro_sum_mod.F90 and '#ifdef CAM'.
Going back to this -- CI output looks ok (I will look at formatting suggestions separately). I'd like to keep 3 reviewers' suggestions for the next PR -- combining calls in reprosum, using grid->comm, and renaming PB with something meaningful.
Also, once this is approved, I may need to do a new branch with cleaner history. Do you want to look at this again, @bartgol @ambrad @mahf708 ? I took Luca's advice and used CI testing, and did not run any testing by myself.
Does this feature have an associated test? If not, I recommend adding one.
It's up to you whether to run tests on machines. I personally like to run a good CIME test across machines so there aren't GPU-machine surprises when I integrate one of my PRs.
Re: git history, note you can save this current branch to a new branch in your fork, then rebase this one and force-push it. That will let you keep the same PR and also save this version of the branch.
Re: running tests. As Andrew said, it's up to you. Using CI saves me initial debugging, but if you want to be extra safe, running a CIME case on pm-gpu can be beneficial. Given the nature of your changes, I don't expect any surprise compared to the CI tests, provided that you enabled energy fixer in one of the standalone tests (so that CI did test the fixer on GPU as part of the eamxx-sa workflow).
Re: history. Unless you do want to keep a copy of this branch as is (in which case, follow Andrew advice and create a bkp branche before doing an interactive rebase), I would simply do git rebase -i origin/master and squash/reword/reorg commits as you see fit. Imho, given the size of the PR, something in the range of 4-7 commits seem reasonable.
Edit: I just noticed that there is no standalone test that triggers the fixer. Perhaps it could be a good idea to enable it in one of the tests in tests/multi-process/dynamics_physics/* folders? E.g., homme_shoc_cld_p3_rrtmgp/inputs.yaml already enables col conservation checks, so maybe we can also enable the fixer there? Or does the fixer require surface coupling to be present/active? If that's the case, then maybe running a GPU test before integrating may help reveal any GPU error, since no CI test runs CIME cases on GPU (for now).
thanks, i will try to add a fixer to a standalone test (right now it is not working because probably fluxes are un-inited instead of being zero), but i do not think this is the type of test Andrew meant, because it does not test the fixer itself. I can potentially turn on debug diagn for the fixer in one of CIME tests, and then use python to parse that output like it was done for some SL/energy tests before. @ambrad ?
I just meant that I recommend that this PR include a test that can run in the nightlies. What form it takes is up to you.
For a standalone test, say, homme_shoc_cld_p3_rrtmgp_np1 , dp is not init-ed properly before homme when compute_current_energy is called. we'll try to fix that.
For a standalone test, say,
homme_shoc_cld_p3_rrtmgp_np1, dp is not init-ed properly before homme when compute_current_energy is called. we'll try to fix that.
You can also add a new test to these, call it homme_shoc_cld_p3_rrtmgp_w_fixer or so. In fact, that's what I would recommend doing first. If things are stable and good after a few months, we can remove homme_shoc_cld_p3_rrtmgp and keep homme_shoc_cld_p3_rrtmgp_w_fixer as superseding of the two.
I will take a closer look at the entirety of the PR soon (tmrw or day after) with fresh eyes
dp not initialized would probably be an issue in tests that run not [homme, physics], but [physics, homme], are there such tests? what do "pure physics" tests do?
To rephrase Luca's fix for un-inited dp (for my own benefit) -- since dp on phys grid did not have a time stamp, IO was filling it with FillValue before the 1st call to energy (IO does 1st instant output at init). This behavior will change soon, IO won't be overwriting fields.
PM ERS run passed restart stage:
/pscratch/sd/o/onguba/e3sm_scratch/pm-gpu/ERS_Lh4.ne4_ne4.F2010-SCREAMv1.pm-gpu_gnugpu.eamxx-output-preset-1--eamxx-fixer_debug_output.20250714_121120_6i3eza/run
To rephrase Luca's fix for un-inited dp (for my own benefit) -- since dp on phys grid did not have a time stamp, IO was filling it with FillValue before the 1st call to energy (IO does 1st instant output at init). This behavior will change soon, IO won't be overwriting fields.
PR #7397 was integrated, so the issue regarding un-inited dp should now be resolved even without commit 405e1c5