E3SM icon indicating copy to clipboard operation
E3SM copied to clipboard

Fix C balance and FPE errors

Open peterdschwartz opened this issue 1 year ago • 14 comments

Carbon balance was never checked correctly and let through negative values. Initialization of Ecosystem variables needed to be re-adjusted, and balance checking for FATES needed to be changed, which causes round-off differences in CMASS_BALANCE_ERROR for fate_cold_allvars test.

Also included divide-by-zero check in PhenologyMod.F90 to fix fpe in debug mode.

[non-BFB] for one FATES test. Fixes #6120 Fixes #6203 Fixes #6177

peterdschwartz avatar Feb 13 '24 18:02 peterdschwartz

@glemieux Here is a PR to fix the balance checking issues. I did get DIFF for fates_cold_allvars but only for CMASS_BALANCE_CHECK but the values were still ~1E-16.

Let me know if how I implemented the new balance checking makes sense to you.

peterdschwartz avatar Feb 13 '24 18:02 peterdschwartz

After doing a test merge I obtained FILLDIFF for the fates_cold_allvars. That I need to look into today

peterdschwartz avatar Feb 21 '24 15:02 peterdschwartz

Quick update: fates_cold_allvars FAILs at the COMP_base_rest stage. The cprnc file just shows CMASS_BALANCE_ERROR with very small differences. Since there are no other diffs (i.e., in variables used to calculate carbon balance), I wonder if it's just due to floating point rounding issues. But, if that is the case, I'm not sure what the solution should be.

 CMASS_BALANCE_ERROR   (lndgrid,time)  t_index =     10    10
        127      384  (    25,     1) (    93,     1) (   148,     1) (   148,     1)
                 211   0.000000000000000E+00  -2.810582373931857E-16 2.3E-21 -1.292301226966941E-16 2.0E-06 -1.292301226966941E-16
                 211   0.000000000000000E+00  -2.810570991926628E-16         -1.292278595305381E-16         -1.292278595305381E-16
                 384  (    25,     1) (    93,     1)
          avg abs field values:    1.589865621873051E-16    rms diff: 0.0E+00   avg rel diff(npos):  2.0E-06
                                   1.589865886570847E-16                        avg decimal digits(ndif):  5.7 worst:  4.8
 RMS CMASS_BALANCE_ERROR              0.0000E+00            NORMALIZED  0.0000E+00

peterdschwartz avatar Feb 21 '24 23:02 peterdschwartz

What is the diff with? A baseline from the previous code or is it a diff within a test like an ERS test?

rljacob avatar Feb 21 '24 23:02 rljacob

Not baseline diff (though that is also there but makes sense to me). It fails the COMP_base_rest portion of ERP test (same thing with the ERS version I made for debugging).

peterdschwartz avatar Feb 21 '24 23:02 peterdschwartz

Try a REP test.

rljacob avatar Feb 22 '24 00:02 rljacob

REP passes just fine. I'm going to double check that each variable for carbon balance is read/written to the restart file, and maybe a longer simulation will reveal diff's accruing in other fields.

peterdschwartz avatar Feb 22 '24 17:02 peterdschwartz

status: fates_cold_allvars still fails restart test. Tried adding every gridcell and column variable related to CMASS_BALANCE to history and restart files but no other differences. Tried running test for 2 years to accumulate more carbon, but there was no other diffs detected.

Adjusting the history output interval did allow the test to pass the restart comparison but that's not a real solution.

Scheduled meeting with FATES devs to go over the logic for balance checking with FATES enabled.

peterdschwartz avatar Mar 21 '24 16:03 peterdschwartz

@peterdschwartz what is the status of this PR?

rljacob avatar Apr 12 '24 04:04 rljacob

@rljacob I met with @glemieux a couple of weeks ago. He mentioned a possible time-step dependent part of FATES that could be the cause, but I think he's been busy with the FATES api update and our follow up meeting with @thorntonpe was cancelled due to lab holiday.

Today, I'll try increasing the history output to be every timestep and see if that provides any useful information.

peterdschwartz avatar Apr 12 '24 16:04 peterdschwartz

Update:

Increasing output freq finally showed a new diff in a field. Looking into it further, it appears that the vertical transport tendency terms for carbon are not added to the corresponding leaching term due to some issue 5 years ago (according to git blame atleast), but it is for nitrogen and phosphorus.

Hopefully the straightforward solution works here.

 LITR2C_TNDNCY_VERT_TRANS   (lndgrid,levdcmp,time)  t_index =    194   194
          3     5760  (   153,     2,     1) (   153,     1,     1) (   237,     3,     1) (   247,    10,     1)
                3165   1.043116371057806E-09  -1.643725044964128E-09 2.2E-19  2.944499461962624E-12 8.1E-11  1.293930177203629E-16
                3165   1.043116371057806E-09  -1.643725044964128E-09          2.944499678803059E-12          1.293930309552527E-16
                5760  (   153,     2,     1) (   153,     1,     1)
          avg abs field values:    4.034842251976478E-11    rms diff: 0.0E+00   avg rel diff(npos):  8.1E-11
                                   4.034842251976478E-11                        avg decimal digits(ndif):  7.1 worst:  7.0
 RMS LITR2C_TNDNCY_VERT_TRANS         0.0000E+00            NORMALIZED  0.0000E+00

peterdschwartz avatar Apr 12 '24 18:04 peterdschwartz

@peterdschwartz can you point me to the line of code with the git blame that you noted earlier, please?

glemieux avatar Apr 17 '24 05:04 glemieux

@glemieux This Code block is what I was referring to. The transport tendency is not added to carbon leached pool even though the corresponding terms are added for Nitrogen and Phosphorus.

More experimenting showed that bypassing the SoilLittVertTransp calculation, the restart failure goes away. For FATES, the UpdateLitterFluxes is done just prior that adjusts the inputs to the SoilLittVertTransp that could be the cause. Line

peterdschwartz avatar Apr 17 '24 15:04 peterdschwartz

Waiting to see about FATES side of this PR.

rljacob avatar May 02 '24 17:05 rljacob

Since there's a previous fates related issue linked above with the same errors, I'm going to just clean the branch up for the other issues and follow the review comments. The other issue can be used to the issue.

peterdschwartz avatar May 21 '24 19:05 peterdschwartz

Since there's a previous fates related issue linked above with the same errors, I'm going to just clean the branch up for the other issues and follow the review comments. The other issue can be used to the issue.

Sorry for the delay in trying to address the fates-side of this. Just to clarify, you're saying your going to move forwards towards integration @peterdschwartz? If so, that sounds good to me.

glemieux avatar May 22 '24 19:05 glemieux

Since there's a previous fates related issue linked above with the same errors, I'm going to just clean the branch up for the other issues and follow the review comments. The other issue can be used to the issue.

Sorry for the delay in trying to address the fates-side of this. Just to clarify, you're saying your going to move forwards towards integration @peterdschwartz? If so, that sounds good to me.

@glemieux Yes, and now I know how to how to reliably reproduce the issue, we can use #6125 to track progress. I will share a branch with the changes to do it.

@bishtgautam I have rebased the branch and addressed your comments. There's potentially a couple of different ways to handle the else condition in the PhenologyMod, so I went with what I believe to be the simplest and added a comment explaining why. Let me know what you think

e3sm_developer came back [BfB] with the exception of one FATES test that will FAIL restart comparison for the time being.

peterdschwartz avatar May 23 '24 15:05 peterdschwartz

This is ready to merge.

rljacob avatar May 30 '24 17:05 rljacob