Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ...89
A cnmatrix reproducibility problem first occurs in dev172 and more specifically when going from ccs_config_cesm0.0.88 to ccs_config_cesm0.0.89. This is the only difference that I see when comparing 89 to 88:
git diff ccs_config_cesm0.0.88 ccs_config_cesm0.0.89
diff --git a/machines/derecho/config_machines.xml b/machines/derecho/config_machines.xml
index 633fe74..6f9d6fd 100644
--- a/machines/derecho/config_machines.xml
+++ b/machines/derecho/config_machines.xml
@@ -21,11 +21,11 @@
<MPI_GPU_WRAPPER_SCRIPT>get_local_rank</MPI_GPU_WRAPPER_SCRIPT>
<PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
<mpirun mpilib="default">
- <executable>mpiexec</executable>
+ <executable>mpibind</executable>
<arguments>
<arg name="label"> --label</arg>
- <arg name="buffer"> --line-buffer</arg>
- <arg name="num_tasks" > -n {{ total_tasks }}</arg>
+ <arg name="buffer"> --line-buffer</arg>
+ <arg name="separator"> -- </arg>
</arguments>
</mpirun>
<module_system type="module" allow_error="true">
Originally posted by @slevis-lmwg in https://github.com/ESCOMP/CTSM/issues/640#issuecomment-2176742021
Definition of Done:
- [ ] Add warning to build-namelist to abort when matrix is on and LND_NTHRDS > 1.
- [ ] Have our tests setup to ignore the warnings for our tests for this configuration
- [ ] Fix CN Matrix code so that OMP_NUM_THREADS env variable is NOT used, but use values from the NUOPC cap
- [ ] Make sure that fixes it -- and if not track down the problem
- [ ] Fix the real problem
- [ ] Remove the warning in the namelist
The original aux_clm failure:
ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn BASELINE ctsm5.2.005.cn-matrix_n09: DIFF
The same test without --clm-matrixcnOn passed and the two tests differ ONLY as follows:
< hist_fincl1 = 'TRAFFICFLUX','SNOWLIQ:A','SNOWICE:A','FCO2'
---
> hist_fincl1 = 'TRAFFICFLUX', 'SNOWLIQ:A', 'SNOWICE:A', 'FCO2', 'SNO_EXISTENCE', 'SNO_ABS', 'SNO_T:M', 'SNO_GS:X'
33,34c33,34
< hist_nhtfrq = -24,-8
< hist_wrt_matrixcn_diag = .true.
---
> hist_nhtfrq = 0,-240
> hist_wrt_matrixcn_diag = .false.
@ekluzek explains that this hist_wrt_matrixcn_diag output is important for SASU spin-ups.
We changed the ERP test to a REP test to show that this is a threading reproducibility problem:
FAIL REP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn COMPARE_base_rep2
PASS REP_P128x1_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn
This is the failure that first appears in dev172 with ccs_config_cesm0.0.89.
From #640 testing/troubleshooting on this issue:
git grep matrixcnOn | grep 'test name' | grep P64x2 on the testlist and try these tests as REP tests. Change the Lm13 test (that we know fails) with Ld13 to see if shorter also fails. They all passed.
PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-ciso_decStart--clm-matrixcnOn
PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-coldStart--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-flexCN_FUN--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-noFUN_flexCN--clm-matrixcnOn
PASS REP_P64x2_Ly3.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput--clm-matrixcnOn
PASS REP_P64x2_Ld5_D.f10_f10_mg37.I1850Clm50Bgc.derecho_intel.clm-ciso--clm-matrixcnOn
PASS REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn
PASS REP_P64x2_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
The change has to do with how MPI is binding processors on Derecho. This must have some influence on how the code is running for threading. There were problems with the code before the binding was done, which might mean that it passed before, because threading wasn't working correctly. And now we are seeing bad behavior with threading, but it's probably because threading is actually working correctly with this update.
#2296 #2289
Also links to earlier mentions of potential threading problems in the cnmatrix PR: https://github.com/ESCOMP/CTSM/pull/640#issuecomment-710538746 https://github.com/ESCOMP/CTSM/pull/640#issuecomment-1140109820
Note there are some issues with threading for CN-Matrix.
- Threading code uses different calls than without it
- Matrix is using the OMP_NUM_THREADS env variable rather than what's passed down from the NUOPC cap
The first shows potential issues in how the code was implemented, but not likely something easy to fix now. The second is something that it seems we should fix though.
@ekluzek I committed our changes (commit 4ca2b6aeeb3388efa9b4026b826d902d8a9f6f95), but they do not work. I troubleshooted a bit, and all I could get was: out=NTHRDS_LND = 0
Testing with:
./create_test REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn
@ekluzek by looking elsewhere in the code and through repeated trial and error, I have managed to solve this to my pleasant surprise.
I'm working on the next TODO item and having no luck. I pushed my attempt to 8e618bb in case @ekluzek you have suggestions.
The error says
# Failed test 'matrixcnOn_with_threading'
# at ./build-namelist_test.pl line 1393.
# got: '0'
# expected: anything else
-
So the problem is that OMP_NUM_THREADS is a proper UNIX environment variable and NOT part of the env_run.xml file. So it has to be treated in a different way from the env_*.xml file variables.
-
The other problem with using NTHRDS_LND is that it's implemented in a more complex way in the XML file with a full entry like this:
<entry id="NTHRDS">
<type>integer</type>
<values>
<value compclass="ATM">1</value>
<value compclass="CPL">1</value>
<value compclass="LND">1</value>
.
.
.
<value compclass="ESP">1</value>
</values>
<desc>number of threads for each task in each component</desc>
</entry>
It's treated as NTHRDS_LND by xmlchange and xmlquery, but the perl unit tester AND the perl build-namelist code would need to implement the above syntax in order to do this correctly. This likely would take a bit to figure this out. So dropping for now.
We figure we will want to implement the first option so that the unit tester is testing this capability. But, it will take a little more finagling than we want to spend now. But, is something that's straightforward to do (at least for me).
In PR #2545 I saw a test fail in comparison baseline that is because of this issue see this comment:
https://github.com/ESCOMP/CTSM/pull/2545#issuecomment-2249546181
The test is:
ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn_ignore_warnings
so we should expect this same thing to come up in baseline testing for future tags as well.
@slevis-lmwg is this something that we can fix by the beta04 release (mid-Sept timeline)?
@wwieder this isn't something we really need now, but I would like to check off the next three boxes in the checklist at the top by ctsm6.0.0. So I put it in for that milestone. This is also something that I probably have the best idea on what needs to be done, once I can get to it. So assigned to me makes sense.