CTSM icon indicating copy to clipboard operation
CTSM copied to clipboard

Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ...89

Open slevis-lmwg opened this issue 1 year ago • 3 comments

          A cnmatrix reproducibility problem first occurs in dev172 and more specifically when going from ccs_config_cesm0.0.88 to ccs_config_cesm0.0.89. This is the only difference that I see when comparing 89 to 88:

git diff ccs_config_cesm0.0.88 ccs_config_cesm0.0.89

diff --git a/machines/derecho/config_machines.xml b/machines/derecho/config_machines.xml
index 633fe74..6f9d6fd 100644
--- a/machines/derecho/config_machines.xml
+++ b/machines/derecho/config_machines.xml
@@ -21,11 +21,11 @@
     <MPI_GPU_WRAPPER_SCRIPT>get_local_rank</MPI_GPU_WRAPPER_SCRIPT>
     <PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
     <mpirun mpilib="default">
-      <executable>mpiexec</executable>
+      <executable>mpibind</executable>
       <arguments>
         <arg name="label"> --label</arg>
-        <arg name="buffer"> --line-buffer</arg>
-        <arg name="num_tasks" > -n {{ total_tasks }}</arg>
+       <arg name="buffer"> --line-buffer</arg>
+       <arg name="separator"> -- </arg>
       </arguments>
     </mpirun>
     <module_system type="module" allow_error="true">

Originally posted by @slevis-lmwg in https://github.com/ESCOMP/CTSM/issues/640#issuecomment-2176742021

Definition of Done:

  • [ ] Add warning to build-namelist to abort when matrix is on and LND_NTHRDS > 1.
  • [ ] Have our tests setup to ignore the warnings for our tests for this configuration
  • [ ] Fix CN Matrix code so that OMP_NUM_THREADS env variable is NOT used, but use values from the NUOPC cap
  • [ ] Make sure that fixes it -- and if not track down the problem
  • [ ] Fix the real problem
  • [ ] Remove the warning in the namelist

slevis-lmwg avatar Jun 24 '24 22:06 slevis-lmwg

The original aux_clm failure: ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn BASELINE ctsm5.2.005.cn-matrix_n09: DIFF The same test without --clm-matrixcnOn passed and the two tests differ ONLY as follows:

<  hist_fincl1 = 'TRAFFICFLUX','SNOWLIQ:A','SNOWICE:A','FCO2'
---
>  hist_fincl1 = 'TRAFFICFLUX', 'SNOWLIQ:A', 'SNOWICE:A', 'FCO2', 'SNO_EXISTENCE', 'SNO_ABS', 'SNO_T:M', 'SNO_GS:X'
33,34c33,34
<  hist_nhtfrq = -24,-8
<  hist_wrt_matrixcn_diag = .true.
---
>  hist_nhtfrq = 0,-240
>  hist_wrt_matrixcn_diag = .false.

@ekluzek explains that this hist_wrt_matrixcn_diag output is important for SASU spin-ups.

We changed the ERP test to a REP test to show that this is a threading reproducibility problem:

FAIL REP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn COMPARE_base_rep2
PASS REP_P128x1_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn

This is the failure that first appears in dev172 with ccs_config_cesm0.0.89.

From #640 testing/troubleshooting on this issue: git grep matrixcnOn | grep 'test name' | grep P64x2 on the testlist and try these tests as REP tests. Change the Lm13 test (that we know fails) with Ld13 to see if shorter also fails. They all passed.

PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-ciso_decStart--clm-matrixcnOn
PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-coldStart--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-flexCN_FUN--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-noFUN_flexCN--clm-matrixcnOn
PASS REP_P64x2_Ly3.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput--clm-matrixcnOn
PASS REP_P64x2_Ld5_D.f10_f10_mg37.I1850Clm50Bgc.derecho_intel.clm-ciso--clm-matrixcnOn
PASS REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn
PASS REP_P64x2_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn

slevis-lmwg avatar Jun 24 '24 22:06 slevis-lmwg

The change has to do with how MPI is binding processors on Derecho. This must have some influence on how the code is running for threading. There were problems with the code before the binding was done, which might mean that it passed before, because threading wasn't working correctly. And now we are seeing bad behavior with threading, but it's probably because threading is actually working correctly with this update.

#2296 #2289

Also links to earlier mentions of potential threading problems in the cnmatrix PR: https://github.com/ESCOMP/CTSM/pull/640#issuecomment-710538746 https://github.com/ESCOMP/CTSM/pull/640#issuecomment-1140109820

ekluzek avatar Jun 24 '24 22:06 ekluzek

Note there are some issues with threading for CN-Matrix.

  1. Threading code uses different calls than without it
  2. Matrix is using the OMP_NUM_THREADS env variable rather than what's passed down from the NUOPC cap

The first shows potential issues in how the code was implemented, but not likely something easy to fix now. The second is something that it seems we should fix though.

ekluzek avatar Jun 24 '24 22:06 ekluzek

@ekluzek I committed our changes (commit 4ca2b6aeeb3388efa9b4026b826d902d8a9f6f95), but they do not work. I troubleshooted a bit, and all I could get was: out=NTHRDS_LND = 0

Testing with: ./create_test REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn

slevis-lmwg avatar Jul 11 '24 00:07 slevis-lmwg

@ekluzek by looking elsewhere in the code and through repeated trial and error, I have managed to solve this to my pleasant surprise.

slevis-lmwg avatar Jul 11 '24 23:07 slevis-lmwg

I'm working on the next TODO item and having no luck. I pushed my attempt to 8e618bb in case @ekluzek you have suggestions.

The error says

#   Failed test 'matrixcnOn_with_threading'
#   at ./build-namelist_test.pl line 1393.
#          got: '0'
#     expected: anything else

slevis-lmwg avatar Jul 12 '24 22:07 slevis-lmwg

  1. So the problem is that OMP_NUM_THREADS is a proper UNIX environment variable and NOT part of the env_run.xml file. So it has to be treated in a different way from the env_*.xml file variables.

  2. The other problem with using NTHRDS_LND is that it's implemented in a more complex way in the XML file with a full entry like this:

    <entry id="NTHRDS">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="LND">1</value>
.
.
.
        <value compclass="ESP">1</value>
      </values>
      <desc>number of threads for each task in each component</desc>
    </entry>

It's treated as NTHRDS_LND by xmlchange and xmlquery, but the perl unit tester AND the perl build-namelist code would need to implement the above syntax in order to do this correctly. This likely would take a bit to figure this out. So dropping for now.

We figure we will want to implement the first option so that the unit tester is testing this capability. But, it will take a little more finagling than we want to spend now. But, is something that's straightforward to do (at least for me).

ekluzek avatar Jul 15 '24 20:07 ekluzek

In PR #2545 I saw a test fail in comparison baseline that is because of this issue see this comment:

https://github.com/ESCOMP/CTSM/pull/2545#issuecomment-2249546181

The test is:

ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn_ignore_warnings

so we should expect this same thing to come up in baseline testing for future tags as well.

ekluzek avatar Jul 26 '24 19:07 ekluzek

@slevis-lmwg is this something that we can fix by the beta04 release (mid-Sept timeline)?

wwieder avatar Aug 08 '24 16:08 wwieder

@wwieder this isn't something we really need now, but I would like to check off the next three boxes in the checklist at the top by ctsm6.0.0. So I put it in for that milestone. This is also something that I probably have the best idea on what needs to be done, once I can get to it. So assigned to me makes sense.

ekluzek avatar Sep 09 '24 19:09 ekluzek