
eamxx: Non-BFB behavior with `ne4pg2_ne4pg2.F2010-SCREAMv1` cases on pm-cpu with Intel when changing NTASKS

Open ndkeen opened this issue 1 year ago • 5 comments

I get different results when I change the number of MPI tasks for CPU jobs of scream. I have only tested on pm-cpu (and muller-cpu). I've been running scaling tests for both e3sm and scream. All e3sm cases are BFB, but every different node count used for a scream case appears to produce a different set of hashes. For a given MPI task count, re-running the case is BFB, as expected.

I also just tried PEM.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel, which fails. Case directory: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/PEM.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel.r00

It looks like the test passes with DEBUG: PEM_D_P1024_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel

Below I show that I can reproduce this with ne4, so I changed the title of the issue.

ndkeen avatar Oct 01 '24 17:10 ndkeen

This might be specific to Intel. We have PEM_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_gnu.scream-spa_remap--scream-output-preset-4 in our nightly, which is GNU on pm-cpu.

ambrad avatar Oct 01 '24 19:10 ambrad

Yes, this is with Intel, and we don't see it with GNU.

ndkeen avatar Dec 06 '24 19:12 ndkeen

Noting that we also see a diff with ne4: PEM.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_intel fails.

ndkeen avatar Jan 14 '25 18:01 ndkeen

Noting that I still see a diff with the Intel compiler for at least this test:

ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-bfbhash--eamxx-output-preset-6

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-feb21/ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-bfbhash--eamxx-output-preset-6.e3sm_eamxx_v1_medres

ndkeen avatar Feb 21 '25 22:02 ndkeen

With Nov 6 2025 checkout

GNU all pass
ERP.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu
PEM.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_gnu
SMS_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERS_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_gnu.eamxx-L72

AMD all pass
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_amdclang.eamxx-L72
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_amdclang.eamxx-L72

Intel pass
SMS.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_intel
SMS_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
ERS_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
ERP_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
PEM_D_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72

Intel fail compare:
ERP.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_intel
PEM.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_intel
ERP_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72
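
In the Intel lists above, every failing test is an ERP or PEM test without the `D` (debug) modifier, while the passing ones are SMS, ERS, or the `_D` variants. As a rough aid for sorting such lists, here is a minimal sketch that splits a test name into its parts. It assumes the usual CIME layout `TYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`; the `ParsedTest` tuple and `parse_test_name` helper are hypothetical names for illustration and are not part of CIME or EAMxx.

```python
from typing import NamedTuple, Tuple


class ParsedTest(NamedTuple):
    test_type: str               # e.g. "ERP", "PEM", "SMS", "ERS"
    modifiers: Tuple[str, ...]   # e.g. ("D", "Ln22")
    grid: str
    compset: str
    machine: str
    compiler: str
    debug: bool                  # True when the "D" modifier is present


def parse_test_name(name: str) -> ParsedTest:
    """Split a test name on '.' and '_' (assumed CIME naming layout)."""
    head, grid, compset, mach_comp = name.split(".")[:4]
    test_type, *mods = head.split("_")
    # The compiler is assumed to follow the last underscore (pm-cpu_intel).
    machine, compiler = mach_comp.rsplit("_", 1)
    return ParsedTest(test_type, tuple(mods), grid, compset,
                      machine, compiler, "D" in mods)


if __name__ == "__main__":
    intel_fail = [
        "ERP.ne4pg2_oQU480.F2010-SCREAMv1.pm-cpu_intel",
        "PEM_Ln22.conusx4v1pg2_r05_oECv3.F2010-SCREAMv1-noAero.pm-cpu_intel.eamxx-L72",
    ]
    for name in intel_fail:
        t = parse_test_name(name)
        print(t.test_type, t.compiler, "debug" if t.debug else "non-debug")
```

Grouping by `(test_type, compiler, debug)` makes it easier to check whether a new failure fits the same pattern of non-debug Intel tests that change the task layout.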

Running the ne4 case with Intel and turning on all hashing, we can see where the diffs happen between cases with 64 and 96 tasks (all single-threaded, without any forcing of OpenMP builds as we did in other issues):

  File 1: f4.F2010-SCREAMv1.ne4pg2_ne4pg2.nexty-nov11.intel.n001.p064x111111.18s.bfb2.L72/run/e3sm.log.45118395.251111-162903.gz
  File 2: f4.F2010-SCREAMv1.ne4pg2_ne4pg2.nexty-nov11.intel.n001.p096x111111.18s.bfb2.L72/run/e3sm.log.45118424.251111-162835.gz

================================================================================
REPORT: ABSOLUTE FIRST HASH DIFFERENCE (Starting at First Step)
================================================================================
#  DIVERGENCE FOUND at BFB Step 0 (Entry #504)

--- Context (Last 3 identical hashes) ---
 0: hxxhash>              5 0 58d0e91b44d916e0 (E BE-pre-ComposeTransport-q-HV-0)
 0: hxxhash>              5 1 c9f0a2ef2b3fd16d (E BE-pre-ComposeTransport-q-HV-0)
 0: hxxhash>              5 2 b874c373b720216d (E BE-pre-ComposeTransport-q-HV-0)

--- Divergence Found ---
-0: hxxhash>              5 0 4912c99cc239cc17 (T BE-pre-ComposeTransport-q-HV-0)  (File 1)
+0: hxxhash>              5 0 4912c99cc239c5bb (T BE-pre-ComposeTransport-q-HV-0)  (File 2)
================================================================================

================================================================================
REPORT: FIRST DIVERGENCE AFTER INITIALIZATION (Skipping First Step)
================================================================================
#  DIVERGENCE FOUND at BFB Step 1 (Entry #626)

--- Context (Last 3 identical hashes) ---
 0: hxxhash>              5 2 b874c373b720216d (E BE-post-ComposeTransport-qdp-DSS-1)
 0: hxxhash>              5 0 c6f9d9a1d4ce8efa (T BE-post-ComposeTransport-qdp-DSS-1)
 0: hxxhash>              5 1 d75cd8fd8d26a7f8 (T BE-post-ComposeTransport-qdp-DSS-1)

--- Divergence Found ---
-0: bfbhash>              1 4ec4947c9350cbaf (Hommexx)  (File 1)
+0: bfbhash>              1 4ec4947c93ab3f1b (Hommexx)  (File 2)
================================================================================

So first we note the diffs that happen in the first step (step 0), which may or may not be a result of initialization. But then the very next bfbhash is different between the two cases. Note that both cases use 72 vertical levels to further reduce complexity.
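
For anyone who wants to repeat this kind of comparison, here is a minimal sketch of finding the first differing hash line between two logs. It is not the script that produced the report above. It assumes only that hash lines contain `hxxhash>` or `bfbhash>`, as in the excerpts, and the helper names (`hash_lines`, `first_divergence`, `read_log`) are hypothetical.

```python
import gzip
import re
from itertools import zip_longest

# Lines of interest contain one of these markers (taken from the excerpts).
HASH_LINE = re.compile(r"\b(?:hxxhash|bfbhash)>")


def read_log(path):
    """Read a plain or gzip-compressed log as a list of text lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        return f.readlines()


def hash_lines(lines):
    """Keep only the hash lines, stripped of surrounding whitespace."""
    return [ln.strip() for ln in lines if HASH_LINE.search(ln)]


def first_divergence(lines_a, lines_b, context=3):
    """Return (index, preceding identical lines, line_a, line_b) for the
    first hash entry that differs, or None if all hash lines match."""
    a, b = hash_lines(lines_a), hash_lines(lines_b)
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, a[max(0, i - context):i], x, y
    return None


if __name__ == "__main__":
    run_64 = [
        " 0: hxxhash>              5 2 b874c373b720216d (E BE-pre-ComposeTransport-q-HV-0)",
        " 0: hxxhash>              5 0 4912c99cc239cc17 (T BE-pre-ComposeTransport-q-HV-0)",
    ]
    run_96 = [
        " 0: hxxhash>              5 2 b874c373b720216d (E BE-pre-ComposeTransport-q-HV-0)",
        " 0: hxxhash>              5 0 4912c99cc239c5bb (T BE-pre-ComposeTransport-q-HV-0)",
    ]
    print(first_divergence(run_64, run_96))
```

The sketch compares hash entries by position (0-based) and ignores all other log lines, so it only gives a meaningful answer when both runs emit hashes in the same order. To skip the first step, as the second report does, drop the leading entries from both lists before comparing.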

ndkeen avatar Nov 11 '25 23:11 ndkeen