
non-BFB with diff PE layouts -- PEM.ne4pg2_ne4pg2.F2010-SCREAMv1 on pm-cpu/intel and frontier/cray

Open ndkeen opened this issue 1 year ago • 19 comments

On pm-cpu with Intel, and only with an OPT build, we see non-BFB results when changing PE layouts. I can reproduce with the scream repo of June 22, but it needs mods related to the Intel compiler in the scream repo. PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel

I also get a fail with only 16 MPI ranks, and after the very first step: PEM_P16x1_Ln1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel

I could make a branch with the changes, but it also needs flag cleanup in ekat.

Starting with a crash with ne30 on chrysalis, we worked toward a smaller reproducer: https://github.com/E3SM-Project/scream/issues/2381
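
For context, a PEM test builds one executable, runs the case twice with different MPI-task layouts, and compares the two runs bit for bit. A minimal sketch of launching the small reproducer above, using the same $e3sm placeholder as the scripts later in this thread:

# Launch the one-step, 16-rank reproducer; the test fails COMPARE if the
# two PE layouts do not produce bit-for-bit identical results.
$e3sm/cime/scripts/create_test PEM_P16x1_Ln1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel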

ndkeen avatar Jun 23 '23 22:06 ndkeen

While Andrew B reported that he does not see any fails with PEM on chrysalis, I do see a fail on frontier using the cray compiler (with the frontier branch from Jun 19).

PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream

/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.r00

Same fail with ne4: PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream

Interestingly, it also fails in DEBUG on frontier: PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream

ndkeen avatar Jun 23 '23 23:06 ndkeen

Copying comment from https://github.com/E3SM-Project/scream/issues/2381#issuecomment-1605076280:

Runs on Chrysalis show no diffs. Perhaps this is an issue isolated to pm-cpu Intel.

Script:

tests=""
for npe in 256 362 512 640; do   
    for compiler in gnu intel; do
        tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_${compiler}"
    done
done
$e3sm/cime/scripts/create_test $tests --machine chrysalis --project $wcid -j 64

Results showing both PASS for each test and bfbhash comparison among tests:

$ ./cs.status.20230623_164037_6v0xpu | grep Overall; for compiler in gnu intel; do echo $compiler; for i in PEM_*${compiler}*; do zgrep bfbhash $i/run/e3sm.log* | tail -n 1; done; done
  PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
  PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
  PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
gnu
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
  0: bfbhash>             36 d228e1e60be6efd2 (Hommexx)
intel
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)
  0: bfbhash>             36 ed625355c8369ff8 (Hommexx)

ambrad avatar Jun 23 '23 23:06 ambrad

So far Crusher does not reproduce any of these. A summary of results follows.

Repo is master at e57ed3848d.
  PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
  PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
  PEM_P32x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
  PEM_P64x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
  PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:

In addition, I ran repeat testing on PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level and stopped at 20 passes.
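
For reference, a minimal sketch of what such repeat testing can look like; the actual repeat script is not shown in this thread, and the use of --wait (which makes create_test block until the test finishes and return nonzero on failure) is an assumption:

# Rerun one PEM test until it fails or reaches 20 passes.
test=PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level
for i in $(seq 1 20); do
    if ! $e3sm/cime/scripts/create_test $test --test-id rep$i --wait; then
        echo "FAIL on iteration $i"
        break
    fi
done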

I'm repeating a subset of these with the branch machines/frontier at https://github.com/E3SM-Project/scream/commit/2a918e18ef80ffb50eaea63a7f528e27fb3a0e32:

compiler=crayclang-scream
machine=crusher-scream-gpu

tests=""
for npe in 16 32 64 128; do
    tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.${machine}_${compiler}"
done
echo $tests

$e3sm/cime/scripts/create_test $tests

Results:

  PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM_P32x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
  PEM_P64x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:

So none of the code changes to ELM in the machines/frontier branch are a general problem.

ambrad avatar Jun 24 '23 00:06 ambrad

Currently running the following on Frontier with machines/frontier at 2a918e18ef and will update this comment with results. The goal is to reproduce Noel's run, hopefully with ne4pg2 and the scream-internal_diagnostics_level testmod.

compiler=crayclang-scream  
machine=frontier-scream-gpu

tests=""
for sfx in "" ".scream-internal_diagnostics_level"; do
    tests+=" PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.${machine}_${compiler}${sfx}"
done
for npe in 16; do
    for sfx in "" ".scream-internal_diagnostics_level"; do
        tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.${machine}_${compiler}${sfx}"
    done
done
echo $tests

$e3sm/cime/scripts/create_test $tests

PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level indeed diffs, so we can examine the hashes. I see this as the first diff:

$ find . -name e3sm.log\*
./run/case2run/e3sm.log.1360717.230624-182408.gz
./run/e3sm.log.1360717.230624-182228.gz
$ zgrep "hash>" ./run/e3sm.log.1360717.230624-182228.gz > r1.txt
$ zgrep "hash>" ./run/case2run/e3sm.log.1360717.230624-182408.gz > r2.txt
$ diff r1.txt r2.txt | head -n 20
6564c6564
< 0: exxhash>    1-  0.33333 1 d474ba8f10ecada3 (SurfaceCouplingImporter-pst-sc-0)
---
> 0: exxhash>    1-  0.33333 1 d474ba8f10ecada5 (SurfaceCouplingImporter-pst-sc-0)

So the diff is occurring in one of the surface components or the importer, 1/3 of the way through the first day. Running the case again gives the same diff.

I see essentially the same diff for PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level:

1844c1844
< 0: exxhash>    1-  0.08333 1 77cd2b37308668a6 (SurfaceCouplingImporter-pst-sc-0)
---
> 0: exxhash>    1-  0.08333 1 77cd2b37308668a5 (SurfaceCouplingImporter-pst-sc-0)

Interestingly, so far (3 runs) I've not been able to reproduce Noel's PEM_D.ne4pg2_... failure. I'm going to try with the hash diagnostics off, in case they are affecting things. Later: One run passed. Now I'll run the repeat script. I'm doing 2-day runs: PEM_D_Ld2.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.

For now I'm working under the hypothesis that _D is in fact fine. I noticed that the Depends file for Frontier is missing the CICE opt mods that we use on Crusher, so I'm going to test those, since a diff in CICE is consistent with the hash lines above. Later: The change https://github.com/E3SM-Project/scream/commit/68a174902527d3fa831a5d4ccc55d27f6c763cee looks promising.
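
For readers unfamiliar with the mechanism: the per-machine Depends.*.cmake files lower the optimization level on individual source files. A minimal sketch of the kind of entry involved, assuming the e3sm_add_flags helper those files use; the file list below is illustrative, not the contents of the actual commit:

# Illustrative only: compile selected CICE sources at -O0 in non-DEBUG builds
# to sidestep a suspected compiler-optimization problem.
set(CICE_O0_SRCS
  cice/src/source/ice_shortwave.F90)
if (NOT DEBUG)
  foreach (SRC IN LISTS CICE_O0_SRCS)
    e3sm_add_flags("${SRC}" "-O0")
  endforeach()
endif()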

ambrad avatar Jun 24 '23 21:06 ambrad

@ndkeen this is very unlikely, but is it possible that the DIFF you saw resulted from the following sequence?

  1. Run PEM_D.ne4pg2...
  2. The run gets cancelled due to the wallclock limit.
  3. Manually change STOP_N in env_run.xml from 5 to 1 or 2.
  4. Forget to do the same for the case2 env_run.xml file.
  5. ./case.submit.
  6. Diff due to different number of days in the two runs.

I ask because so far I'm unable to reproduce the DIFF with PEM_D_Ld2 and similar tests. Yet I suspect the cause is an F90 opt-level issue in one of the surface components, based on the exxhash> lines. If we were to change our assessment to conclude that _D passes, we could then use the usual opt-level reduction on a handful of F90 files to solve the diff in practice.

ambrad avatar Jun 27 '23 19:06 ambrad

All of my PEM tests have worked the first time. And I can reproduce on pm-cpu/intel and frontier/cray.

ndkeen avatar Jun 27 '23 19:06 ndkeen

On pm-cpu/intel, have you run with the scream-internal_diagnostics_level testmod to isolate the diff?

ambrad avatar Jun 27 '23 19:06 ambrad

Re: the PEM_D fails, I'm seeing this in your test directory:

[… PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.20230623_205834_u7fj79]$ for i in `find . -name env_run.xml`; do echo $i; grep "\"STOP_N" $i; done
./case2/PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.20230623_205834_u7fj79/env_run.xml
    <entry id="STOP_N" value="5">
./env_run.xml
    <entry id="STOP_N" value="1">

The two STOP_N values are different. I did the same thing when setting up my repeat testing, which made me think you might have, too.
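
A quick way to guard against this mismatch, assuming the standard CIME xmlquery tool available in every case directory, is to query both cases directly rather than grepping env_run.xml:

# Compare STOP_N between the base case and the modified-PE case; a PEM
# comparison is only meaningful if both runs cover the same interval.
./xmlquery STOP_N
(cd case2/PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.20230623_205834_u7fj79 && ./xmlquery STOP_N)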

ambrad avatar Jun 27 '23 19:06 ambrad

On pm-cpu/intel (which again unfortunately is not ready out of the box, but I could make a branch), I see I did not turn on internal diagnostics with these simple tests, so I just started:

 PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level
 PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level

The shorter Ln6 test fails the compare (same with the longer run): /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22/PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level.r00

And on Frontier, using the newer branch, I just started:

PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream 
PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream

and

PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level
PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level

The DEBUG test on frontier with internal diag fails compare here: /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level.r00

Will update here when they complete.

ndkeen avatar Jun 27 '23 20:06 ndkeen

@ndkeen this looks promising: https://github.com/E3SM-Project/scream/commit/68a174902527d3fa831a5d4ccc55d27f6c763cee

If you'd like to put it through its paces on Frontier to confirm what I'm seeing and you find it works, I'll merge the commit into the machines/frontier branch.

You can get this commit by merging ambrad/frontier-cice-O0 into your local machines/frontier branch.

ambrad avatar Jun 27 '23 20:06 ambrad

Noel and I think the CICE optimization reduction is promising for Frontier: all of our tests have passed. I've merged the commit into machines/frontier.

ambrad avatar Jun 28 '23 00:06 ambrad

The DEBUG test on frontier with internal diag fails compare here: /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level.r00

  1. There are only two e3sm.log files in this test directory, so subsequent points refer to the only PEM test results available here:

$ find . -name e3sm.log\*
./run/e3sm.log.1363747.230627-164100.gz
./run/case2run/e3sm.log.1363747.230627-180418

  2. The FAIL is in RUN, not COMPARE:

FAIL PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level RUN time=7198
PEND PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level COMPARE_base_modpes

  3. The job ran out of time:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: slurmstepd: error: *** STEP 1363747.1 ON frontier10368 CANCELLED AT 2023-06-27T18:40:52 DUE TO TIME LIMIT ***

  4. The hash stream that is available shows the first diff is where the first run was able to go longer than the second:

$ zgrep "hash>" ./run/e3sm.log.1363747.230627-164100.gz > r1.txt; zgrep "hash>" ./run/case2run/e3sm.log.1363747.230627-180418 > r2.txt; diff r1.txt r2.txt | head -n 20
40853,98407d40852
< 0: exxhash>    1-  2.04167 0 1af82f2ef5aa3059 (mac_aero_mic-pre-sc-17)

  5. I conclude that this is not a valid debug-build diff; therefore, we have yet to see a debug-build diff.

ambrad avatar Jun 28 '23 18:06 ambrad

Yep, you're right; I did not look closely enough at the reasons for the fails. The DEBUG PEM tests on frontier are run fails (timeouts), not compare fails.

I can verify that at least one PEM (OPT build) does pass with the reduced opt flag in the CICE sources on frontier. But the machine has been down a while and no other tests have run.

On pm-cpu/intel, I still see the compare fail, even with PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level. It does not fail with DEBUG and does not fail with the gnu compiler. I tried reducing opt on the CICE sources in the same way as on frontier: no difference. I then tried reducing compiler opts in general, and I can get a PASS if I change the CXX compiler flag in EKAT from -O3 to -O0. When I try -O1 instead of -O0, I still see the compare fail. (A sketch of this one-line experiment follows the log below.)

login30% pwd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22/PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level.r00
login30% zgrep hash run/e3sm.log.10725409.230627-130518.gz > a
login30% zgrep hash run/case2run/e3sm.log.10725409.230627-130619.gz > b
login30% diff a b | head
266,267c266,267
< 0: hxxhash>              5 0 4e397ce9f7564c5f (T BE-post-ComposeTransport-q-HV-0)
< 0: hxxhash>              5 1  7e656ffc778219b (T BE-post-ComposeTransport-q-HV-0)
---
> 0: hxxhash>              5 0 4e397ce9f7563dc0 (T BE-post-ComposeTransport-q-HV-0)
> 0: hxxhash>              5 1  7e656ffc77812fc (T BE-post-ComposeTransport-q-HV-0)
271,272c271,272
< 0: hxxhash>              5 0 d077dd77b62f716d (T BE-post-ComposeTransport-q-HV-1)
< 0: hxxhash>              5 1 8a24b78d865146a9 (T BE-post-ComposeTransport-q-HV-1)
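
For reference, a minimal sketch of the one-line experiment described above, expressed as an edit in externals/ekat/cmake/EkatSetCompilerFlags.cmake; the exact insertion point within that file is paraphrased, not taken from the thread:

# Experiment: lower the release opt level for Intel C++ builds.
string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O0")   # compare PASSes
# string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O1") # compare still DIFFs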

ndkeen avatar Jun 28 '23 19:06 ndkeen

For the pm-cpu run, the hash lines show the diff is in the Hommexx version of SL transport. The hyperviscosity operator sees the diff first because of the boundary exchange pattern, but very likely it occurs in the core SL code and not HV. My thinking right now is this:

  1. The priority on running EAMxx on pm-cpu is quite low.
  2. This is extremely likely to be a compiler optimization-pass issue, not an application-side bug.

Therefore, I'm not going to attempt to resolve it right now.

Edit: It occurs to me that if the diff occurs even with -O1, the compiler might have an actual bug in its optimizer, given that, until now, we've never seen a diff come from SL transport.

ambrad avatar Jun 28 '23 19:06 ambrad

@ndkeen, one thing you might check is -fp-model for the C++ code. In intel_pm-cpu.cmake, I see

string(APPEND CXXFLAGS " -fp-model=precise") # and manually add precise
...
string(APPEND CXXFLAGS " -fp-model=consistent")

That is, it's precise, but then a later line passes consistent, which I think might partially or fully override precise. I don't understand consistent very well, but my impression is that it's not as safe as precise or source.
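
If the override is indeed the problem, a minimal cleanup sketch for intel_pm-cpu.cmake would be to append only a single fp-model; which setting to keep is an assumption here, not something this thread settles:

# Keep one fp-model so a later append cannot override an earlier one.
# "precise" is assumed; "source" would be a stricter alternative.
string(APPEND CXXFLAGS " -fp-model=precise")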

ambrad avatar Jun 28 '23 21:06 ambrad

-fp-model=consistent is the safest, but I think it is only available for ifort (well, the man page indicated ifort only, but icpx accepts it). I can try adjusting flags next.

ndkeen avatar Jun 29 '23 03:06 ndkeen

-fp-model=consistent is the safest, but I think it is only available for ifort (well, the man page indicated ifort only, but icpx accepts it). I can try adjusting flags next.

Ok, if consistent is at least as safe as precise for ifx, or is simply ignored, then that can't be it. Have you tried an ifort rather than an ifx build?

ambrad avatar Jun 29 '23 04:06 ambrad

Oh, I wasn't thinking when I wrote "ifx" above. On pm-cpu, it's the C++ compiler that matters, as all the dycore code is in C++.

Chrysalis uses icpc (ICC) 19.1.3.304 20200925.

ambrad avatar Jun 30 '23 03:06 ambrad

Currently, it looks like many C++ files are being built ignoring the flags set via CIME, so I've been editing externals/ekat/cmake/EkatSetCompilerFlags.cmake to experiment.

I found that on pm-cpu with Intel, I can get a compare PASS if I build with the default -O3 and then add -fp-model=precise to the ekat C++ files:

      string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O3")
      string(APPEND CMAKE_CXX_FLAGS_RELEASE " -fp-model=precise")

If this is deemed acceptable, we would need to make a change to ekat.

It might be a good time to set compiler flags in CIME, and let ekat/scream use them.

ndkeen avatar Jul 05 '23 23:07 ndkeen