non-BFB with diff PE layouts -- PEM.ne4pg2_ne4pg2.F2010-SCREAMv1 on pm-cpu/intel and frontier/cray
On pm-cpu with Intel, and only with an OPT build, we see non-BFB results when changing PE layouts.
Can reproduce with the scream repo as of June 22, but mods related to the Intel compiler are needed in the scream repo.
PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel
Also get a fail with only 16 MPI ranks, and after the very first step:
PEM_P16x1_Ln1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel
I could make a branch with the changes, but it also needs flag cleanup in ekat.
Starting with a crash with ne30 on chrysalis, we worked toward a smaller reproducer https://github.com/E3SM-Project/scream/issues/2381
While Andrew B reported that he does not see any fails with PEM on chrysalis, I do see a fail on frontier using cray compiler (using frontier branch on Jun19).
PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun19/PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.r00
Same fail with ne4:
PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream
Interestingly, it also fails in DEBUG on frontier:
PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream
Copying comment from https://github.com/E3SM-Project/scream/issues/2381#issuecomment-1605076280:
Runs on Chrysalis show no diffs. Perhaps this is an issue isolated to pm-cpu Intel.
Script:
tests=""
for npe in 256 362 512 640; do
for compiler in gnu intel; do
tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_${compiler}"
done
done
$e3sm/cime/scripts/create_test $tests --machine chrysalis --project $wcid -j 64
Results showing both PASS for each test and bfbhash comparison among tests:
$ ./cs.status.20230623_164037_6v0xpu | grep Overall; for compiler in gnu intel; do echo $compiler; for i in PEM_*${compiler}*; do zgrep bfbhash $i/run/e3sm.log* | tail -n 1; done; done
PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
PEM_P256x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
PEM_P362x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
PEM_P512x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_gnu (Overall: PASS) details:
PEM_P640x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.chrysalis_intel (Overall: PASS) details:
gnu
0: bfbhash> 36 d228e1e60be6efd2 (Hommexx)
0: bfbhash> 36 d228e1e60be6efd2 (Hommexx)
0: bfbhash> 36 d228e1e60be6efd2 (Hommexx)
0: bfbhash> 36 d228e1e60be6efd2 (Hommexx)
intel
0: bfbhash> 36 ed625355c8369ff8 (Hommexx)
0: bfbhash> 36 ed625355c8369ff8 (Hommexx)
0: bfbhash> 36 ed625355c8369ff8 (Hommexx)
0: bfbhash> 36 ed625355c8369ff8 (Hommexx)
So far Crusher does not reproduce any of these. A summary of results follows.
Repo is master at e57ed3848d.
PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
PEM_P32x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
PEM_P64x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level (Overall: PASS) details:
In addition, I ran repeat testing on PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream.scream-internal_diagnostics_level and stopped at 20 passes.
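The repeat testing can be sketched as the loop below. Here run_case is a hypothetical stand-in for resubmitting the case and reading its TestStatus; its stub body just returns PASS so the sketch is self-contained.

```shell
# Hypothetical sketch of repeat testing: rerun a case up to 20 times
# and stop early if the compare ever fails.
run_case() {
  # Stand-in for: ./case.submit, then grep the COMPARE line of TestStatus
  echo PASS
}

passes=0
for i in $(seq 1 20); do
  status=$(run_case)
  if [ "$status" != "PASS" ]; then
    echo "compare diff reproduced on iteration $i"
    break
  fi
  passes=$((passes + 1))
done
echo "stopped after $passes passes"
```

With the stub replaced by the real submit-and-check, this is essentially the procedure used to reach 20 passes on Crusher.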
I'm repeating a subset of these with the branch machines/frontier at https://github.com/E3SM-Project/scream/commit/2a918e18ef80ffb50eaea63a7f528e27fb3a0e32:
compiler=crayclang-scream
machine=crusher-scream-gpu
tests=""
for npe in 16 32 64 128; do
tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.${machine}_${compiler}"
done
echo $tests
$e3sm/cime/scripts/create_test $tests
Results:
PEM_P128x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM_P32x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
PEM_P64x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.crusher-scream-gpu_crayclang-scream (Overall: PASS) details:
So none of the code changes to ELM in the machines/frontier branch are a general problem.
Currently running the following on Frontier with machines/frontier at 2a918e18ef and will update this comment with results. The goal is to reproduce Noel's run, hopefully with ne4pg2 and the scream-internal_diagnostics_level testmod.
compiler=crayclang-scream
machine=frontier-scream-gpu
tests=""
for sfx in "" ".scream-internal_diagnostics_level"; do
tests+=" PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.${machine}_${compiler}${sfx}"
done
for npe in 16; do
for sfx in "" ".scream-internal_diagnostics_level"; do
tests+=" PEM_P${npe}x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.${machine}_${compiler}${sfx}"
done
done
echo $tests
$e3sm/cime/scripts/create_test $tests
PEM.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level indeed diffs, so we can examine the hashes. I see this as the first diff:
$ find . -name e3sm.log\*
./run/case2run/e3sm.log.1360717.230624-182408.gz
./run/e3sm.log.1360717.230624-182228.gz
$ zgrep "hash>" ./run/e3sm.log.1360717.230624-182228.gz > r1.txt
$ zgrep "hash>" ./run/case2run/e3sm.log.1360717.230624-182408.gz > r2.txt
$ diff r1.txt r2.txt | head -n 20
6564c6564
< 0: exxhash> 1- 0.33333 1 d474ba8f10ecada3 (SurfaceCouplingImporter-pst-sc-0)
---
> 0: exxhash> 1- 0.33333 1 d474ba8f10ecada5 (SurfaceCouplingImporter-pst-sc-0)
So the diff is occurring in one of the surface components or the importer, 1/3 of the way through the first day. Running the case again gives the same diff.
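The extraction-and-diff pattern used above, reduced to a self-contained sketch (synthetic one-line logs and made-up hash values stand in for the real gzipped e3sm.log files):

```shell
# Sketch of the bfb-hash triage workflow: pull the hash> lines out of
# each run's log, then diff them; the first differing line localizes
# the non-BFB behavior to a component and model time.
tmp=$(mktemp -d)
printf '0: exxhash> 1- 0.33333 1 aaaaaaaaaaaaaaa3 (SomeStage)\n' > "$tmp/log1"
printf '0: exxhash> 1- 0.33333 1 aaaaaaaaaaaaaaa5 (SomeStage)\n' > "$tmp/log2"

# Real runs use: zgrep "hash>" run/e3sm.log.*.gz > r1.txt
grep "hash>" "$tmp/log1" > "$tmp/r1.txt"
grep "hash>" "$tmp/log2" > "$tmp/r2.txt"

first_diff=$(diff "$tmp/r1.txt" "$tmp/r2.txt" | head -n 4)
echo "$first_diff"
rm -rf "$tmp"
```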
I see essentially the same diff for PEM_P16x1_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level:
1844c1844
< 0: exxhash> 1- 0.08333 1 77cd2b37308668a6 (SurfaceCouplingImporter-pst-sc-0)
---
> 0: exxhash> 1- 0.08333 1 77cd2b37308668a5 (SurfaceCouplingImporter-pst-sc-0)
Interestingly, so far (3 runs) I've not been able to reproduce Noel's PEM_D.ne4pg2_... failure. I'm going to try with the hash diagnostics off, in case they are affecting things. Later: One run passed. Now I'll run the repeat script. I'm doing 2-day runs: PEM_D_Ld2.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.
For now I'm working under the hypothesis that _D is in fact fine. I noticed that the Depends file for Frontier is missing the CICE opt mods that we use on Crusher, so I'm going to test those, since a diff in CICE is consistent with the hash lines above. Later: The change https://github.com/E3SM-Project/scream/commit/68a174902527d3fa831a5d4ccc55d27f6c763cee looks promising.
@ndkeen this is very unlikely, but is it possible that the DIFF you saw resulted from the following sequence?
1. Run PEM_D.ne4pg2...
2. The run gets cancelled due to the wallclock limit.
3. Manually change STOP_N in env_run.xml from 5 to 1 or 2.
4. Forget to do the same for the case2 env_run.xml file.
5. ./case.submit.
6. Diff due to the different number of days in the two runs.
I ask because so far I'm unable to reproduce the DIFF with PEM_D_Ld2 and similar tests. Yet I suspect the cause is an F90 opt-level issue in one of the surface components, based on the exxhash> lines. If we were to change our assessment to be that _D passes, we could then use the usual opt-level reduction on a bunch of F90 files to solve the diff in practice.
All of my PEM tests have worked the first time, and I can reproduce on pm-cpu/intel and frontier/cray.
On pm-cpu/intel, have you run with the scream-internal_diagnostics_level testmod to isolate the diff?
Re: the PEM_D fails, I'm seeing this in your test directory:
[[email protected] PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.20230623_205834_u7fj79]$ for i in `find . -name env_run.xml`; do echo $i; grep "\"STOP_N" $i; done
./case2/PEM_D.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.20230623_205834_u7fj79/env_run.xml
<entry id="STOP_N" value="5">
./env_run.xml
<entry id="STOP_N" value="1">
The two STOP_N values are different. I did the same thing when setting up my repeat testing, which made me think you might have, too.
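A cheap guard against this foot-gun is to compare STOP_N across the two env_run.xml files before resubmitting. A sketch, with synthetic files standing in for the real case and case2 directories:

```shell
# Detect a STOP_N mismatch between a PEM case and its case2 clone.
tmp=$(mktemp -d)
mkdir -p "$tmp/case2"
printf '<entry id="STOP_N" value="1">\n' > "$tmp/env_run.xml"
printf '<entry id="STOP_N" value="5">\n' > "$tmp/case2/env_run.xml"

a=$(grep -o 'id="STOP_N" value="[0-9]*"' "$tmp/env_run.xml")
b=$(grep -o 'id="STOP_N" value="[0-9]*"' "$tmp/case2/env_run.xml")
if [ "$a" = "$b" ]; then
  verdict="STOP_N consistent"
else
  verdict="STOP_N mismatch: $a vs $b"
fi
echo "$verdict"
rm -rf "$tmp"
```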
On pm-cpu/intel (which again unfortunately is not ready out-of-box, but I could make a branch) I see I did not turn on internal diagnostics with these simple tests, so I just started:
PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level
PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level
The shorter Ln6 test fails compare (same with the longer run):
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22/PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level.r00
And on Frontier, using newer branch, I just started:
PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream
PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream
and
PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level
PEM_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level
The DEBUG test on frontier with internal diag fails compare here:
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level.r00
Will update here when they complete
@ndkeen this looks promising: https://github.com/E3SM-Project/scream/commit/68a174902527d3fa831a5d4ccc55d27f6c763cee
If you'd like to put it through its paces on Frontier to confirm what I'm seeing and you find it works, I'll merge the commit into the machines/frontier branch.
You can get this commit by merging ambrad/frontier-cice-O0 into your local machines/frontier branch.
Noel and I think the CICE optimization reduction is promising for Frontier: all of our tests have passed. I've merged the commit into machines/frontier.
The DEBUG test on frontier with internal diag fails compare here: /lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level.r00
- There are only two e3sm.log files in this test directory, so subsequent points refer to the only PEM test results available here:
$ find . -name e3sm.log\*
./run/e3sm.log.1363747.230627-164100.gz
./run/case2run/e3sm.log.1363747.230627-180418
- The FAIL is in RUN, not COMPARE:
FAIL PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level RUN time=7198
PEND PEM_D_P8x1.ne4pg2_ne4pg2.F2010-SCREAMv1.frontier-scream-gpu_crayclang-scream.scream-internal_diagnostics_level COMPARE_base_modpes
- The job ran out of time:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
0: slurmstepd: error: *** STEP 1363747.1 ON frontier10368 CANCELLED AT 2023-06-27T18:40:52 DUE TO TIME LIMIT ***
- The hash stream that is available shows the first diff is where the first run was able to go longer than the second:
$ zgrep "hash>" ./run/e3sm.log.1363747.230627-164100.gz > r1.txt; zgrep "hash>" ./run/case2run/e3sm.log.1363747.230627-180418 > r2.txt; diff r1.txt r2.txt | head -n 20
40853,98407d40852
< 0: exxhash> 1- 2.04167 0 1af82f2ef5aa3059 (mac_aero_mic-pre-sc-17)
- I conclude that this is not a valid debug-build diff; therefore, we have yet to see a debug-build diff.
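The distinction drawn here (truncated stream vs. genuine change) can be checked mechanically: diff emits c-type hunks for changed lines, but only d/a hunks when one hash stream is simply a prefix of the other. A self-contained sketch with synthetic hash streams:

```shell
# Distinguish "one run simply ended early" from "hashes genuinely differ".
tmp=$(mktemp -d)
printf 'h1\nh2\nh3\n' > "$tmp/r1.txt"   # longer run
printf 'h1\n'         > "$tmp/r2.txt"   # run killed by the time limit

# A change hunk header looks like "6564c6564"; a pure deletion like "2,3d1".
if diff "$tmp/r1.txt" "$tmp/r2.txt" | grep -q '^[0-9][0-9,]*c[0-9]'; then
  verdict="genuine hash change"
else
  verdict="streams agree up to truncation"
fi
echo "$verdict"
rm -rf "$tmp"
```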
Yep, you're right, I did not look closely enough at the reasons for the fails: the DEBUG PEM tests on frontier are run fails (timeouts), not compare fails.
I can verify that at least one PEM (OPT build) does pass with the reduced opt flag in the CICE sources on frontier. But the machine has been down a while and no other tests have run.
On pm-cpu/intel, I still see the compare fail, even with
PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level
It does not fail with DEBUG and does not fail with gnu compiler.
I tried reducing opt on the CICE sources in the same way as on frontier; no difference.
I then tried reducing compiler opts in general, and I can get a PASS if I change the CXX compiler flag in EKAT from -O3 to -O0. When I try -O1 instead of -O0, I see a compare fail.
login30% pwd
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se69-jun22/PEM_P8x1_Ln6.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.scream-internal_diagnostics_level.r00
login30% zgrep hash run/e3sm.log.10725409.230627-130518.gz > a
login30% zgrep hash run/case2run/e3sm.log.10725409.230627-130619.gz > b
login30% diff a b | head
266,267c266,267
< 0: hxxhash> 5 0 4e397ce9f7564c5f (T BE-post-ComposeTransport-q-HV-0)
< 0: hxxhash> 5 1 7e656ffc778219b (T BE-post-ComposeTransport-q-HV-0)
---
> 0: hxxhash> 5 0 4e397ce9f7563dc0 (T BE-post-ComposeTransport-q-HV-0)
> 0: hxxhash> 5 1 7e656ffc77812fc (T BE-post-ComposeTransport-q-HV-0)
271,272c271,272
< 0: hxxhash> 5 0 d077dd77b62f716d (T BE-post-ComposeTransport-q-HV-1)
< 0: hxxhash> 5 1 8a24b78d865146a9 (T BE-post-ComposeTransport-q-HV-1)
For the pm-cpu run, the hash lines show the diff is in the Hommexx version of SL transport. The hyperviscosity operator sees the diff first because of the boundary exchange pattern, but very likely it occurs in the core SL code and not HV. My thinking right now is this: 1. The priority on running EAMxx on pm-cpu is quite low. 2. This is extremely likely to be a compiler optimization-pass issue, not an application-side bug. Therefore, I'm not going to attempt to resolve it right now.
Edit: It occurs to me that if the diff occurs with even -O1, the compiler might have an actual bug in its optimizer, given that, until now, we've never seen a diff come from SL transport.
@ndkeen, one thing you might check is -fp-model for the C++ code. In intel_pm-cpu.cmake, I see
string(APPEND CXXFLAGS " -fp-model=precise") # and manually add precise
...
string(APPEND CXXFLAGS " -fp-model=consistent")
That is, it's precise, but then a later line passes consistent, which I think might partially or fully override precise. I don't understand consistent very well, but my impression is it's not as safe as precise or source.
-fp-model=consistent is the most safe, but I think is only available for ifort (well the man page indicated ifort only, but icpx accepts it). I can try adjusting flags next.
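One way to check which -fp-model values a given front end actually accepts is to probe with a trivial translation unit. The compiler name below is an assumption (substitute icx, icpx, ifx, or ifort as needed); on a machine without that compiler, the sketch simply reports every flag as rejected:

```shell
# Probe which -fp-model values a compiler accepts (an Intel-only flag).
cc=${FP_PROBE_CC:-icx}        # hypothetical default; set FP_PROBE_CC to override
src=/tmp/fpprobe_$$.c
echo 'int main(void){return 0;}' > "$src"

n=0
for fp in precise consistent source strict fast; do
  if "$cc" -fp-model="$fp" -c "$src" -o /dev/null 2>/dev/null; then
    echo "accepted: -fp-model=$fp"
  else
    echo "rejected: -fp-model=$fp"
  fi
  n=$((n + 1))
done
rm -f "$src"
```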
Ok, if consistent is at least as safe as, or meaningless for, ifx, then that can't be it. Have you tried an ifort rather than ifx build?
Oh, I wasn't thinking when I wrote "ifx" above. On pm-cpu, it's the C++ compiler that matters, as all the dycore code is in C++.
Chrysalis uses icpc (ICC) 19.1.3.304 20200925.
Currently, it looks like many C++ files are being built ignoring flags set via CIME, so I've been editing externals/ekat/cmake/EkatSetCompilerFlags.cmake to experiment.
I found that on pm-cpu with Intel, I can get a compare PASS if I build with the default -O3 and then add -fp-model=precise to the ekat C++ files:
string(APPEND CMAKE_CXX_FLAGS_RELEASE " -O3")
string(APPEND CMAKE_CXX_FLAGS_RELEASE " -fp-model=precise")
If this is deemed acceptable, we would need to make a change to ekat.
It might be a good time to set compiler flags in CIME, and let ekat/scream use them.