scream icon indicating copy to clipboard operation
scream copied to clipboard

[WIP] Move Frontier build to gfortran and latest rocm

Open trey-ornl opened this issue 1 year ago • 26 comments

Added new cime_config/machines/cmake_macros/craygnu-hipcc.cmake file, which uses the Cray compiler wrappers with Gnu compilers, along with hipcc for AMD GPUs. Changed Adios2 path to an appropriate OLCF build. Deleted all Crusher files. Crusher no longer exists.

trey-ornl avatar Aug 26 '24 19:08 trey-ornl

I used this branch, before the last merge with master, to try to reproduce the E3SM failure on the "bad node" of Frontier, which now has the alias borg106. The run did fail, while runs on other nodes completed without failing. Diffs of the e3sm.log.* files eventually varied in their hashes, however, typically starting with exxhash around line 569. Here are the runs:

/lustre/orion/world-shared/cli115/trey/olcfhelp-14845/case/2024-08-*

I gunzip-ed the e3sm.log files for the successful runs.

trey-ornl avatar Aug 26 '24 19:08 trey-ornl

A two-node run of ne30 ran to completion on Frontier with this branch before the final merge with master.

/lustre/orion/world-shared/cli115/trey/scream-ne30-2node/case

trey-ornl avatar Aug 26 '24 19:08 trey-ornl

Am I correct that the current s/w stack can't be run from this branch?

ambrad avatar Aug 26 '24 19:08 ambrad

Am I correct that the current s/w stack can't be run from this branch?

Yes, as written, it replaces the current stack. Shall I modify it so that the previous stack can still be selected? It could key off of the compiler choice.

The change to components/eamxx/CMakeLists.txt could be tricky.

trey-ornl avatar Aug 26 '24 19:08 trey-ornl

Tagging @rljacob @ndkeen @jgfouca @PeterCaldwell @AaronDonahue.

Trey, I don't know the right answers to your questions in terms of what everyone wants. But in terms of technical steps:

  1. We can give machines/compiler entries uniquifying names to permit parallel stacks.
  2. The CMakeLists.txt change could probably be handled with slightly better conditionals. In particular, I think querying the Fortran compiler ID might help, e.g., something based on CMAKE_Fortran_COMPILER_ID MATCHES "gfortran".

ambrad avatar Aug 26 '24 20:08 ambrad

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

E3SM-Bot avatar Aug 26 '24 20:08 E3SM-Bot

I marked this PR as WIP so that the autotester doesn't run on it until it's ready.

ambrad avatar Aug 26 '24 20:08 ambrad

I've isolated at least one nondeterministic diff to P3 (small-kernels version). I'll first try the deopt that worked for the (differently expressed) issue with P3 using ROCm 5.7. I'll also look at the monolithic version of P3. If nothing works, then I'll have to do some manual work to isolate the problem further.

ambrad avatar Aug 26 '24 21:08 ambrad

@trey-ornl the P3 deoptimization is working. Thus, in your performance testing, if you haven't started already, I suggest adding this:

diff --git a/cime_config/machines/cmake_macros/craygnu-hipcc.cmake b/cime_config/machines/cmake_macros/craygnu-hipcc.                                                                                                                        
cmake                                                                                                                                                                                                                                        
index 8322c5d3a9..6cb79c0146 100644                                                                                                                                                                                                          
--- a/cime_config/machines/cmake_macros/craygnu-hipcc.cmake                                                                                                                                                                                  
+++ b/cime_config/machines/cmake_macros/craygnu-hipcc.cmake                                                                                                                                                                                  
@@ -5,7 +5,7 @@ set(SCC "cc")                                                                                                                                                                                                                
 set(SCXX "hipcc")                                                                                                                                                                                                                           
 set(SFC "ftn")                                                                                                                                                                                                                              
                                                                                                                                                                                                                                             
-string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU")                                                                                                                                                                     
+string(APPEND CPPDEFS " -DLINUX -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU -DSCREAM_SYSTEM_WORKAROUND_P3_PART2")                                                                                                                                 
 if (COMP_NAME STREQUAL gptl)                                                                                                                                                                                                                
     string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDA                                                                                                                        
Y")                                                                                                                                                                                                                                          
 endif()                                                                                                                                                                                                                                     
diff --git a/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp b/components/eamxx/src/physics/p3/disp/                                                                                                                        
p3_main_impl_part2_disp.cpp                                                                                                                                                                                                                  
index 2b619d54bf..28445a35c3 100644                                                                                                                                                                                                          
--- a/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp                                                                                                                                                                       
+++ b/components/eamxx/src/physics/p3/disp/p3_main_impl_part2_disp.cpp                                                                                                                                                                       
@@ -9,7 +9,9 @@ namespace p3 {                                                                                                                                                                                                               
  * Implementation of p3 main function. Clients should NOT #include                                                                                                                                                                          
  * this file, #include p3_functions.hpp instead.                                                                                                                                                                                            
  */                                                                                                                                                                                                                                         
-                                                                                                                                                                                                                                            
+#ifdef SCREAM_SYSTEM_WORKAROUND_P3_PART2                                                                                                                                                                                                    
+# pragma clang optimize off                                                                                                                                                                                                                 
+#endif                                                                                                                                                                                                                                      
 template <>                                                                                                                                                                                                                                 
 void Functions<Real,DefaultDevice>                                                                                                                                                                                                          
 ::p3_main_part2_disp(                                                                                                                                                                                                                       
@@ -130,7 +132,9 @@ void Functions<Real,DefaultDevice>                                                                                                                                                                                       
     if (!hydrometeorsPresent(i)) return;                                                                                                                                                                                                    
   });                                                                                                                                                                                                                                       
 }                                                                                                                                                                                                                                           
-                                                                                                                                                                                                                                            
+#ifdef SCREAM_SYSTEM_WORKAROUND_P3_PART2                                                                                                                                                                                                    
+# pragma clang optimize on                                                                                                                                                                                                                  
+#endif                                                                                                                                                                                                                                      
 } // namespace p3                                                                                                                                                                                                                           
 } // namespace scream

I'll edit this comment with more information as I gather it.

First, with the above, I got 11 passes, 0 fails, of ERS_P8x1_Ln90.ne30pg2_ne30pg2.F2010-SCREAMv1.frontier-scream-gpu_craygnu-hipcc.scream-internal_diagnostics_level. All final bfbhash lines are identical. Without the fix, this test failed every time.

Second, I tried monolithic-kernel P3 and get

0: exxhash>    1-  0.00000 0 de8745af60d78ad9 (p3-pre-sc-0)
1: Memory access fault by GPU node-9 (Agent handle: 0x18b37710) on address 0x7ffefffff000. Reason: Unknown.
2: Memory access fault by GPU node-6 (Agent handle: 0x18b37710) on address 0x7ffec094c000. Reason: Unknown.
6: Memory access fault by GPU node-4 (Agent handle: 0x18b37710) on address 0x7ffeffff9000. Reason: Unknown.
4: Failed to allocate file: Bad file descriptor
4: GPU core dump failed
0: Failed to allocate file: Bad file descriptor
0: GPU core dump failed
4: Memory access fault by GPU node-10 (Agent handle: 0x18b37710) on address 0x7ffd99999000. Reason: Unknown.

The core file with gdb doesn't give me a stack trace, but the hasher output is enough to know that the crash is in P3, as we would expect given this one change.

Third, I tried to deopt the corresponding "part2" section of code. It either doesn't solve the problem or the pragma doesn't work within a kernel. If I put the pragma outside the kernel invocation (deoptimizing the entire monolithic P3), then the test runs. I'm at 5 passes, 0 fails, all final bfbhash lines identical. The lines are also identical to the small-kernels-P3 case, although that isn't necessary.

ambrad avatar Aug 26 '24 21:08 ambrad

@trey-ornl you should key off the compiler choice but we want to deprecate all machine entries with "scream" in the title. Maybe now would be a good time to do that.

rljacob avatar Aug 26 '24 22:08 rljacob

Also I think the E3SM convention for this compiler option name is "craygnugpu" . We have not put the vendor-specific GPU language in the names. Not sure if we should.

rljacob avatar Aug 26 '24 22:08 rljacob

Why would the compiler name not be just gnugpu?

ndkeen avatar Aug 26 '24 22:08 ndkeen

Also I think the E3SM convention for this compiler option name is "craygnugpu" . We have not put the vendor-specific GPU language in the names. Not sure if we should.

"craygnugpu" says to me that the Cray wrappers will use the Gnu compilers to offload to GPUs. That is not the case here. The intended message for the name to convey is this: "Use Cray wrappers with Gnu compilers for the host, and use the AMD Hip compiler for the GPU (currently named 'hipcc')." Maybe "craygnuamdpgu" or "craygnu-amdgpu"?

trey-ornl avatar Aug 26 '24 22:08 trey-ornl

Why would the compiler name not be just gnugpu?

The name "gnugpu" says to me that the Gnu compilers are used directly ("g++", "gfortran"), and that they are used to compile for the GPUs. Neither is true in this case.

trey-ornl avatar Aug 26 '24 22:08 trey-ornl

@trey-ornl good point about the GPU compile. I guess craygnuamdpgu is what we should use.

Yes "craygnu" implies cray wrappers while "gnu" is just calling gfortran directly.

Ignoring ones with "scream", we currently have:

amdclang_frontier.cmake				crayclang_frontier.cmake
amdclanggpu_frontier.cmake			crayclanggpu_frontier.cmake
gnu_frontier.cmake		gnugpu_frontier.cmake

So we'd be adding "craygnuamdgpu_frontier.cmake"

rljacob avatar Aug 27 '24 02:08 rljacob

Ignore my comment about editing the "frontier" entry. We'll do that later. Go ahead and add to frontier-scream-gpu

rljacob avatar Aug 27 '24 02:08 rljacob

@ambrad, are the tests in components/eamxx/src/physics/p3/tests useful? Do you know if they detect the issue in p3_main_part2?

trey-ornl avatar Aug 29 '24 15:08 trey-ornl

I don't use those tests for nondeterminism analysis and have not run them in years. I consider a single-node ne30 ERS test to be the most useful test configuration.

ambrad avatar Aug 29 '24 16:08 ambrad

Why did you merge master in to this feature branch? We discourage that unless you have a specific reason.

rljacob avatar Aug 30 '24 01:08 rljacob

Why did you merge master in to this feature branch? We discourage that unless you have a specific reason.

Newbie mistake. How do you test your branch with all the latest changes without merging them in first? Is the expectation that we only test our changes relative to where the branch started?

Please let me know if there is TFM that I should R.

trey-ornl avatar Aug 30 '24 14:08 trey-ornl

To answer your question, I think CI does a merge to master and tests that. You could also do a merge to master in your local copy, test, then reset.

rljacob avatar Aug 31 '24 02:08 rljacob

How do you test your branch with all the latest changes without merging them in first? Is the expectation that we only test our changes relative to where the branch started?

Rebase (git fetch --all && git pull --rebase upstream master) and then force-push (git push --force)

mahf708 avatar Sep 03 '24 01:09 mahf708

If you are collaboratively working with someone else, it's even better to use git push --force-with-lease [...].

From the docs:

--force-with-lease alone, without specifying the details, will protect all remote refs that are going to be updated by requiring their current value to be the same as the remote-tracking branch we have for them. So if someone else pushed something to the remote branch in the meantime, your force push will fail.

bartgol avatar Sep 05 '24 21:09 bartgol

@trey-ornl , is this PR ready to be "un-WIP'ed" and tested?

AaronDonahue avatar Sep 17 '24 16:09 AaronDonahue

@trey-ornl , would ou mind manually testing on frontier with master merged in to make sure everything still works as expected? Once you report back the all-clear we can merge.

AaronDonahue avatar Sep 25 '24 22:09 AaronDonahue

@trey-ornl , would ou mind manually testing on frontier with master merged in to make sure everything still works as expected? Once you report back the all-clear we can merge.

Just built and ran an ne30 performance test from @ndkeen yesterday on Frontier with these changes merged into master. New build (compiler craygnuamdgpu) ran 3% faster than the old (compiler crayclang-scream).

trey-ornl avatar Sep 25 '24 22:09 trey-ornl

I did a test of the decadal run with the new craygnuamdgpu build, and it ran very very slowly. I tried the libfabric/1.20.1 workaround, and it then ran very (just one very instead of two) slowly. Then I changed the build to load module libfabric/1.15.2.0, and got performance as good as the "old" build. I updated this pull request with the change, along with an updated path to an appropriate build of Adios2 (with the new build modules + libfabric/1.15.2.0).

Timing results from 5-day decadal run on 2048 nodes of Frontier. Existing crayclang-scream:

TOT Run Time:    3788.487 seconds      757.697 seconds/mday         0.31 myears/wday 

Previous version of new craygnuamdpgu: Did not even reach the first hash output in a two-hour run. Previous version of new craygnuamdgpu with FI_MR_CACHE_MONITOR=disabled:

TOT Run Time:    5789.003 seconds     1157.801 seconds/mday         0.20 myears/wday 

Current new craygnuamdgpu with module load libfabric/1.15.2.0:

TOT Run Time:    3753.817 seconds      750.763 seconds/mday         0.32 myears/wday

We will likely want to update the modules again once Frontier has a fixed version of Libfabric.

trey-ornl avatar Sep 30 '24 21:09 trey-ornl

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

E3SM-Bot avatar Sep 30 '24 21:09 E3SM-Bot

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

E3SM-Bot avatar Sep 30 '24 21:09 E3SM-Bot

Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing.

E3SM-Bot avatar Sep 30 '24 21:09 E3SM-Bot