Improve RRFS model startup time.
The RRFS production team requested help understanding why their ensemble runs bring the Lustre file systems on Cactus and Dogwood to their knees (Ticket#2023032810000014). Pete Johnsen reports file system utilization of ~400 GB/s on a disk-based Lustre file system. Furthermore, job startup times increase with the number of ensemble members: I've measured 310-second startup times using 24 members running on Cactus:/lfs/h2.
An investigation revealed an enormous number of unexpectedly small read operations (< 2 KB) during model startup. The following files were implicated:
- phy_data.nc
- sfc_data.nc
- fv_tracer.res.tile1.nc
- fv_core.res.tile1.nc
- C3463_grid.tile7.nc
Altering file striping or file chunking alone didn't resolve the problem, though both play a role in the final solution. Forcing the reads onto rank zero and then broadcasting to the appropriate rank did improve startup time, but did not increase the size of the reads.
Steps to reproduce the behavior:
I have a small code on the Cactus system that reads a single variable from the fv_core.res.tile1.nc file. I've used this unit test to evaluate potential solutions. Let me know if you want this code.
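For reference, a minimal sketch of what such a single-variable reader could look like is below. This is not the actual tester; the variable name "u" and its three dimensions are assumptions.

program read_one_var
  use netcdf
  implicit none
  integer :: ncid, varid, err, i
  integer :: dimids(3), dimlens(3)
  real, allocatable :: buf(:,:,:)

  ! Open the restart file read-only through the default (serial) NetCDF path.
  err = nf90_open("INPUT/fv_core.res.tile1.nc", NF90_NOWRITE, ncid)
  if (err /= NF90_NOERR) stop "open failed"

  ! Look up one variable ("u" is only an example) and its dimension lengths.
  err = nf90_inq_varid(ncid, "u", varid)
  err = nf90_inquire_variable(ncid, varid, dimids=dimids)
  do i = 1, 3
    err = nf90_inquire_dimension(ncid, dimids(i), len=dimlens(i))
  end do

  ! Read the full variable; this is the operation the unit tester exercises.
  allocate(buf(dimlens(1), dimlens(2), dimlens(3)))
  err = nf90_get_var(ncid, varid, buf)
  err = nf90_close(ncid)
end program read_one_var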
This occurs on the NOAA production systems (WCOSS2). Currently Loaded Modules:
1) craype-x86-rome (H)
2) libfabric/1.11.0.0 (H)
3) craype-network-ofi (H)
4) envvar/1.0
5) craype/2.7.17
6) PrgEnv-intel/8.3.3
7) intel/19.1.3.304
8) cray-mpich/8.1.12
9) cray-pals/1.2.2
10) netcdf/4.7.4
11) hdf5/1.10.6
12) jasper/2.0.25
13) zlib/1.2.11
14) libpng/1.6.37
15) libjpeg/9c
16) udunits/2.2.28
The proposed solution is the combination of the following changes:
- enable MPI-IO collective buffering within the FMS source code (the subject of the current discussion)
- use appropriate NetCDF variable chunking (one chunk per z-level; a sketch follows this list)
- use alternate Lustre file striping (one stripe per available disk OST, each of size 2 MB)
- set MPICH_MPIIO_HINTS in the PBS script
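For the chunking item in the list above, here is a minimal sketch of how one chunk per z-level can be requested through the netcdf-fortran API when a 3-D restart variable is defined. The subroutine, variable, and dimension names are placeholders, and in practice the existing restart files might instead be re-chunked offline. The striping and MPICH_MPIIO_HINTS items are applied outside the source code, typically via lfs setstripe on the input files and an environment variable exported in the PBS job script.

subroutine def_level_chunked_var(ncid, dimids, nx, ny, varid)
  ! Define a NetCDF-4 variable whose chunks each cover one full horizontal
  ! plane (nx x ny) but only a single z-level.
  use netcdf
  implicit none
  integer, intent(in)  :: ncid, dimids(3), nx, ny
  integer, intent(out) :: varid
  integer :: err
  err = nf90_def_var(ncid, "example_var", NF90_DOUBLE, dimids, varid, &
                     chunksizes=(/nx, ny, 1/))
  if (err /= NF90_NOERR) stop "nf90_def_var failed"
end subroutine def_level_chunked_var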
The proposed solution has been tested on the same RRFS case and results in file system utilization of ~80 GB/s sustained for only ~77 seconds when running 24 members. The proposed FMS code modifications include specifying NF90_MPIIO in the mode argument to nf90_open in fms2_io/netcdf_io.F90. I'm doing this only for the specified files.
use mpi, only: MPI_COMM_WORLD, MPI_INFO_NULL
.
.
.
! Only the listed RRFS input files are opened for parallel access with MPI-IO;
! all other files keep the default serial open with the FMS chunk-size hint.
if(string_compare(trim(fileobj%path), "INPUT/phy_data.nc"              , .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/fv_tracer.res.tile1.nc"   , .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/sfc_data.nc"              , .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.nc"      , .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/C3463_grid.tile7.halo3.nc", .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/fv_core.res.tile1.nc"     , .true.) ) then
  err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                  comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
else
  err = nf90_open(trim(fileobj%path), nf90_nowrite, fileobj%ncid, chunksize=fms2_ncchksz)
endif
The other code change is to call nf90_var_par_access() with the nf90_collective option for all variables in specified files before calling nf90_get_var(). This is done in fms2_io/include/netcdf_read_data.inc for r4 and r8 2D and 3D variables.
e.g.
! Request collective access for every variable read from these files so that
! MPI-IO can aggregate the per-rank requests into large reads.
if(string_compare(trim(fileobj%path), "INPUT/phy_data.nc", .true.) .or. &
   string_compare(trim(fileobj%path), "INPUT/sfc_data.nc", .true.) ) then
  err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
endif
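Once collective access has been requested, it is the subsequent nf90_get_var call that actually becomes a collective MPI-IO read. A sketch of the pattern, with is, js, nx, ny, nz standing in for the rank's compute-domain bounds (placeholders, not the actual fms2_io variable names):

! Every rank in the communicator makes the same collective call; the MPI-IO
! layer can then aggregate the per-rank subdomain requests into large reads.
err = nf90_get_var(fileobj%ncid, varid, buf, &
                   start=(/is, js, 1/), count=(/nx, ny, nz/))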
The proposed code changes do, however, result in a "double free or corruption" error at the end of the run. I suspect this is coming from my direct use of MPI_COMM_WORLD and/or MPI_INFO_NULL. I need some help here.
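One possibility, offered purely as a sketch (not tested against the double free): give the NetCDF/HDF5 layer a duplicate of the communicator and free it only after the parallel file handle is closed, instead of handing it MPI_COMM_WORLD directly.

use mpi
integer :: io_comm, ierr

! Duplicate the world communicator so the parallel-IO layer has its own.
call MPI_Comm_dup(MPI_COMM_WORLD, io_comm, ierr)
err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                comm=io_comm, info=MPI_INFO_NULL)
! ... collective reads ...
err = nf90_close(fileobj%ncid)
! Release the duplicate only after the file is closed.
call MPI_Comm_free(io_comm, ierr)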
I'm also not sure we want to hard-code the file names in the FMS code. I need some suggestions for a solution here.
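One hedged idea for avoiding the hard-coded names (the namelist group and variable names below are hypothetical, not an existing FMS namelist): let a namelist list the files that should take the parallel path, so the run configuration rather than the library source decides. In FMS this would presumably go through the existing namelist machinery; the explicit open/read here is only to keep the sketch self-contained.

integer, parameter :: max_parallel_files = 32
character(len=256) :: parallel_read_files(max_parallel_files) = ""
namelist /fms2_io_parallel_nml/ parallel_read_files

integer :: nml_unit, io_status, i
logical :: use_parallel_open

! Read the hypothetical namelist once, e.g. during fms2_io initialization.
open(newunit=nml_unit, file="input.nml", status="old", action="read")
read(nml_unit, nml=fms2_io_parallel_nml, iostat=io_status)
close(nml_unit)

! At open time, match the incoming path against the configured list instead
! of hard-coding the RRFS file names in the source.
use_parallel_open = .false.
do i = 1, max_parallel_files
  if (len_trim(parallel_read_files(i)) == 0) exit
  if (string_compare(trim(fileobj%path), trim(parallel_read_files(i)), .true.)) then
    use_parallel_open = .true.
    exit
  end if
end do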
I have started a branch with the above changes. https://github.com/dkokron/FMS/tree/fms.ParallelStartup
Posting IO performance profiles for three different methods of reading RRFS restart files on the WCOSS2 Lustre file system. These profiles are from a quiet disk-based Lustre file system (HPE ClusterStor E1000) when running 30 ensemble members with 14 nodes per member.
- default FMS method where all MPI ranks read the input files
- FMS 2023.01.01 update where a single MPI rank reads the input files and uses MPP_Scatter to distribute the data
- Parallel NetCDF4 prototype from Dan Kokron (fms.ParallelStartup branch)
The goal is to reduce pressure on the file system while achieving the shortest model initialization times when running many ensemble members concurrently.
@dkokron make sure you pull the updates to main. There are a lot of changes coming with the next release and I wouldn't want your code to get left behind and have large merge conflicts.
@pj-gdit - I believe you had some standalone tests that were used to generate this data, is it possible you could share that test code with us? I am interested in prototyping some solutions that could allow us to properly augment the existing IO layer.
@thomas-robinson Done.
Pete used something I put together. I'll package it up and send it to you.
See attached ForRustyBenson.tgz
You'll only need to re-stripe the fv_core.res.tile1.nc file for this unit tester. I mistakenly included instructions in the README for re-striping the other files too.
@dkokron @pj-gdit thanks for opening this issue.
@bensonr and @thomas-robinson Is it possible for this item to get into the repository in some accelerated/expedited fashion? This is critical for the eventual operational implementations of RRFS and 3DRTMA. Thanks!
Am tagging @junwang-noaa as she has expressed an interest in this topic as well.
@JacobCarley-NOAA - As there have been conversations going on concurrently in email and now here, I'm copying the reply I sent to others earlier this week.
We are starting to prototype an IO offload system, something we've been talking about for years now. Adding the proposed NetCDF4 updates is something we could look to incorporate as part of that work, but I don't expect anything to be available within the next six months.
I know you most likely need this sooner, and since you have knowledge of parallel NetCDF4 and have delved into FMS, I'd encourage you to add this as an option to the fms2_io subsystem and submit a PR. The FMS infrastructure library is an important part of our modeling system, and any contributions would need to adhere to our guidelines and pass our tests. If this is something you are willing to do, I'd suggest putting together a short project plan and we (GFDL) can look it over and, if needed, have a meeting to discuss.
Fixed in #1477