ufs-weather-model
Having trouble running with UCX on WCOSS2
Description
Attempts to run an RRFS configuration using UCX rather than Slingshot fail as soon as the model starts integrating. The failures look similar to model instability failures, so it seems that NaNs are getting into the system somehow.
To Reproduce:
Use the /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/job_card.sh job card (on Dogwood) to run the case. This requires copying the run_dir and config_parms directories to your own space. job_card.sh_nonucx is a job card that avoids UCX and works for me.
Additional context
Very open to the idea that it is user error on my part, but could use help figuring out why it is failing the way it is.
Output
ucx failure log file on Dogwood: /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/OUTPUT_60h_41nodes_retry_newucxtest_v0.8.9
@GeorgeVandenberghe-NOAA Jun Wang recommended that I reach out to you about this issue. My attempts to use UCX for the RRFS application fail when the model starts integrating. My hope is that there is something wrong with my setup, and since you have experience running it for the global application, maybe you could take a look? Thanks!
Do you have a WCOSS2 CWD with testcase, a job to run it and (possibly) source code and the build?
--
George W Vandenberghe
Lynker Technologies at NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
@.***
301-683-3769(work) 3017751547(cell)
Most details you need are described in the "To reproduce" part of the issue - I do have a test setup on dogwood. I've been pointing at RRFS model executables, but could point you at a source if needed.
You can repair this in the ucx job by loading a later level of cray-mpich. When I do this the test job runs to timeout.
#module load cray-mpich-ucx/8.1.12
module load cray-mpich-ucx/8.1.19
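A job card could guard against the older module with a small check like the one below. This is only a sketch: the function name `ucx_mpich_ok` and the variable `MIN_UCX_MPICH` are made up for illustration, and treating every version at or above 8.1.19 as acceptable is an assumption, since only 8.1.19 itself was tested in this thread.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin (e.g. `module list` output)
# names a cray-mpich-ucx module at or above the minimum version.
MIN_UCX_MPICH="8.1.19"   # version reported to work in this thread

ucx_mpich_ok() {
  local ver
  # Pull the version string out of the first cray-mpich-ucx/<version> entry.
  ver=$(grep -o 'cray-mpich-ucx/[0-9.]*' | head -n 1 | cut -d/ -f2)
  # No cray-mpich-ucx module found at all.
  [ -n "$ver" ] || return 1
  # sort -V orders version strings; the minimum must sort first (or be equal).
  [ "$(printf '%s\n%s\n' "$MIN_UCX_MPICH" "$ver" | sort -V | head -n 1)" = "$MIN_UCX_MPICH" ]
}

# Example use in a job card (module list often writes to stderr):
#   module list 2>&1 | ucx_mpich_ok || { echo "cray-mpich-ucx too old or not loaded" >&2; exit 1; }
```

The check reads text from stdin rather than calling `module` itself, so it can be tried on a saved `module list` output away from WCOSS2.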
Thanks @GeorgeVandenberghe-NOAA will give that a try!
I have confirmed that going to cray-mpich-ucx/8.1.19 solves my issue, so I am closing it.
@MatthewPyle-NOAA is there any issue with using UCX?
@junwang-noaa I'm still looking into something - it definitely initializes much more quickly, but seems a bit slower beyond that point.
I lost my test case on Dogwood after the problem was closed. Do you have a CWD and source on Cactus?
@GeorgeVandenberghe-NOAA I have things under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307 on Cactus. job_card.sh uses UCX, and job_card.sh_nonucx doesn't. I accidentally scrubbed some job log files from earlier today, but for a 60 h forecast on 153 nodes I have seen that UCX saves about 7 minutes in the time until f00 output is written, and is then about 9 minutes slower than non-UCX going from f00 to f60. So far I've just been pointing at an RRFS executable. Would you recommend recompiling the code against the UCX modules?
The UCX stuff should be shared libraries and recompiling won't affect it. Do you have a source and build in that directory?
I'll go ahead and snag it. I had gotten rid of my testcases after the problem was closed.
Okay. I'm using cray-mpich/8.1.12 for the non-UCX test. Hopefully the level of cray-mpich doesn't explain the difference.
60 h forecast times on Cactus (Dogwood was very similar):
oo.o:The total amount of wall time = 15178.327225 ofi
oou:The total amount of wall time = 14899.522355 ucx
The difference (about 279 s, or roughly 4.6 minutes, in favor of ucx) looks to come from better startup times with ucx, with no evidence that ucx integration is then slowed.
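For anyone repeating the comparison, the wall-time lines can be pulled from the two logs and differenced with a short script. This is a sketch under stated assumptions: the function names `wall_time` and `wall_time_diff` and the log file names in the usage comment are hypothetical, and it assumes each log contains a line of the form "The total amount of wall time = <seconds>" as quoted above.

```shell
#!/bin/bash
# Hypothetical helper: print the last "total amount of wall time" value
# (in seconds) found in the given log file.
wall_time() {
  awk -F'=' '/The total amount of wall time/ {t=$2}
             END {gsub(/[ \t]/, "", t); print t}' "$1"
}

# Hypothetical helper: print first-minus-second in seconds and minutes.
wall_time_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN {d = a - b; printf "%.1f s (%.1f min)\n", d, d / 60}'
}

# Example (log file names are placeholders):
#   wall_time_diff "$(wall_time ofi_run.log)" "$(wall_time ucx_run.log)"
# With the two values quoted in this thread:
#   wall_time_diff 15178.327225 14899.522355   # prints "278.8 s (4.6 min)"
```

This only compares total wall time. Separating startup from integration, as discussed earlier in the thread, would need the time at which f00 output is written, which these two lines do not provide.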