ufs-weather-model

Having trouble running with UCX on WCOSS2

Open MatthewPyle-NOAA opened this issue 10 months ago • 13 comments

Description

Attempts to run an RRFS configuration using UCX rather than Slingshot have failed as soon as the model begins integrating. The failures look like model instability failures, so it seems that NaNs are getting into the system somehow.

To Reproduce:

Utilize the /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/job_card.sh job card (on dogwood) to run the case (will require copying the run_dir and config_parms directories to your own space). job_card.sh_nonucx is a job card that avoids ucx and works for me.

Additional context

Very open to the idea that it is user error on my part, but could use help figuring out why it is failing the way it is.

Output

ucx failure log file on Dogwood: /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307/OUTPUT_60h_41nodes_retry_newucxtest_v0.8.9

MatthewPyle-NOAA avatar Apr 08 '24 19:04 MatthewPyle-NOAA

@GeorgeVandenberghe-NOAA Jun Wang recommended that I reach out to you about this issue. My attempts to use UCX for the RRFS application fail when the model starts integrating. My hope is that there is something wrong with my setup, and since you have experience running UCX for the global application, maybe you could take a look? Thanks!

MatthewPyle-NOAA avatar Apr 22 '24 17:04 MatthewPyle-NOAA

Do you have a WCOSS2 CWD with testcase, a job to run it and (possibly) source code and the build?

--

George W Vandenberghe

*Lynker Technologies at * NOAA/NWS/NCEP/EMC

5830 University Research Ct., Rm. 2141

College Park, MD 20740

@.***

301-683-3769(work) 3017751547(cell)

GeorgeVandenberghe-NOAA avatar Apr 22 '24 17:04 GeorgeVandenberghe-NOAA

Most details you need are described in the "To reproduce" part of the issue - I do have a test setup on dogwood. I've been pointing at RRFS model executables, but could point you at a source if needed.

MatthewPyle-NOAA avatar Apr 22 '24 18:04 MatthewPyle-NOAA

You can repair this in the ucx job by loading a later level of cray-mpich. When I do this the test job runs to timeout.

#module load cray-mpich-ucx/8.1.12
module load cray-mpich-ucx/8.1.19
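
A minimal sketch of how that change might look in a job card, with a check that the intended module is active before the model launches. Only the two module lines come from this thread; the `module list` check is an illustrative addition, and any site-specific network or compiler modules the job card already loads are assumed to stay as they are.

```shell
# Previous level, left commented out as in the fix above:
#module load cray-mpich-ucx/8.1.12
module load cray-mpich-ucx/8.1.19

# Illustrative check (not from the thread): 'module list' prints to stderr,
# so redirect it before filtering for the MPICH entry.
module list 2>&1 | grep -i mpich
```

If the filtered listing still shows the older level, the job card is probably loading it again further down.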

GeorgeVandenberghe-NOAA avatar Apr 23 '24 17:04 GeorgeVandenberghe-NOAA

Thanks @GeorgeVandenberghe-NOAA will give that a try!

MatthewPyle-NOAA avatar Apr 23 '24 17:04 MatthewPyle-NOAA

I have confirmed that going to cray-mpich-ucx/8.1.19 solves my problem, so I am closing the issue.

MatthewPyle-NOAA avatar Apr 23 '24 19:04 MatthewPyle-NOAA

@MatthewPyle-NOAA is there any issue with using UCX?

junwang-noaa avatar Apr 30 '24 12:04 junwang-noaa

@junwang-noaa I'm still looking into something - it definitely initializes much more quickly, but seems a bit slower beyond that point.

MatthewPyle-NOAA avatar Apr 30 '24 12:04 MatthewPyle-NOAA

I lost my testcase on dogwood after the problem was closed. Do you have a CWD and source on Cactus?

GeorgeVandenberghe-NOAA avatar Apr 30 '24 14:04 GeorgeVandenberghe-NOAA

@GeorgeVandenberghe-NOAA I have things under /lfs/h2/emc/lam/noscrub/Matthew.Pyle/rrfs_optimization_20240307 on cactus. job_card.sh uses UCX, and job_card.sh_nonucx doesn't. I accidentally scrubbed some job log files from earlier today, but for a 60 h forecast on 153 nodes I have seen that UCX saves about 7 minutes in the time to f00 output being written, but is then about 9 minutes slower than non-UCX going from f00 to f60. So far I've just been pointing at an RRFS executable. Would you recommend recompiling the code pointing at UCX modules?

MatthewPyle-NOAA avatar Apr 30 '24 18:04 MatthewPyle-NOAA

The UCX stuff should be shared libraries and recompiling won't affect it. Do you have a source and build in that directory?

I'll go ahead and snag it. I had gotten rid of my testcases after the problem was closed.
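
One way to see which communication libraries an existing executable resolves at run time, without rebuilding, is to inspect its dynamic dependencies after loading the modules of interest. This is a sketch only: the function name `show_comm_libs` is hypothetical, the library-name pattern is an assumption about how MPICH, UCX, and libfabric libraries are typically named, and the executable path in the usage line is a placeholder.

```shell
show_comm_libs() {
    # ldd lists the shared libraries the dynamic loader would resolve
    # for the given executable in the current environment.
    ldd "$1" 2>/dev/null \
        | grep -E 'libmpi|libuc[pstm]|libfabric' \
        || echo "no MPI, UCX, or libfabric libraries listed for $1"
}

# Usage (placeholder path): show_comm_libs ./path/to/model_executable
```

Running it once under the UCX modules and once under the non-UCX modules should show whether the same executable picks up different libraries in each case.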

GeorgeVandenberghe-NOAA avatar Apr 30 '24 18:04 GeorgeVandenberghe-NOAA

Okay. I'm using cray-mpich/8.1.12 for the non-UCX test. Hopefully the level of cray-mpich doesn't explain the difference.

MatthewPyle-NOAA avatar Apr 30 '24 19:04 MatthewPyle-NOAA

60 h forecast times on Cactus (dogwood was very similar):

oo.o:The total amount of wall time = 15178.327225 ofi
oou:The total amount of wall time = 14899.522355 ucx

The difference appears to come from faster startup with ucx; there is no evidence that integration under ucx is then slowed.
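
For scale, the gap between the two wall-time totals can be worked out directly. The sketch below only does arithmetic on the two numbers reported above; the variable names are illustrative.

```shell
# Wall-clock totals quoted above, in seconds.
ofi=15178.327225
ucx=14899.522355

# awk handles the floating-point arithmetic that plain shell cannot.
summary=$(awk -v ofi="$ofi" -v ucx="$ucx" 'BEGIN {
    diff = ofi - ucx
    printf "%.1f s (%.1f min, %.1f%%)", diff, diff / 60, 100 * diff / ofi
}')
echo "ucx finished faster by $summary"
# prints: ucx finished faster by 278.8 s (4.6 min, 1.8%)
```

That is a saving of a little under five minutes on a run of roughly four hours and ten minutes.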

GeorgeVandenberghe-NOAA avatar Jul 29 '24 14:07 GeorgeVandenberghe-NOAA