gdaseupd memory issues on Hera
What is wrong?
Periodically, the gdaseupd job fails on Hera with memory issues when run at C384. The job usually runs successfully on the second attempt. Reported by @wx20jjung and @CatherineThomas-NOAA.
What should have happened?
The job should have enough memory to complete successfully.
What machines are impacted?
Hera
Steps to reproduce
Run a C384 experiment with 80 members. Eventually, an eupd job will fail.
Additional information
N/A
Do you have a proposed solution?
@wx20jjung found that changing the runtime layout to 5 PEs per node with 8 threads (instead of 8 PEs per node with 5 threads) and 80 PEs total (instead of 270) fixes the problem. This resulted in much shorter queue wait times and only about 5 minutes longer run time.
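For orientation, a minimal sketch of the two edits that layout implies (the full details are in the comments below). Hera compute nodes have 40 cores, so 5 tasks × 8 threads fills a node; exact file locations and defaults vary by global-workflow version:

```bash
# Sketch only -- variable names match config.resources in this era of global-workflow,
# but file locations and defaults vary by version.
export npe_eupd=80   # total eupd tasks (was 270)
export nth_eupd=8    # threads per task (was 5); 5 tasks/node * 8 threads = 40 cores/node

# Corresponding Rocoto node spec (80 tasks / 5 tasks per node = 16 nodes):
#   <nodes>16:ppn=5:tpp=8</nodes>
```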
@wx20jjung Can you please share the configuration that you used for workflow/setup_expt.py? I would like to test your configuration before submitting a PR to address this issue. Thank you.
I changed my configuration after running setup_expt.py, editing directly within the config files and the *.xml.
@wx20jjung Can you please provide me the commands that you passed to setup_expt.py before you changed the XML? I want to run your configuration to make sure the processor topology is not a red herring. Thank you.
Henry, all of my edits came after running setup_expt.py and setup_xml.py. Here is what I used on Hera:
setup_expt.py gfs cycled --idate 2024013118 --edate 2024040518 --app ATM --gfs_cyc 1 --resdetatmos 384 --resensatmos 192 --nens 80 --cdump GDAS --pslot test --comroot /scratch1/NCEPDEV/stmp2/Jim.Jung --expdir /home/Jim.Jung/para --icsdir /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/gdas_init/output
setup_xml.py /home/Jim.Jung/para/test
I then edited config.resources to 80 cores (tasks):

< export npe_eupd=80
> export npe_eupd=270

and test.xml to 16 nodes, switching the task/thread ratio:

< <nodes>16:ppn=5:tpp=8</nodes>
> <nodes>54:ppn=8:tpp=5</nodes>
There are no OpenMP statements in the eupd code, so the task/thread ratio just limits the number of tasks on a node.
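To make the memory argument concrete, a rough back-of-the-envelope sketch (the 96 GB figure is an assumption for Hera's standard 40-core compute nodes; check the actual node specs before relying on it):

```bash
# Assumed figure for illustration: ~96 GB usable memory per 40-core Hera node.
node_mem_gb=96
echo "ppn=8 (old layout): $((node_mem_gb / 8)) GB per task"   # ~12 GB/task
echo "ppn=5 (new layout): $((node_mem_gb / 5)) GB per task"   # ~19 GB/task
```

Since the eupd executable is not OpenMP-threaded, the extra threads buy no speedup; they just leave cores idle so each MPI task gets roughly 60% more memory.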
@wx20jjung Can you please open /home/Jim.Jung/para or move your EXPDIR somewhere that I can read it?
All of my para directory is copied to /scratch1/NCEPDEV/jcsda/Jim.Jung/save/para
Not sure whether it is related, but issue #2506 (https://github.com/NOAA-EMC/global-workflow/issues/2506) also reported enkfgdaseupd task crashes.
@wx20jjung I have been unable to get the configuration that you passed me here (https://github.com/NOAA-EMC/global-workflow/issues/2454#issuecomment-2123731785) to work.
Instead, I created a branch (https://github.com/HenryWinterbottom-NOAA/global-workflow/tree/feature/gwdev_issue_2454) where I applied your suggested fix. In addition, I have built the branch on RDHPCS; it can be found at /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. Please use either the repo or the compiled G-W to test the configuration linked above.
Once you do, please pass me the path to your Rocoto files. Thank you in advance.
@HenryWinterbottom-NOAA Is there a reason you did not flip the task/thread ratio? This is the most important change. In config.resources, at line 1033, try changing export nth_eupd=5 to export nth_eupd=8. This allows more memory per task, which is needed.
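A hedged sketch of that edit (the location of config.resources and the line number vary across global-workflow versions, and nth_eupd may be set in more than one place, so verify by hand):

```bash
# Illustrative only; adjust the path for your checkout and confirm the match is unique.
sed -i 's/^export nth_eupd=5$/export nth_eupd=8/' config.resources
grep -n 'nth_eupd' config.resources   # verify the change
```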
@wx20jjung Thank you. That was a typo/oversight on my part.
Please try again.
@HenryWinterbottom-NOAA I ran both Python setup scripts from your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. I had to change nth_eupd=5 to nth_eupd=8 and change the task/thread ratio in the *.xml file from ppn=8:tpp=5 to ppn=5:tpp=8, as described above.
As you requested, I started a cycling experiment. This is using your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. The comroot directory is /scratch1/NCEPDEV/stmp2/Jim.Jung/scrub/HW_test. I have successfully run through 2 cycles of eupd. The runtime log files are in the appropriate directory under $comroot/log.
@wx20jjung Can you please point me to your Rocoto xml file?
@HenryWinterbottom-NOAA I put the file here: /scratch1/NCEPDEV/jcsda/Jim.Jung/noscrub/HW_test.xml
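For completeness, the cycle status of such an experiment can be inspected with the standard Rocoto CLI; a sketch with an illustrative database path (the .db file name and location depend on how the experiment was set up):

```bash
# Illustrative; pair the XML with its companion Rocoto database file.
rocotostat -w /scratch1/NCEPDEV/jcsda/Jim.Jung/noscrub/HW_test.xml -d HW_test.db
```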