
gdaseupd memory issues on Hera

Open DavidHuber-NOAA opened this issue 1 year ago • 6 comments

What is wrong?

Periodically, the gdaseupd job fails on Hera with memory issues when run at C384. The job usually runs successfully on the second attempt. Reported by @wx20jjung and @CatherineThomas-NOAA.

What should have happened?

The job should have enough memory to complete successfully.

What machines are impacted?

Hera

Steps to reproduce

Run a C384 experiment with 80 members. Eventually, an eupd job will fail.

Additional information

N/A

Do you have a proposed solution?

@wx20jjung found that changing the runtime layout to 5 PEs per node with 8 threads each (instead of 8 PEs per node with 5 threads) and 80 PEs total (instead of 270) resolves the problem. This resulted in much shorter queue wait times and only about 5 minutes of additional run time.
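
For reference, a minimal sketch of this layout in terms of the resource variables quoted later in this thread (npe_eupd and nth_eupd in config.resources); this is illustrative only, not a patch against any particular branch:

    # Proposed eupd layout for C384 on Hera (values from this thread)
    export npe_eupd=80    # total MPI tasks (previously 270)
    export nth_eupd=8     # threads per task (previously 5)
    # At 5 tasks per node, each task gets a larger share of the node's memory.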

DavidHuber-NOAA avatar Apr 03 '24 12:04 DavidHuber-NOAA

@wx20jjung Can you please share the configuration that you used for workflow/setup_expt.py? I would like to test your configuration before submitting a PR to address this issue. Thank you.

HenryRWinterbottom avatar May 21 '24 18:05 HenryRWinterbottom

I changed my configuration after running setup_expt.py, editing the config files and the *.xml directly.


wx20jjung avatar May 21 '24 21:05 wx20jjung

@wx20jjung Can you please provide me the commands that you passed to setup_expt.py before you changed the XML? I want to run your configuration to make sure the processor topology is not a red herring. Thank you.

HenryRWinterbottom avatar May 21 '24 23:05 HenryRWinterbottom

Henry, all of my edits came after running setup_expt.py and setup_xml.py. Here is what I used on hera:

setup_expt.py gfs cycled --idate 2024013118 --edate 2024040518 --app ATM --gfs_cyc 1 --resdetatmos 384 --resensatmos 192 --nens 80 --cdump GDAS --pslot test --comroot /scratch1/NCEPDEV/stmp2/Jim.Jung --expdir /home/Jim.Jung/para --icsdir /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/gdas_init/output

setup_xml.py /home/Jim.Jung/para/test

I then edited config.resources to 80 cores (tasks), changing

    export npe_eupd=270

to

    export npe_eupd=80

and test.xml to 16 nodes with the task/thread ratio switched, changing

    <nodes>54:ppn=8:tpp=5</nodes>

to

    <nodes>16:ppn=5:tpp=8</nodes>

There are no OpenMP statements in the eupd code, so the task/thread ratio just limits the number of tasks on a node.
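
To illustrate why fewer tasks per node relieves the memory pressure, a rough back-of-the-envelope calculation (the usable memory per Hera compute node is an assumed round number here, used only to show the scaling):

    #!/bin/bash
    # Illustrative only: assume roughly this much usable memory per compute node.
    node_mem_gb=90
    echo "8 tasks/node -> ~$((node_mem_gb / 8)) GB per task"   # old layout (ppn=8:tpp=5)
    echo "5 tasks/node -> ~$((node_mem_gb / 5)) GB per task"   # new layout (ppn=5:tpp=8)
    # Going from 8 to 5 tasks per node gives each task roughly 1.6x more memory.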


wx20jjung avatar May 22 '24 01:05 wx20jjung

@wx20jjung Can you please open /home/Jim.Jung/para or move your EXPDIR somewhere that I can read it?

HenryRWinterbottom avatar May 22 '24 17:05 HenryRWinterbottom

All of my para directory is copied to /scratch1/NCEPDEV/jcsda/Jim.Jung/save/para


wx20jjung avatar May 22 '24 19:05 wx20jjung

Not sure whether it is related, but issue #2506 (https://github.com/NOAA-EMC/global-workflow/issues/2506) also reported enkfgdaseupd task crashes.

guoqing-noaa avatar May 23 '24 19:05 guoqing-noaa

@wx20jjung I have been unable to get the configuration that you passed me (https://github.com/NOAA-EMC/global-workflow/issues/2454#issuecomment-2123731785) to work.

Instead, I created a branch (https://github.com/HenryWinterbottom-NOAA/global-workflow/tree/feature/gwdev_issue_2454) where I applied your suggested fix. In addition, I have built the branch on RDHPCS; it can be found at /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. Please use either the repo or the compiled G-W to test this configuration.

Once you do, please pass me the path to your Rocoto files. Thank you in advance.
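
If it helps, a hypothetical way to check out the test branch (branch and repository names as given above; build steps and any submodule handling are not shown and may differ):

    # Illustrative checkout of the test branch referenced above
    git clone -b feature/gwdev_issue_2454 https://github.com/HenryWinterbottom-NOAA/global-workflow
    cd global-workflow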

HenryRWinterbottom avatar May 24 '24 18:05 HenryRWinterbottom

@HenryWinterbottom-NOAA Is there a reason you did not flip the task/thread ratio? This is the most important change. In config.resources, at line 1033, try changing export nth_eupd=5 to export nth_eupd=8. This allows more memory per task, which is needed.


wx20jjung avatar May 24 '24 19:05 wx20jjung

@wx20jjung Thank you. That was a typo/oversight on my part.

Please try again.

HenryRWinterbottom avatar May 24 '24 19:05 HenryRWinterbottom

@HenryWinterbottom-NOAA I ran both Python setup scripts from your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. I had to change nth_eupd=5 to nth_eupd=8 and change the task/thread ratio in the *.xml file from 16:ppn=8:tpp=5 to 16:ppn=5:tpp=8. These changes would not be necessary if the appropriate change were made in the global-workflow/parm/config/gfs/config.resources file.
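
For the record, a hypothetical one-liner version of those two manual edits (file locations assumed from this thread: config.resources is relative to the checked-out workflow, and the Rocoto XML path depends on the experiment directory):

    # Illustrative only: reproduce the two manual edits described above
    sed -i 's/export nth_eupd=5/export nth_eupd=8/' parm/config/gfs/config.resources
    sed -i 's/16:ppn=8:tpp=5/16:ppn=5:tpp=8/' $EXPDIR/test.xml   # adjust to your Rocoto XML path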

As you requested, I started a cycling experiment using your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454 build. The comroot directory is /scratch1/NCEPDEV/stmp2/Jim.Jung/scrub/HW_test. I have successfully run through 2 cycles of eupd. The run time log files are in the appropriate directory under $comroot/log.

wx20jjung avatar May 28 '24 12:05 wx20jjung

@wx20jjung Can you please point me to your Rocoto xml file?

HenryRWinterbottom avatar May 29 '24 15:05 HenryRWinterbottom

@HenryWinterbottom-NOAA I put the file here: /scratch1/NCEPDEV/jcsda/Jim.Jung/noscrub/HW_test.xml

wx20jjung avatar May 29 '24 15:05 wx20jjung