
gdaseupd memory issues on Hera

Open DavidHuber-NOAA opened this issue 1 year ago • 6 comments

What is wrong?

Periodically, the gdaseupd job fails on Hera with memory issues when run at C384. The job usually runs successfully on the second attempt. Reported by @wx20jjung and @CatherineThomas-NOAA.

What should have happened?

The job should have enough memory to complete successfully.

What machines are impacted?

Hera

Steps to reproduce

Run a C384 experiment with 80 members. Eventually, an eupd job will fail.

Additional information

N/A

Do you have a proposed solution?

@wx20jjung found that changing the runtime layout to 5 PEs per node with 8 threads each (instead of 8 PEs per node with 5 threads) and 80 PEs total (instead of 270) resolves the problem. This resulted in much shorter queue wait times and only about 5 minutes of additional run time.
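
For reference, a minimal sketch of this layout in terms of the resource variables quoted later in this thread (npe_eupd and nth_eupd in config.resources); this is illustrative only, not a patch against any particular branch:

    # Proposed eupd layout for C384 on Hera (values from this thread)
    export npe_eupd=80    # total MPI tasks (previously 270)
    export nth_eupd=8     # threads per task (previously 5)
    # At 5 tasks per node, each task gets a larger share of the node's memory.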

DavidHuber-NOAA avatar Apr 03 '24 12:04 DavidHuber-NOAA

@wx20jjung Can you please share the configuration that you used for workflow/setup_expt.py? I would like to test your configuration before submitting a PR to address this issue. Thank you.

HenryRWinterbottom avatar May 21 '24 18:05 HenryRWinterbottom

I changed my configuration after running setup_expt.py, editing the config files and the *.xml directly.


wx20jjung avatar May 21 '24 21:05 wx20jjung

@wx20jjung Can you please provide me the commands that you passed to setup_expt.py before you changed the XML? I want to run your configuration to make sure the processor topology is not a red herring. Thank you.

HenryRWinterbottom avatar May 21 '24 23:05 HenryRWinterbottom

Henry, all of my edits came after running setup_expt.py and setup_xml.py. Here is what I used on hera:

setup_expt.py gfs cycled --idate 2024013118 --edate 2024040518 --app ATM --gfs_cyc 1 --resdetatmos 384 --resensatmos 192 --nens 80 --cdump GDAS --pslot test --comroot /scratch1/NCEPDEV/stmp2/Jim.Jung --expdir /home/Jim.Jung/para --icsdir /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/gdas_init/output

setup_xml.py /home/Jim.Jung/para/test

I then edited config.resources to 80 cores (tasks), changing

    export npe_eupd=270

to

    export npe_eupd=80

and test.xml to 16 nodes with the task/thread ratio switched, changing

    <nodes>54:ppn=8:tpp=5</nodes>

to

    <nodes>16:ppn=5:tpp=8</nodes>

There are no OpenMP statements in the eupd code, so the task/thread ratio just limits the number of tasks on a node.
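
To illustrate why fewer tasks per node relieves the memory pressure, a rough back-of-the-envelope calculation (the usable memory per Hera compute node is an assumed round number here, used only to show the scaling):

    #!/bin/bash
    # Illustrative only: assume roughly this much usable memory per compute node.
    node_mem_gb=90
    echo "8 tasks/node -> ~$((node_mem_gb / 8)) GB per task"   # old layout (ppn=8:tpp=5)
    echo "5 tasks/node -> ~$((node_mem_gb / 5)) GB per task"   # new layout (ppn=5:tpp=8)
    # Going from 8 to 5 tasks per node gives each task roughly 1.6x more memory.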


wx20jjung avatar May 22 '24 01:05 wx20jjung

@wx20jjung Can you please open /home/Jim.Jung/para or move your EXPDIR somewhere that I can read it?

HenryRWinterbottom avatar May 22 '24 17:05 HenryRWinterbottom

All of my para directory is copied to /scratch1/NCEPDEV/jcsda/Jim.Jung/save/para


wx20jjung avatar May 22 '24 19:05 wx20jjung

Not sure whether it is related, but issue #2506 (https://github.com/NOAA-EMC/global-workflow/issues/2506) also reported enkfgdaseupd task crashes.

guoqing-noaa avatar May 23 '24 19:05 guoqing-noaa

@wx20jjung I have been unable to get the configuration that you passed me (https://github.com/NOAA-EMC/global-workflow/issues/2454#issuecomment-2123731785) to work.

Instead, I created a branch (https://github.com/HenryWinterbottom-NOAA/global-workflow/tree/feature/gwdev_issue_2454) where I applied your suggested fix. In addition, I have built the branch on RDHPCS; it can be found at /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. Please use either the repo or the compiled G-W to test this configuration.

Once you do, please pass me the path to your Rocoto files. Thank you in advance.
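
If it helps, a hypothetical way to check out the test branch (branch and repository names as given above; build steps and any submodule handling are not shown and may differ):

    # Illustrative checkout of the test branch referenced above
    git clone -b feature/gwdev_issue_2454 https://github.com/HenryWinterbottom-NOAA/global-workflow
    cd global-workflow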

HenryRWinterbottom avatar May 24 '24 18:05 HenryRWinterbottom

@HenryWinterbottom-NOAA Is there a reason you did not flip the task/thread ratio? This is the most important change. In config.resources, at line 1033, try changing export nth_eupd=5 to export nth_eupd=8. This allows more memory per task, which is needed.


wx20jjung avatar May 24 '24 19:05 wx20jjung

@wx20jjung Thank you. That was a typo/oversight on my part.

Please try again.

HenryRWinterbottom avatar May 24 '24 19:05 HenryRWinterbottom

@HenryWinterbottom-NOAA I ran both Python setup scripts from your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454. I had to change nth_eupd=5 to nth_eupd=8 and change the task/thread ratio in the *.xml file from 16:ppn=8:tpp=5 to 16:ppn=5:tpp=8. These changes would not be necessary if the appropriate change were made in the global-workflow/parm/config/gfs/config.resources file.
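
For the record, a hypothetical one-liner version of those two manual edits (file locations assumed from this thread: config.resources is relative to the checked-out workflow, and the Rocoto XML path depends on the experiment directory):

    # Illustrative only: reproduce the two manual edits described above
    sed -i 's/export nth_eupd=5/export nth_eupd=8/' parm/config/gfs/config.resources
    sed -i 's/16:ppn=8:tpp=5/16:ppn=5:tpp=8/' $EXPDIR/test.xml   # adjust to your Rocoto XML path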

As you requested, I started a cycling experiment using your /scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454 build. The comroot directory is /scratch1/NCEPDEV/stmp2/Jim.Jung/scrub/HW_test. I have successfully run through 2 cycles of eupd. The run time log files are in the appropriate directory under $comroot/log.

wx20jjung avatar May 28 '24 12:05 wx20jjung

@wx20jjung Can you please point me to your Rocoto xml file?

HenryRWinterbottom avatar May 29 '24 15:05 HenryRWinterbottom

@HenryWinterbottom-NOAA I put the file here: /scratch1/NCEPDEV/jcsda/Jim.Jung/noscrub/HW_test.xml

wx20jjung avatar May 29 '24 15:05 wx20jjung