global-workflow

The gfsgenesis task failed on Jet

Open: guoqing-noaa opened this issue (4 comments)

What is wrong?

The error message:


+ Unknown[43]: [[ -d /gpfs/hps ]]
+ Unknown[48]: [[ -L /usrx ]]
+ Unknown[53]: [[ -d /scratch2 ]]
+ Unknown[58]: [[ -d /work ]]
+ Unknown[63]: [[ -d /lfs3 ]]
+ Unknown[68]: [[ -d /lfs/h1 ]]
+ Unknown[74]: machine=unknown
+ Unknown[74]: export machine
+ Unknown[75]: echo Job failed: unknown platform
+ Unknown[75]: 1>& 2
Job failed: unknown platform
+ Unknown[76]: err_exit 'FAILED genesis.1116443 - ERROR IN unknown platform - ABNORMAL EXIT'

-------------------------------------------------------------
-- FATAL ERROR: FAILED genesis.1116443 - ERROR IN unknown platform - ABNORMAL EXIT
-- ABNORMAL EXIT at Sat May 25 04:16:36 UTC 2024 on k3
-------------------------------------------------------------

What should have happened?

The gfsgenesis task should complete successfully.

What machines are impacted?

Jet

Steps to reproduce

Running a forecast-only experiment on Jet reproduces the error.

Additional information

The error occurs in /lfs4/HFIP/hfv3gfs/glopara/git/TC_tracker/feature-GFSv17_com_reorg/scripts/exgfs_tc_genesis.sh, where lines 128-131 read as follows:

elif [[ -d /lfs3 ]] ; then
  # We are on NOAA Jet
  machine=jet
  ${USHens_tracker}/extrkr_gen_gfs.sh ${loopnum} ${cmodel} ${pert} ${pertdir} #2>&1 >${outfile}

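Jet no longer has an /lfs3 directory, so this branch is never taken. None of the other directory tests match either, so the script sets machine=unknown and exits with the "unknown platform" error shown above.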

Do you have a proposed solution?

Update TC_tracker to a more recent version so that it detects Jet correctly.

guoqing-noaa, May 25 '24

Alternatively, could we update ens_tracker_ver in run.spack.ver from export ens_tracker_ver=feature-GFSv17_com_reorg to export ens_tracker_ver=v1.1.15.6?
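A minimal sketch of that version bump, assuming run.spack.ver pins the tracker with a single line that starts exactly with "export ens_tracker_ver=". The function name set_ens_tracker_ver is hypothetical, and the file's location in the workflow tree is not given in this thread, so the path is passed as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the ens_tracker_ver pin in a run.spack.ver-style file.
# Assumes the pin is an unindented line of the form "export ens_tracker_ver=<version>".
set_ens_tracker_ver() {
  local file="$1" new_ver="$2"
  local tmp
  tmp="$(mktemp)"
  # Replace the value on the pin line only; every other line passes through unchanged.
  sed "s/^export ens_tracker_ver=.*/export ens_tracker_ver=${new_ver}/" "${file}" > "${tmp}"
  # Write back through the existing file so its permissions are preserved.
  cat "${tmp}" > "${file}"
  rm -f "${tmp}"
}

# Example (placeholder path): set_ens_tracker_ver /path/to/run.spack.ver v1.1.15.6
```

Writing through a temporary file instead of using sed -i avoids the difference in -i syntax between GNU and BSD sed.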

guoqing-noaa, May 25 '24

@HananehJafary-NOAA @InnocentSouopgui-NOAA FYI

WalterKolczynski-NOAA, Sep 10 '24

Please see issue https://github.com/NOAA-EMC/global-workflow/issues/2841. After the failure of /lfs4, libraries on S4 are moving to /lfs5.

Pull request https://github.com/NOAA-EMC/global-workflow/pull/2878 has the fix for nearly everything. Unfortunately, we cannot push it through yet because of a bug in one component that would affect other systems.

Also, the TC_Tracker component has not moved yet, but it will move soon.

If you want to try the version with the fix, let me know.

You can also mirror the fix in your local working copy.

For instance, you can replace /lfs3 or /lfs1 with /lfs5 in /lfs4/HFIP/hfv3gfs/glopara/git/TC_tracker/feature-GFSv17_com_reorg/scripts/exgfs_tc_genesis.sh.
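A minimal sketch of that local edit, assuming the Jet probe is written exactly in the form of the excerpt quoted in the issue (elif [[ -d /lfs3 ]] ; then). The function name patch_jet_probe is hypothetical; it only rewrites directory tests of that exact form, keeps a copy of the original with a .orig suffix, and leaves other paths such as /lfs4/... untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point the Jet directory probe in a local copy of
# exgfs_tc_genesis.sh (or a similar script) at /lfs5 instead of /lfs1 or /lfs3.
patch_jet_probe() {
  local script="$1"
  local tmp
  # Keep the unmodified script next to the patched one (overwrites an older .orig).
  cp -p "${script}" "${script}.orig"
  tmp="$(mktemp)"
  # Match only "[[ -d /lfs1 ]]" and "[[ -d /lfs3 ]]"; the trailing " ]]" keeps /lfs4/... paths intact.
  sed 's#\[\[ -d /lfs[13] ]]#[[ -d /lfs5 ]]#g' "${script}" > "${tmp}"
  # Write back through the existing file so its permissions are preserved.
  cat "${tmp}" > "${script}"
  rm -f "${tmp}"
}

# Example (path from this thread):
# patch_jet_probe /lfs4/HFIP/hfv3gfs/glopara/git/TC_tracker/feature-GFSv17_com_reorg/scripts/exgfs_tc_genesis.sh
```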

Proceed with caution if you want to fix your local copy, as there is no guarantee of /lfs4 being mounted on the compute nodes.
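One way to act on this caution is to check, from inside the job itself, that a needed directory is actually visible before using it. A minimal sketch; the function name require_dir is hypothetical, and because it only reports what the node it runs on can see, it has to run in the compute job rather than on a login node.

```shell
#!/usr/bin/env bash
# Hypothetical guard for the top of a job script: fail early, with a clear
# message, if a directory the job needs is not visible on this node.
require_dir() {
  local dir="$1"
  if [[ ! -d "${dir}" ]]; then
    echo "FATAL: required directory ${dir} is not visible on $(hostname)" >&2
    return 1
  fi
}

# Example: require_dir /lfs4/HFIP/hfv3gfs/glopara/git/TC_tracker || exit 1
```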

InnocentSouopgui-NOAA, Sep 10 '24

I have migrated everything for TC_Tracker on Jet from /lfs4 to /lfs5. You can clone the updated package from my repository: https://github.com/HananehJafary-NOAA/tracker_package

HananehJafary-NOAA, Sep 10 '24

OBE (overtaken by events).

DavidHuber-NOAA, Mar 24 '25