module changes after perlmutter downtime - maint-2.1

Open nanr opened this issue 1 year ago • 6 comments

I am having trouble submitting jobs on pm using maint-2.1. The problem started today, notably after the machine downtime yesterday.

I followed the error codes to load upgraded modules, but I'm not able to figure out how to get past this error:

    v21.LR.BSMYLE.1995-11.001/case_scripts.014> module --ignore-cache load "cray-netcdf-hdf5parallel/4.9.0.7"
    Lmod has detected the following error: The following module(s) are unknown: "cray-netcdf-hdf5parallel/4.9.0.7"

Here are my env_mach_specific.xml settings:

  <command name="load">craype</command>
  <command name="load">cray-libsci</command>
  <command name="load">cray-mpich/8.1.28</command>
  <command name="load">cray-hdf5-parallel/1.12.2.9</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.7</command>
  <command name="load">cray-parallel-netcdf/1.12.3.9</command>
  <command name="load">cmake/3.22.0</command>

I also added this directly to my env_mach_specific.xml file: PrgEnv-intel/8.5.0 intel/2023.2.0

Thanks in advance for any ideas!

nanr avatar Jan 18 '24 19:01 nanr

OK, looks like I need to update the branches. E3SM master does have module versions that will work if you want to copy those for now.

ndkeen avatar Jan 18 '24 20:01 ndkeen

Thanks! Can you point me in the right direction on where to find a list of the working module versions?

Thank you!

nanr avatar Jan 18 '24 21:01 nanr

Actually, it looks like that branch already had updated modules. I think you just have not pulled recently enough.

With a fresh clone of maint-2.1, you should see:

      <modules>
        <command name="load">craype-accel-host</command>
        <command name="load">craype/2.7.20</command>
        <command name="load">cray-mpich/8.1.25</command>
        <command name="load">cray-hdf5-parallel/1.12.2.3</command>
        <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
        <command name="load">cray-parallel-netcdf/1.12.3.3</command>
        <command name="load">cmake/3.24.3</command>
      </modules>

I'm still going to make a change to this maint branch and others to update PE layouts.

ndkeen avatar Jan 18 '24 21:01 ndkeen

I had to make these module updates in order to do a case.setup:

      <modules>
        <command name="load">craype-accel-host</command>
        <command name="load">craype/2.7.20</command>
        <command name="load">cray-mpich/8.1.28</command>
        <command name="load">cray-hdf5-parallel/1.12.2.9</command>
        <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
        <command name="load">cray-parallel-netcdf/1.12.3.9</command>
        <command name="load">cmake/3.24.3</command>
      </modules>

But I'm still getting this error:

    ERROR: module command /usr/share/lmod/lmod/libexec/lmod python load craype-accel-host craype/2.7.20 cray-mpich/8.1.28 cray-hdf5-parallel/1.12.2.9 cray-netcdf-hdf5parallel/4.9.0.3 cray-parallel-netcdf/1.12.3.9 cmake/3.24.3 failed with message:
    Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "cray-netcdf-hdf5parallel/4.9.0.3"
    Try: "module spider cray-netcdf-hdf5parallel/4.9.0.3" to see how to load the module(s).

    v21.LR.BSMYLE.1995-11.001/case_scripts.014> module avail cray-netcdf-hdf5parallel
    No module(s) or extension(s) found!

nanr avatar Jan 18 '24 23:01 nanr

You may have made other changes. If you can, try checking out a fresh clone of maint-2.1 and building a test there. If that works, it means you just have some differences between your config_machines.xml and the one in the repo.

ndkeen avatar Jan 19 '24 00:01 ndkeen
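
One way to spot such differences is to compare the module load entries in the two files directly. The following is only a minimal sketch, not part of E3SM or CIME: the helper names `load_commands` and `diff_modules` are hypothetical, and the two XML fragments are copied from the module lists posted earlier in this thread rather than read from real files.

```python
import xml.etree.ElementTree as ET


def load_commands(xml_text):
    """Return {module_name: version} for every <command name="load"> entry.

    A module listed without a version maps to an empty string.
    """
    root = ET.fromstring(xml_text)
    modules = {}
    for cmd in root.iter("command"):
        if cmd.get("name") != "load" or not cmd.text:
            continue
        name, _, version = cmd.text.strip().partition("/")
        modules[name] = version
    return modules


def diff_modules(mine, repo):
    """Return {name: (my_version, repo_version)} for entries that differ.

    None marks a module that is absent from one of the two lists.
    """
    names = sorted(set(mine) | set(repo))
    return {n: (mine.get(n), repo.get(n)) for n in names if mine.get(n) != repo.get(n)}


# Locally edited module list, as posted in this thread.
LOCAL = """<modules>
  <command name="load">craype-accel-host</command>
  <command name="load">craype/2.7.20</command>
  <command name="load">cray-mpich/8.1.28</command>
  <command name="load">cray-hdf5-parallel/1.12.2.9</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
  <command name="load">cray-parallel-netcdf/1.12.3.9</command>
  <command name="load">cmake/3.24.3</command>
</modules>"""

# Module list reported for a fresh clone of maint-2.1 in this thread.
REPO = """<modules>
  <command name="load">craype-accel-host</command>
  <command name="load">craype/2.7.20</command>
  <command name="load">cray-mpich/8.1.25</command>
  <command name="load">cray-hdf5-parallel/1.12.2.3</command>
  <command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
  <command name="load">cray-parallel-netcdf/1.12.3.3</command>
  <command name="load">cmake/3.24.3</command>
</modules>"""

differences = diff_modules(load_commands(LOCAL), load_commands(REPO))
for name, (mine, repo) in differences.items():
    print(f"{name}: local={mine!r} repo={repo!r}")
```

For the two lists above, this reports cray-hdf5-parallel, cray-mpich, and cray-parallel-netcdf as the entries that differ. To compare real files, the fragments could be replaced with the relevant `<modules>` block from each config_machines.xml.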

Note I just merged the following PR to maint-2.1, but it should have no impact here as there are no needed module version changes.

https://github.com/E3SM-Project/E3SM/pull/6158

ndkeen avatar Jan 19 '24 18:01 ndkeen