E3SM
module changes after perlmutter downtime - maint-2.1
I am having trouble submitting jobs on Perlmutter (pm) using maint-2.1. The problem started today, right after yesterday's machine downtime.
I followed the error messages to load upgraded modules, but I can't figure out how to get past this error:
v21.LR.BSMYLE.1995-11.001/case_scripts.014> module --ignore-cache load "cray-netcdf-hdf5parallel/4.9.0.7"
Lmod has detected the following error: The following module(s) are unknown: "cray-netcdf-hdf5parallel/4.9.0.7"
Here are my env_mach_specific.xml settings:
<command name="load">craype</command>
<command name="load">cray-libsci</command>
<command name="load">cray-mpich/8.1.28</command>
<command name="load">cray-hdf5-parallel/1.12.2.9</command>
<command name="load">cray-netcdf-hdf5parallel/4.9.0.7</command>
<command name="load">cray-parallel-netcdf/1.12.3.9</command>
<command name="load">cmake/3.22.0</command>
I also added this directly to my env_mach_specific.xml file:
Thanks in advance for any ideas!
OK, looks like I need to update the branches. E3SM master does have module versions that will work if you want to copy those for now.
Thanks! Can you point me in the right direction on where to find a list of the working module versions?
Thank you!
Actually, it looks like that branch already had updated modules. I think you just have not pulled recently enough.
With a fresh clone of maint-2.1, you should see:
<modules>
<command name="load">craype-accel-host</command>
<command name="load">craype/2.7.20</command>
<command name="load">cray-mpich/8.1.25</command>
<command name="load">cray-hdf5-parallel/1.12.2.3</command>
<command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
<command name="load">cray-parallel-netcdf/1.12.3.3</command>
<command name="load">cmake/3.24.3</command>
</modules>
I'm still going to make a change to this maint branch and others to update PE layouts.
I had to make these module updates in order to do a case.setup:
<modules>
<command name="load">craype-accel-host</command>
<command name="load">craype/2.7.20</command>
<command name="load">cray-mpich/8.1.28</command>
<command name="load">cray-hdf5-parallel/1.12.2.9</command>
<command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
<command name="load">cray-parallel-netcdf/1.12.3.9</command>
<command name="load">cmake/3.24.3</command>
</modules>
But I'm still getting this error:
ERROR: module command /usr/share/lmod/lmod/libexec/lmod python load craype-accel-host craype/2.7.20 cray-mpich/8.1.28 cray-hdf5-parallel/1.12.2.9 cray-netcdf-hdf5parallel/4.9.0.3 cray-parallel-netcdf/1.12.3.9 cmake/3.24.3 failed with message:
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "cray-netcdf-hdf5parallel/4.9.0.3"
Try: "module spider cray-netcdf-hdf5parallel/4.9.0.3" to see how to load the module(s).
v21.LR.BSMYLE.1995-11.001/case_scripts.014> module avail cray-netcdf-hdf5parallel
No module(s) or extension(s) found!
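When the same failure shows up in several cases, it can help to pull the rejected module names out of the Lmod message instead of reading each log by eye. The following is only a rough sketch: the function name `failed_modules` and the two marker phrases are assumptions, taken from the wording of the two error messages quoted in this thread, and other Lmod versions may phrase things differently.

```python
import re

# Phrases that precede the rejected module names in the two Lmod errors
# quoted in this thread (assumed; other Lmod versions may word them differently).
_MARKERS = ("are unknown:", "cannot be loaded as requested:")


def failed_modules(message):
    """Return the quoted module names that Lmod reports as unloadable."""
    for marker in _MARKERS:
        _, found, rest = message.partition(marker)
        if found:
            # Ignore the 'Try: "module spider ..."' hint that follows the names.
            names = rest.split("Try:", 1)[0]
            return re.findall(r'"([^"]+)"', names)
    return []


message = (
    "Lmod has detected the following error: These module(s) or extension(s) "
    'exist but cannot be loaded as requested: "cray-netcdf-hdf5parallel/4.9.0.3" '
    'Try: "module spider cray-netcdf-hdf5parallel/4.9.0.3" to see how to load '
    "the module(s)."
)
print(failed_modules(message))
```

Each name it returns can then be passed to `module spider`, as the Lmod message itself suggests.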
You may have made other changes. Try checking out a fresh clone of maint-2.1 and building a test there. If that works, the problem is just a difference between your config_machines.xml and the one in the repo.
Note that I just merged the following PR to maint-2.1, but it should have no impact here, since no module version changes are needed.
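One way to spot such differences without eyeballing the XML is to compare the `<modules>` blocks directly. This is a minimal sketch, assuming each block is pasted in as a standalone `<modules>` fragment like the two posted in this thread. The helper names `load_commands` and `diff_modules` are made up for illustration and are not part of CIME or E3SM.

```python
import xml.etree.ElementTree as ET


def load_commands(modules_xml):
    """Map module name -> version ("" if unversioned) for each load command."""
    loads = {}
    for cmd in ET.fromstring(modules_xml).iter("command"):
        if cmd.get("name") != "load" or not cmd.text:
            continue
        name, _, version = cmd.text.strip().partition("/")
        loads[name] = version
    return loads


def diff_modules(repo_xml, local_xml):
    """List (module, repo_version, local_version) where the blocks disagree.

    A version of None means the module is absent from that block.
    """
    repo, local = load_commands(repo_xml), load_commands(local_xml)
    return [
        (name, repo.get(name), local.get(name))
        for name in sorted(set(repo) | set(local))
        if repo.get(name) != local.get(name)
    ]


# The two module lists as posted in this thread.
REPO = """<modules>
<command name="load">craype-accel-host</command>
<command name="load">craype/2.7.20</command>
<command name="load">cray-mpich/8.1.25</command>
<command name="load">cray-hdf5-parallel/1.12.2.3</command>
<command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
<command name="load">cray-parallel-netcdf/1.12.3.3</command>
<command name="load">cmake/3.24.3</command>
</modules>"""

LOCAL = """<modules>
<command name="load">craype-accel-host</command>
<command name="load">craype/2.7.20</command>
<command name="load">cray-mpich/8.1.28</command>
<command name="load">cray-hdf5-parallel/1.12.2.9</command>
<command name="load">cray-netcdf-hdf5parallel/4.9.0.3</command>
<command name="load">cray-parallel-netcdf/1.12.3.9</command>
<command name="load">cmake/3.24.3</command>
</modules>"""

for name, repo_v, local_v in diff_modules(REPO, LOCAL):
    print(f"{name}: repo={repo_v} local={local_v}")
```

For the two lists above, this flags cray-mpich, cray-hdf5-parallel, and cray-parallel-netcdf as the entries that differ from the repo. It compares only the text of the load commands, so it says nothing about whether a given version is actually installed on the machine.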
https://github.com/E3SM-Project/E3SM/pull/6158