docs-csm
CASMTRIAGE-7099-1.6 port of CSM 1.4 full system power off/on procedures to CSM 1.6
Description
Several smaller changes are included in the branch from [CASMTRIAGE-7027]:
- CASMTRIAGE-7028: Corrected the SDU collection directory mountpoint.
- CASMTRIAGE-7029: Adjusted the preparation step that checks HSN status so that it is a single command line.
- CASMTRIAGE-7033: Filtered the sat status command to remove nodes in the Off state.
- CASMTRIAGE-7032: Added power down steps to disable PBS queues. Added a reminder to enable Slurm and PBS queues during power up.
- CASMTRIAGE-7031: Added a preparation step to warn users and operations staff of the impending system power down.
- CASMTRIAGE-7037: Updated environment variables so they can be copied and pasted more easily into a second window, and adjusted ncn-shutdown-timeout to 1200.
- CASMTRIAGE-7051: Improved the sat checks after booting compute and application nodes to reduce the amount of output from nodes that already have the desired settings.
- CASMTRIAGE-7036: Added checks for current etcd backups and alarms to the power down of the Kubernetes cluster.
- CASMTRIAGE-7039: Install the psmisc rpm to provide tools for finding processes that prevent filesystem unmounting.
- CASMTRIAGE-7046: Improved the power up checks for Kubernetes: check uptime on all NCNs, move the spire-jwks remediation earlier since it is always needed, and note that cray-cps-cm-pm pods will be in error until CFS runs.
- CASMTRIAGE-7049: Confirm after the SAT command that all Chassis are On.
- CASMTRIAGE-7034: Added sat status as the first option to check power status and kept the existing cray power option.
- CASMTRIAGE-7042: Unload the DVS and LNet kernel modules from worker nodes for shutdown.
- CASMTRIAGE-7038: For worker nodes: stop UAIs, unmount Lustre, and unmount DVS-mounted CPE.
- CASMTRIAGE-7047: Added hints for when to work on off/on for external file system servers in parallel with other off/on activities, and included checks that external file systems are ready before trying to use them.
- CASMTRIAGE-7048: Added a workflow placeholder for the site procedure to quiesce SpectrumScale GPFS on quorum nodes and unmount it on clients.
- CASMTRIAGE-7052: Added the Slingshot 2.1.1 SSH permission fix, reinstallation of the fmn-debug rpm, and a troubleshooting procedure for when some Slingshot switches are offline.
- CASMTRIAGE-6615: Check for potential expiration of the Spire Intermediate CA certificate and the Kubernetes and bare metal etcd certificates during system power down preparation.
- CASMTRIAGE-6614: Check for a recent Nexus backup when preparing for full system power down.
Checklist
- [x] If I added any command snippets, the steps they belong to follow the prompt conventions (see example).
- [ ] If I added a new directory, I also updated `.github/CODEOWNERS` with the corresponding team in Cray-HPE.
- [x] My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.