treydock

Results: 151 comments by treydock

The namespace in the template isn't necessary: when you install with Helm you can pass the `--namespace` flag, which sets the namespace. Having namespace be something in `values.yaml`...

@cible Is this ready for review? The pull request is currently marked as a draft.

@gmenuel Please address the merge conflicts.

One thing that could probably be improved is making the path to `mmhealth` configurable rather than hardcoding it.

Made the path to `mmhealth` configurable and updated the README.

We noticed that something in GPFS can make `mmhealth` unreliable, but `mmfsadm test verbs status` is another way to test that GPFS is actually using RDMA and not...
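A minimal sketch of how such a check could be wrapped. The function name `verbs_rdma_started` is hypothetical, and the status line it matches (`VERBS RDMA status: started`) is an assumption about the command's output that should be confirmed on your GPFS version.

```shell
# Hypothetical helper: reads status text on stdin and succeeds only if it
# reports RDMA as started.
# Assumption: the command prints a line like "VERBS RDMA status: started".
verbs_rdma_started() {
    grep -q 'VERBS RDMA status: started'
}

# Possible usage on a GPFS node (not executed here):
#   mmfsadm test verbs status | verbs_rdma_started || echo "GPFS is not using RDMA"
```

Reading from stdin keeps the parsing separate from the call to `mmfsadm`, so the match can be tried against captured output before it goes into a node health check.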

Just in case someone comes across this: the checks we're now using basically mark a node offline if a kmalloc item in slabinfo is at >5GB based on active object count...
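A rough sketch of what a check along these lines might look like, not the actual check we deployed. The function name `kmalloc_over_limit` is hypothetical, the 5 GiB default is one reading of ">5GB", and size is estimated as active objects times object size from the standard `/proc/slabinfo` columns.

```shell
# Hypothetical helper: list kmalloc-* slab caches whose estimated size
# (active_objs * objsize) exceeds a limit. Returns 0 if any cache is over.
# Usage: kmalloc_over_limit [slabinfo_file] [limit_bytes]
kmalloc_over_limit() {
    local slabinfo="${1:-/proc/slabinfo}"
    local limit="${2:-5368709120}"  # assumed threshold: 5 GiB in bytes
    # slabinfo columns: name active_objs num_objs objsize ...
    awk -v limit="$limit" '
        $1 ~ /^kmalloc-/ {
            bytes = $2 * $4
            if (bytes > limit) { printf "%s %.0f\n", $1, bytes; found = 1 }
        }
        END { exit(found ? 0 : 1) }
    ' "$slabinfo"
}
```

A health check could call `kmalloc_over_limit` and take the node offline when it returns 0. Reading `/proc/slabinfo` normally requires root.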

Ya, 1.4.4 sounds fine. We have the check deployed locally, and that gives us a bit of time to run this in production and refine it if needed.

@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom `reason=` to scontrol during the reboot. We...

Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:

```
scontrol reboot ASAP nextstate=down
```

The default of `nextstate=resume` prevents this extra reboot logic...