Steve Brasier issues

Results: 162 issues of Steve Brasier

Add blackbox probing of slurm endpoints

Add probing of slurmctld and slurmd ports using the prometheus blackbox exporter. Will allow monitoring/alerting on slurm{ctld,d} being up/reachable. "Default" (= `everything` layout) behaviour is to put the blackbox exporter...
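As a rough illustration of what a reachability probe of this kind checks, the sketch below opens a TCP connection to each daemon port and reports whether it succeeded. It is not the blackbox exporter's implementation, and the hostnames in the usage note are hypothetical. The port numbers are the Slurm defaults (6817 for slurmctld, 6818 for slurmd) and are an assumption here, since a site may override them in `slurm.conf`.

```python
import socket

# Default Slurm ports (assumption: not overridden in slurm.conf).
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe_cluster(control_host: str, compute_hosts: list, timeout: float = 2.0) -> dict:
    """Probe slurmctld on the control host and slurmd on each compute host."""
    results = {
        f"{control_host}:{SLURMCTLD_PORT}": tcp_probe(control_host, SLURMCTLD_PORT, timeout)
    }
    for host in compute_hosts:
        results[f"{host}:{SLURMD_PORT}"] = tcp_probe(host, SLURMD_PORT, timeout)
    return results
```

For example, `probe_cluster("control-0", ["compute-0", "compute-1"])` (hypothetical hostnames) returns a dict mapping each `host:port` target to `True` or `False`, which is roughly the up/reachable signal that alerting would be built on.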

Add hardening functionality

This uses the openstack [ansible-hardening](https://github.com/openstack/ansible-hardening) role to apply STIG security configurations to hosts in the `hardening` group. The `environments/common/layouts/everything` template used by cookiecutter puts the `login` group into the `hardening`...

Make demo deployments easier without support

- Changes the README to provide a complete quickstart guide
- Adds an initial configuration guide
- Uses the skeleton TF to deploy a complete working group_vars in the environment,...

Show info for running jobs in monitoring

Ticket: https://stackhpc.atlassian.net/browse/DEV-1018 Could use `https://slurm.schedmd.com/sstat.html` output?
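If `sstat` output were used, one possible starting point is parsing its pipe-delimited form. The sketch below assumes output produced by something like `sstat --parsable2 --format=JobID,AveCPU,MaxRSS -j <jobid>`, which prints a header line followed by one `|`-separated row per job step. The function name and the sample values are illustrative only and are not taken from the appliance.

```python
def parse_sstat(text: str) -> list[dict[str, str]]:
    """Parse `sstat --parsable2` style output into a list of row dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    rows = []
    for ln in lines[1:]:
        fields = ln.split("|")
        if len(fields) != len(header):
            # Skip malformed rows rather than indexing past the end.
            continue
        rows.append(dict(zip(header, fields)))
    return rows


# Illustrative sample only; real values depend on the running job.
sample = "JobID|AveCPU|MaxRSS\n1234.batch|00:01:12|204800K\n"
rows = parse_sstat(sample)
```

Each row is a plain dict keyed by the header names, so a monitoring component could pick out whichever fields it wants to display.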

enhancement

"slurm openstack tools" needs refactoring

Ticket: https://stackhpc.atlassian.net/browse/DEV-1022
- The names of the Ansible collection https://github.com/stackhpc/ansible_collection_slurm_openstack_tools (which installs as `slurm_openstack_tools`) and the Python package https://github.com/stackhpc/slurm-openstack-tools are massively confusing
- The Ansible `tests` role has been superseded...

maintenance

OOD shell prompts to accept hostkeys

Ticket: https://stackhpc.atlassian.net/browse/DEV-976 Using the browser OOD shell prompts to accept hostkeys (presumably it's the login node sshing into itself, but I didn't check). This is clunky but as the resulting...

enhancement

prometheus slurm exporter failing

Ticket: https://stackhpc.atlassian.net/browse/DEV-1021 The `prometheus-slurm-exporter` service is constantly failing/restarting. From syslog: ``` May 24 16:09:39 devrebuild-control prometheus-slurm-exporter[113346]: panic: runtime error: index out of range [4] with length 4 May 24 16:09:39 devrebuild-control...

bug

Update by default on "first deploy"

Ticket: https://stackhpc.atlassian.net/browse/DEV-1020 Currently yum updates are only done by default in `ansible/site.yml` when running in the packer build (see `update_enable` in `environments/common/inventory/group_vars/{all/update.yml,builder/defaults.yml}`). This is to ensure: 1. Running site.yml is...

enhancement

prometheus node exporter too chatty in syslog

Ticket: https://stackhpc.atlassian.net/browse/DEV-1019 Lots of messages here make it hard to debug issues. Could do with turning log level down or something.

enhancement

grafana docs reference https:

Ticket (broader than this): https://stackhpc.atlassian.net/browse/DEV-1016 Under https://github.com/stackhpc/ansible-slurm-appliance/blob/main/docs/monitoring-and-logging.README.md#access it specifies using https:// to log in to Grafana, but I believe we currently only support http://?

documentation