
Replace opendistro

sjpb opened this issue 3 years ago · 4 comments

Ticket: https://stackhpc.atlassian.net/browse/DEV-855

OpenDistro is EOL.

This PR:

  • [x] Replaces OpenDistro with OpenSearch.
  • [x] Updates filebeat to the newest supported version.
  • [x] Adds the version faking required for filebeat to connect to OpenSearch.
  • [x] Configures important opensearch settings for production use.
  • [x] Updates the Grafana version.
  • [x] Updates the opensearch Grafana datasource plugin definition.
  • [x] Removes the appliance's grafana-datasources role, as we can use Grafana's provisioning mode with the cloudalchemy.grafana role rather than requiring a customised API-based approach (see the sketch after this list).
  • [x] Adds a CI test that the expected jobs from the hpctests run are found via Grafana (NB: for slurm-stats this has to be 5 minutes after job completion, so it may add some delay).
  • [x] Changes the storage used for open* from a podman volume to a host directory to enable easier future upgrades/migration/backups.
  • [x] Adds a playbook ansible/adhoc/migrate-opendistro.yml to migrate opendistro data to opensearch (checked by upgrading a running cluster from main 7bcacb0)
  • [x] Uses a new "prebuilt" image in arcus with the updated Grafana version.
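For context on the provisioning-mode change above: Grafana's provisioning mode just reads datasource definitions from YAML files on disk, so a templated file can replace the API-based role entirely. A minimal sketch of what such a file might contain; the datasource name, credentials variable and jsonData fields here are illustrative, not copied from the appliance:

```yaml
# e.g. /etc/grafana/provisioning/datasources/ansible.yml (illustrative)
apiVersion: 1
datasources:
  - name: opensearch                          # hypothetical name
    type: grafana-opensearch-datasource
    access: proxy
    url: https://localhost:9200
    basicAuth: true
    basicAuthUser: admin
    secureJsonData:
      basicAuthPassword: "{{ vault_opensearch_admin_password }}"  # assumed variable
    jsonData:
      database: "filebeat-*"                  # index pattern written by filebeat
      timeField: "@timestamp"
      flavor: opensearch
      version: "2.0.0"                        # the plugin needs the server version
      tlsSkipVerify: true                     # self-signed certs; see the certs note below
```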

Once merged and passed:

  • [ ] Move appropriate image to release bucket

Closes #70.

sjpb · Jul 21 '22 16:07

It is leaving massive gaps between jobs: [screenshot]

Edit: fixed by 3a25308.

sjpb · Aug 09 '22 09:08

3a25308 fails when checking that the hpctests jobs exist in grafana/opensearch. It turns out that neither dashboards nor datasources have been provisioned in Grafana after rebuilding the control node with a packer-built image (although Grafana is running). Possibly this has always been broken, and was just not checked for until this PR.

Direct configuration:

```
2022-08-11T12:48:19.5059996Z TASK [cloudalchemy.grafana : Create/Update datasources file (provisioning)] ****
2022-08-11T12:48:19.5062754Z task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/roles/cloudalchemy.grafana/tasks/datasources.yml:26
2022-08-11T12:48:23.6044928Z NOTIFIED HANDLER cloudalchemy.grafana : restart grafana for ci2839370468-control
2022-08-11T12:48:23.6047794Z changed: [ci2839370468-control] => {
2022-08-11T12:48:23.6048876Z     "changed": true,
2022-08-11T12:48:23.6050009Z     "checksum": "c3246446b05c316d4a9dfde60badfa068fd3936c",
2022-08-11T12:48:23.6051407Z     "dest": "/etc/grafana/provisioning/datasources/ansible.yml",
2022-08-11T12:48:23.6052598Z     "gid": 979,
2022-08-11T12:48:23.6053560Z     "group": "grafana",
2022-08-11T12:48:23.6054671Z     "md5sum": "ed1ebc1e5e73ffca64f835dea1145202",
2022-08-11T12:48:23.6055784Z     "mode": "0640",
2022-08-11T12:48:23.6056768Z     "owner": "root",
2022-08-11T12:48:23.6057824Z     "secontext": "system_u:object_r:etc_t:s0",
2022-08-11T12:48:23.6058901Z     "size": 569,
2022-08-11T12:48:23.6060242Z     "src": "/var/lib/rocky/.ansible/tmp/ansible-tmp-1660222100.111777-5114-195439064785952/source",
2022-08-11T12:48:23.6061512Z     "state": "file",
2022-08-11T12:48:23.6062449Z     "uid": 0
2022-08-11T12:48:23.6063363Z }
```

Control image build:

```
2022-08-11T12:57:56.6329342Z     openstack.control: TASK [cloudalchemy.grafana : Create/Update datasources file (provisioning)] ****
2022-08-11T12:57:56.6339799Z     openstack.control: task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/roles/cloudalchemy.grafana/tasks/datasources.yml:26
2022-08-11T12:57:56.6530631Z     openstack.control: skipping: [default] => {
2022-08-11T12:57:56.6532331Z     openstack.control:     "changed": false,
2022-08-11T12:57:56.6533879Z     openstack.control:     "skip_reason": "Conditional result was False"
2022-08-11T12:57:56.6535269Z     openstack.control: }
2022-08-11T12:57:56.7564322Z     openstack.control:
```
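As a cheap guard against this class of regression, the checks run after a rebuild could assert that the provisioning file actually landed (the path is taken from the log above). A minimal sketch, assuming the control host group name:

```yaml
- hosts: control
  become: true
  tasks:
    - name: Check the provisioned datasources file exists
      ansible.builtin.stat:
        path: /etc/grafana/provisioning/datasources/ansible.yml
      register: _grafana_datasources

    - name: Fail early if Grafana provisioning was skipped
      ansible.builtin.assert:
        that: _grafana_datasources.stat.exists
        fail_msg: Grafana datasources were not provisioned - was the task skipped during image build?
```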

sjpb · Aug 15 '22 10:08

Various problems after rebuild of control node:

  • [x] opensearch container failed to start; it looks like this known issue: https://access.redhat.com/solutions/6904591
  • [x] grafana dashboards aren't provisioned in the image, presumably because this is done via the API and grafana is not up in the builder (check what the logic is / what actually happens)
  • [x] grafana admin password is not set after rebuild - it is at the default admin value despite correctly being in /etc/grafana/grafana.ini. This is the error which is currently failing CI (although the others would then cause failures too). See the workaround sketch after this list.
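For the password issue, one possible workaround is to force the admin password after a rebuild rather than relying on grafana.ini being picked up; a hedged sketch using grafana-cli, assuming the cloudalchemy role's grafana_security variable holds the desired credentials:

```yaml
- name: Reset the Grafana admin password to the configured value
  ansible.builtin.command: >
    grafana-cli admin reset-admin-password
    {{ grafana_security.admin_password }}
  no_log: true  # keep the password out of task output
```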

sjpb · Aug 15 '22 15:08

The image openhpc-220811-0842.qcow2 ([here](https://github.com/stackhpc/slurm_image_builder/pull/5/files)) installs Grafana 9.0.3, then the upgrade task bumps it to 9.0.7, then the pin to 9.0.3 at environments/common/inventory/group_vars/all/grafana.yml:grafana_version means the grafana role downgrades it again.
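In other words, the pin in the common environment always wins over whatever the image build installed, so the two must be kept in sync. The variable in question (value as described above):

```yaml
# environments/common/inventory/group_vars/all/grafana.yml
# NB: the packer-built image and this pin must agree, otherwise the
# cloudalchemy.grafana role will up/downgrade the package on every site run
grafana_version: 9.0.3
```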

sjpb · Aug 16 '22 06:08

A note on certs:

  • The version faking for filebeat requires a custom opensearch.yml config file to be mounted into the container.
  • As we want to add an admin user password we have to enable the security plugin, which means certs have to be configured.
  • The opensearch container will by default use hardcoded certs and configure opensearch.yml - in our case that means it modifies the one mounted in from the host.
  • This is problematic because the certs it uses are copied into the config directory (next to opensearch.yml) and are therefore lost when the container is deleted - which happens because we use podman run --rm, but would anyway happen on reimage (which is how I found the problem). When opensearch starts again it sees the certs listed in opensearch.yml and dies because they do not exist.
  • We cannot mount in the entire config directory as the one in the container has files which are required. We cannot mount it as an overlay as we need the :U flag on the volume which is not compatible with other options. We cannot mount the default certs to somewhere writeable on the host, as they don't exist on first startup.

The solution is to generate self-signed certs ourselves, mount them into the container, and add them to opensearch.yml ourselves (sketch below). This appears to work fine; a minor problem is that running monitoring.yml again changes the owner/group of some files which are mounted in, which kills the service. It restarts almost instantly and recovers, though.
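A minimal sketch of the generation step using community.crypto; the certs/ directory and file names are assumptions for illustration:

```yaml
- name: Generate a private key for the opensearch node certificate
  community.crypto.openssl_privatekey:
    path: /var/lib/opensearch/certs/node-key.pem

- name: Generate a certificate signing request
  community.crypto.openssl_csr:
    path: /var/lib/opensearch/certs/node.csr
    privatekey_path: /var/lib/opensearch/certs/node-key.pem
    common_name: "{{ inventory_hostname }}"

- name: Self-sign the node certificate
  community.crypto.x509_certificate:
    path: /var/lib/opensearch/certs/node.pem
    csr_path: /var/lib/opensearch/certs/node.csr
    privatekey_path: /var/lib/opensearch/certs/node-key.pem
    provider: selfsigned
    selfsigned_not_after: "+730d"  # the hardcoded 2yr life noted below
```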

Finally, note that this sets plugins.security.allow_default_init_securityindex: true to avoid having to wait for the opensearch container to finish initialising and then run securityadmin.sh. I note that the docs state:

The opensearch.yml file also contains the plugins.security.allow_default_init_securityindex property. When set to true, the security plugin uses default security settings if an attempt to create the security index fails when OpenSearch launches. Default security settings are stored in YAML files contained in the opensearch-project/security/config directory. By default, this setting is false.

I think the only potentially-problematic part of this is opensearch-project/security/config/internal_users.yml, but we create our own internal users anyway which will override this.
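For reference, the security-related fragment of the mounted opensearch.yml then looks roughly like this; the cert file names match the generation sketch above and are illustrative:

```yaml
# opensearch.yml fragment (paths are relative to the opensearch config dir)
plugins.security.ssl.transport.pemcert_filepath: certs/node.pem
plugins.security.ssl.transport.pemkey_filepath: certs/node-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: certs/node.pem
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: certs/node.pem
plugins.security.ssl.http.pemkey_filepath: certs/node-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: certs/node.pem
# use the bundled default security config rather than running securityadmin.sh
plugins.security.allow_default_init_securityindex: true
```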

sjpb · Dec 14 '22 16:12

Note the certs have a hardcoded 2-year lifetime.

sjpb · Dec 20 '22 09:12

@m-bull I tried using community.crypto.x509_certificate_info to extract the validity and delete the certs if necessary (roughly as sketched below), but as podman chowns everything in certs/, the Ansible loops/logic were getting really messy. I've put that on hold as it's not the best use of effort at the moment.
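For the record, the abandoned check looked roughly like the following (paths as in the earlier generation sketch); it is the interaction with podman's chown of certs/ that made composing this with the generation tasks messy, not the check itself:

```yaml
- name: Inspect the existing node certificate
  community.crypto.x509_certificate_info:
    path: /var/lib/opensearch/certs/node.pem
    valid_at:
      renewal: "+30d"  # will the cert still be valid in 30 days?
  register: _cert_info

- name: Remove a near-expiry certificate so a later task regenerates it
  ansible.builtin.file:
    path: /var/lib/opensearch/certs/node.pem
    state: absent
  when: not _cert_info.valid_at.renewal
```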

sjpb · Dec 21 '22 10:12

FIXED: that merge won't be right, as we need an image using the updated Grafana etc.

sjpb · Jan 10 '23 16:01