Testing the testbed on Debian 12
Preparing and Testing the Testbed for Debian Bookworm Compatibility
- Related issue: https://github.com/osism/issues/issues/1028 (provides context for this work)
- Dependency: https://github.com/osism/terraform-base/pull/56 (required changes for Debian Bookworm support)
This task involves updating and testing the testbed to ensure compatibility with Debian Bookworm. We'll be verifying the deployment of various services and components.
Deployment Status
- [x] Manager
Services
- [x] Helper Services
- [x] Kubernetes: succeeded, but only after running the script five times; failures from the earlier runs below (a readiness check for the API server is sketched after the log excerpts)
TASK [Deploy kubernetes-dashboard helm chart] **********************************
Thursday 11 July 2024 15:04:44 +0000 (0:00:03.383) 0:00:03.383 *********
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "command": "/usr/sbin/helm get values --output=yaml kubernetes-dashboard", "msg": "Failure when executing Helm command. Exited 1.\nstdout: \nstderr: Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF\n", "stderr": "Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF\n", "stderr_lines": ["Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF"], "stdout": "", "stdout_lines": []}
TASK [Upgrade the CAPI management cluster] *************************************
Monday 15 July 2024 08:47:17 +0000 (0:00:02.218) 0:00:11.585 ***********
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nexport EXP_CLUSTER_RESOURCE_SET=true\nexport CLUSTER_TOPOLOGY=true\nexport GOPROXY=off\n\nclusterctl upgrade apply --core cluster-api:v1.6.2 --bootstrap kubeadm:v1.6.2 --control-plane kubeadm:v1.6.2 --infrastructure openstack:v0.9.0;\n", "delta": "0:00:06.962777", "end": "2024-07-15 08:47:25.180257", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:47:18.217480", "stderr": "Error: failed to check Cluster API version: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://192.168.16.8:6443/apis/apiextensions.k8s.io/v1?timeout=30s\": dial tcp 192.168.16.8:6443: connect: connection refused", "stderr_lines": ["Error: failed to check Cluster API version: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://192.168.16.8:6443/apis/apiextensions.k8s.io/v1?timeout=30s\": dial tcp 192.168.16.8:6443: connect: connection refused"], "stdout": "", "stdout_lines": []}
TASK [Get capi-system namespace phase] *****************************************
Monday 15 July 2024 08:47:32 +0000 (0:00:02.014) 0:00:02.014 ***********
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\n\nkubectl get ns capi-system -o json --ignore-not-found=true | jq .status.phase -r\n", "delta": "0:00:00.138639", "end": "2024-07-15 08:47:32.818659", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:47:32.680020", "stderr": "The connection to the server 192.168.16.8:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 192.168.16.8:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
TASK [Add control-plane label to all hosts in group control] *******************
Monday 15 July 2024 08:59:15 +0000 (0:00:14.503) 0:06:49.987 ***********
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-0.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-0\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:03.191928", "end": "2024-07-15 08:59:18.837914", "item": "testbed-node-0.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:15.645986", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-1.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-1\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:10.652723", "end": "2024-07-15 08:59:29.782229", "item": "testbed-node-1.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:19.129506", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-2.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-2\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:03.195444", "end": "2024-07-15 08:59:33.270679", "item": "testbed-node-2.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:30.075235", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
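All of these failures are "connection refused" / "no route to host" against the kube-apiserver at 192.168.16.8:6443, so re-running the script before the API server is reachable again is pointless. A minimal hedged helper to wait for the endpoint before retrying 005-kubernetes.sh (host and port taken from the error messages above):

```bash
#!/usr/bin/env bash
# Wait until the kube-apiserver endpoint from the errors above accepts TCP
# connections, then 005-kubernetes.sh can be retried.
API_HOST=192.168.16.8
API_PORT=6443

for i in $(seq 1 60); do
    if timeout 5 bash -c ">/dev/tcp/${API_HOST}/${API_PORT}" 2>/dev/null; then
        echo "kube-apiserver accepts connections"
        exit 0
    fi
    echo "waiting for kube-apiserver (${i}/60)"
    sleep 10
done

echo "kube-apiserver did not become reachable" >&2
exit 1
```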
- [ ] Ceph Services
TASK [Create block VGs] ********************************************************
Tuesday 30 July 2024 16:42:47 +0000 (0:00:01.273) 0:00:28.183 **********
failed: [testbed-node-0.testbed.osism.xyz] (item={'data': 'osd-block-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8', 'data_vg': 'ceph-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8'}) => {"ansible_loop_var": "item", "changed": false, "item": {"data": "osd-block-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8", "data_vg": "ceph-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8"}, "msg": "Failed to find required executable \"vgs\" in paths: /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"}
failed: [testbed-node-0.testbed.osism.xyz] (item={'data': 'osd-block-6203bc46-d920-5918-addb-042be3529124', 'data_vg': 'ceph-6203bc46-d920-5918-addb-042be3529124'}) => {"ansible_loop_var": "item", "changed": false, "item": {"data": "osd-block-6203bc46-d920-5918-addb-042be3529124", "data_vg": "ceph-6203bc46-d920-5918-addb-042be3529124"}, "msg": "Failed to find required executable \"vgs\" in paths: /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"}
Should be fixed with https://github.com/osism/ansible-collection-commons/pull/687.
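A hedged spot check in the meantime (assumption: the failure simply means lvm2 is not installed on the Bookworm nodes, which is what the linked PR is expected to address):

```bash
# Verify the "vgs" binary exists on the storage nodes; if it is missing, the
# lvm2 package has not been installed on Debian 12.
for node in testbed-node-0 testbed-node-1 testbed-node-2; do
    ssh "dragon@${node}.testbed.osism.xyz" \
        'command -v vgs >/dev/null || echo "vgs missing on $(hostname), lvm2 probably not installed"'
done
```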
After the Ceph deployment on Debian 12, the cluster is in the following faulty state:
docker logs ceph-mgr-testbed-node-0 ->
2024-08-13T14:35:49.923+0000 7f530dd8e700 0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
2024-08-13T14:35:54.923+0000 7f530dd8e700 0 [dashboard INFO root] server: ssl=yes host=:: port=8443
2024-08-13T14:35:55.351+0000 7f530dd8e700 0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
2024-08-13T14:36:00.351+0000 7f530dd8e700 0 [dashboard INFO root] server: ssl=yes host=:: port=8443
2024-08-13T14:36:00.355+0000 7f530dd8e700 0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
::ffff:192.168.16.5 - - [13/Aug/2024:14:36:01] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.48.0"
2024-08-13T14:36:01.215+0000 7f52e7946700 0 [prometheus INFO cherrypy.access.139994101685160] ::ffff:192.168.16.5 - - [13/Aug/2024:14:36:01] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.48.0"
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph -s
cluster:
id: 11111111-1111-1111-1111-111111111111
health: HEALTH_WARN
1 MDSs report slow metadata IOs
Reduced data availability: 8 pgs inactive
1 slow ops, oldest one blocked for 33 sec, mon.testbed-node-2 has slow ops
OSD count 0 < osd_pool_default_size 2
services:
mon: 3 daemons, quorum testbed-node-0,testbed-node-2,testbed-node-1 (age 38s)
mgr: testbed-node-1(active, since 4h), standbys: testbed-node-0, testbed-node-2
mds: 1/1 daemons up, 1 standby
osd: 0 osds: 0 up, 0 in
data:
volumes: 1/1 healthy
pools: 8 pools, 8 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
8 unknown
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph fs status
cephfs - 0 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 creating testbed-node-1 10 13 12 0
POOL TYPE USED AVAIL
cephfs_metadata metadata 0 0
cephfs_data data 0 0
STANDBY MDS
testbed-node-0
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph mds stat
cephfs:1 {0=testbed-node-1=up:creating} 1 up:standby
On Debian 12, not even all of the Ceph containers come up properly:
dragon@testbed-node-2:~$ docker ps | grep -i ceph
a099c76babda nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mgr-testbed-node-2
a8f9dbb9728a nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mon-testbed-node-2
dragon@testbed-manager:/opt/configuration/scripts/deploy$ osism console testbed-node-1
Last login: Mon Aug 12 15:24:16 2024 from 192.168.16.5
dragon@testbed-node-1:~$ docker ps | grep -i ceph
df96c3c0a752 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/usr/bin/ceph-crash" 5 hours ago Up 5 hours ceph-crash-testbed-node-1
e7e957beb7e4 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mds-testbed-node-1
dfb6a68c7e95 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mgr-testbed-node-1
8baaf424ece6 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mon-testbed-node-1
dragon@testbed-manager:/opt/configuration/scripts/deploy$ osism console testbed-node-0
Last login: Tue Aug 13 14:41:59 2024 from 192.168.16.5
dragon@testbed-node-0:~$ docker ps | grep -i ceph
22d403caf1d4 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 32 seconds ago Up 23 seconds ceph-rgw-testbed-node-0-rgw0
afef27a20733 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/usr/bin/ceph-crash" 5 hours ago Up 5 hours ceph-crash-testbed-node-0
c7b680f376e0 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mds-testbed-node-0
93a1bce65f4b nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mgr-testbed-node-0
1c77ccf5c8c9 nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy "/opt/ceph-container…" 5 hours ago Up 5 hours ceph-mon-testbed-node-0
STILL ALIVE [task 'ceph-osd : wait for all osd to be up' is running] ***********
FAILED - RETRYING: [testbed-node-2.testbed.osism.xyz -> testbed-node-0.testbed.osism.xyz]: wait for all osd to be up (1 retries left).
fatal: [testbed-node-2.testbed.osism.xyz -> testbed-node-0.testbed.osism.xyz(192.168.16.10)]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["docker", "exec", "ceph-mon-testbed-node-0", "ceph", "--cluster", "ceph", "osd", "stat", "-f", "json"], "delta": "0:00:02.660850", "end": "2024-08-06 11:44:38.588419", "msg": "", "rc": 0, "start": "2024-08-06 11:44:35.927569", "stderr": "", "stderr_lines": [], "stdout": "\n{\"epoch\":25,\"num_osds\":0,\"num_up_osds\":0,\"osd_up_since\":0,\"num_in_osds\":0,\"osd_in_since\":0,\"num_remapped_pgs\":0}", "stdout_lines": ["", "{\"epoch\":25,\"num_osds\":0,\"num_up_osds\":0,\"osd_up_since\":0,\"num_in_osds\":0,\"osd_in_since\":0,\"num_remapped_pgs\":0}"]}
?!
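With zero OSDs ever registering, it may be worth checking on a storage node whether the LVM volumes from the "Create block VGs" task exist at all and what the cluster itself reports. A hedged diagnostic, using the node and container names from the output above:

```bash
# Do the OSD VGs/LVs from the "Create block VGs" task exist on the node?
ssh dragon@testbed-node-0.testbed.osism.xyz 'sudo vgs; sudo lvs'

# What does the cluster itself know about OSDs? (mon container name from above)
ssh dragon@testbed-node-0.testbed.osism.xyz \
    'docker exec ceph-mon-testbed-node-0 ceph osd tree'
```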
- [x] Infrastructure Services
TASK [opensearch : Create new log retention policy] ****************************
Thursday 11 July 2024 09:44:30 +0000 (0:00:03.097) 0:00:48.164 *********
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "changed": false, "content": "", "elapsed": 30, "msg": "Status code was -1 and not [201]: Connection failure: The read operation timed out", "redirected": false, "status": -1, "url": "https://api-int.testbed.osism.xyz:9200/_plugins/_ism/policies/retention"}
TASK [opensearch : Check if a log retention policy exists] *********************
Tuesday 16 July 2024 14:56:57 +0000 (0:01:58.436) 0:02:39.891 **********
[WARNING]: Failure using method (v2_runner_on_failed) in callback plugin
(<ansible.plugins.callback.ara_default.CallbackModule object at
0x7f4faff07e90>): '0242ac1f-6512-5500-9271-0000000001b0'
[WARNING]: Failure using method (v2_playbook_on_stats) in callback plugin
(<ansible.plugins.callback.ara_default.CallbackModule object at
0x7f4faff07e90>): 'NoneType' object is not subscriptable
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "changed": false, "connection": "close", "content": "{\"error\":{\"root_cause\":[{\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];\"}],\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];\"},\"status\":503}", "content_length": "271", "content_type": "application/json; charset=UTF-8", "elapsed": 0, "json": {"error": {"reason": "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];", "root_cause": [{"reason": "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];", "type": "cluster_block_exception"}], "type": "cluster_block_exception"}, "status": 503}, "msg": "Status code was 503 and not [200, 404]: HTTP Error 503: Service Unavailable", "redirected": false, "status": 503, "url": "https://api-int.testbed.osism.xyz:9200/_plugins/_ism/policies/retention"}
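The 503 with `cluster_block_exception ... state not recovered / initialized` means the OpenSearch cluster never finished forming before the role queried it. A hedged check against the same endpoint the role uses (add credentials if the security plugin enforces authentication):

```bash
# Cluster-level view behind the failing ISM policy call; endpoint taken from the
# task output above. Add -u <user>:<password> if the security plugin requires it.
curl -sk "https://api-int.testbed.osism.xyz:9200/_cluster/health?pretty"
curl -sk "https://api-int.testbed.osism.xyz:9200/_cat/nodes?v"
```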
TASK [mariadb : Check MariaDB service port liveness] ***************************
Monday 15 July 2024 14:35:42 +0000 (0:00:02.414) 0:00:45.499 ***********
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.11:3306"}
...ignoring
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.10:3306"}
...ignoring
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.12:3306"}
...ignoring
[...]
TASK [mariadb : Fail on existing but stopped cluster] **************************
Monday 15 July 2024 14:35:56 +0000 (0:00:02.001) 0:00:59.397 ***********
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
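The role's own error message points at the recovery path. A hedged sketch from the manager (assuming the kolla play is exposed as `mariadb_recovery` through the osism CLI, matching the message above):

```bash
# Recover the stopped Galera cluster, then re-check the port liveness probe.
osism apply mariadb_recovery

# The wait_for task above searches for the "MariaDB" banner on port 3306:
nc -w 5 192.168.16.10 3306 | strings | head -n 1
```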
- [ ] OpenStack Services
TASK [nova-cell : Check nova keyring file] *************************************
Thursday 11 July 2024 11:14:08 +0000 (0:00:01.698) 0:04:44.085 *********
fatal: [testbed-node-0.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
fatal: [testbed-node-1.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
fatal: [testbed-node-2.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
TASK [keystone : Creating keystone database] ***********************************
Monday 15 July 2024 14:45:16 +0000 (0:00:02.363) 0:01:37.158 ***********
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "mysql_db", "changed": false, "msg": "unable to find /var/lib/ansible/.my.cnf. Exception message: (2013, 'Lost connection to MySQL server during query')"}
- [ ] Rook Services
- [ ] Monitoring Services
RUNNING HANDLER [grafana : Waiting for grafana to start on first node] *********
Thursday 11 July 2024 12:22:38 +0000 (0:00:09.114) 0:02:29.330 *********
skipping: [testbed-node-1.testbed.osism.xyz]
skipping: [testbed-node-2.testbed.osism.xyz]
FAILED - RETRYING: [testbed-node-0.testbed.osism.xyz]: Waiting for grafana to start on first node (12 retries left).
[...]
STILL ALIVE [task 'grafana : Waiting for grafana to start on first node' is running] ***
FAILED - RETRYING: [testbed-node-0.testbed.osism.xyz]: Waiting for grafana to start on first node (1 retries left).
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "attempts": 12, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://192.168.16.10:3000/login"}
TASK [grafana : Wait for grafana application ready] ****************************
Thursday 11 July 2024 12:25:34 +0000 (0:00:28.014) 0:05:25.374 *********
FAILED - RETRYING: [testbed-node-1.testbed.osism.xyz]: Wait for grafana application ready (30 retries left).
[...]
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"action": "uri", "attempts": 30, "cache_control": "no-cache", "changed": false, "connection": "close", "content_length": "107", "content_type": "text/html", "elapsed": 0, "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable", "redirected": false, "status": 503, "url": "https://api-int.testbed.osism.xyz:3000/login"}
PLAY RECAP *********************************************************************
2024-07-11 12:27:38 | INFO | Play has been completed. There may now be a delay until all logs have been written.
2024-07-11 12:27:38 | INFO | Please wait and do not abort execution.
testbed-node-0.testbed.osism.xyz : ok=20 changed=3 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
testbed-node-1.testbed.osism.xyz : ok=15 changed=4 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0
testbed-node-2.testbed.osism.xyz : ok=15 changed=3 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
TASK [prometheus : Creating prometheus database user and setting permissions] ***
Monday 15 July 2024 14:55:41 +0000 (0:00:10.391) 0:02:30.204 ***********
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-0.testbed.osism.xyz) => {"action": "mysql_user", "ansible_loop_var": "item", "changed": false, "item": {"key": "0", "value": {"hosts": ["testbed-node-0.testbed.osism.xyz", "testbed-node-1.testbed.osism.xyz", "testbed-node-2.testbed.osism.xyz"]}}, "msg": "unable to connect to database, check login_user and login_******** are correct or /var/lib/ansible/.my.cnf has the credentials. Exception message: (2013, 'Lost connection to MySQL server during query')"}
NetBox - "Manage Ankh-Morpork location" fails:
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' netbox-netbox-1
+ [[ healthy == \h\e\a\l\t\h\y ]]
+ osism netbox import
2024-05-29 09:32:17 | INFO | Task 157a9974-9b47-4469-9b4a-95b5a544807a is running. Wait. No more output.
+ osism netbox init
2024-05-29 09:32:21 | INFO | Task ccb3a63d-a9db-4937-bfcb-c19c77c3bc55 was prepared for execution.
2024-05-29 09:32:21 | INFO | It takes a moment until task ccb3a63d-a9db-4937-bfcb-c19c77c3bc55 has been started and output is visible here.
PLAY [Wait for netbox service] *************************************************
TASK [Wait for netbox service] *************************************************
ok: [localhost]
PLAY [Manage sites and locations] **********************************************
TASK [Manage Discworld site] ***************************************************
changed: [localhost]
TASK [Manage Ankh-Morpork location] ********************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "\n<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n <title>Server Error</title>\n <link rel=\"stylesheet\" href=\"/static/netbox-light.css\" />\n <meta charset=\"UTF-8\">\n</head>\n\n<body>\n <div class=\"container-fluid\">\n <div class=\"row\">\n <div class=\"col col-md-6 offset-md-3\">\n <div class=\"card border-danger mt-5\">\n <h5 class=\"card-header\">\n <i class=\"mdi mdi-alert\"></i> Server Error\n </h5>\n <div class=\"card-body\">\n \n <p>\n There was a problem with your request. Please contact an administrator.\n </p>\n \n <hr />\n <p>\n The complete exception is provided below:\n </p>\n<pre class=\"block\"><strong><class 'dcim.models.sites.Site.MultipleObjectsReturned'></strong><br />\nget() returned more than one Site -- it returned 2!\n\nPython version: 3.10.6\nNetBox version: 3.4.8</pre>\n <p>\n If further assistance is required, please post to the <a href=\"https://github.com/netbox-community/netbox/discussions\">NetBox discussion forum</a> on GitHub.\n </p>\n <div class=\"text-end\">\n <a href=\"/\" class=\"btn btn-primary\">Home Page</a>\n </div>\n </div>\n </div>\n </div>\n </div>\n </div>\n</body>\n\n</html>\n"}
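The traceback says the task resolved its parent site to two objects (`get() returned more than one Site -- it returned 2!`). A hedged lookup of the duplicate via the NetBox REST API (the URL, token and the site name `Discworld` are assumptions based on the task names above):

```bash
# List sites matching the name used by the playbook; .count > 1 confirms the duplicate.
curl -sk -H "Authorization: Token ${NETBOX_TOKEN}" \
    "https://netbox.testbed.osism.xyz/api/dcim/sites/?name=Discworld" \
    | jq '{count: .count, ids: [.results[].id], slugs: [.results[].slug]}'
```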
The osism tool can't pull images:
TASK [service-images-pull : barbican | Pull images] ****************************
Wednesday 29 May 2024 09:50:39 +0000 (0:00:02.499) 0:00:13.337 *********
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! =>
msg: '[''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode
| bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python''
~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family'''
PLAY RECAP *********************************************************************
2024-05-29 09:50:43 | INFO | Play has been completed. There may now be a delay until all logs have been written.
2024-05-29 09:50:43 | INFO | Please wait and do not abort execution.
testbed-node-0.testbed.osism.xyz : ok=3 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
testbed-node-1.testbed.osism.xyz : ok=3 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
testbed-node-2.testbed.osism.xyz : ok=3 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
Wednesday 29 May 2024 09:50:43 +0000 (0:00:04.277) 0:00:17.615 *********
===============================================================================
service-images-pull : barbican | Pull images ---------------------------- 4.28s
Group hosts based on enabled services ----------------------------------- 3.68s
Group hosts based on Kolla action --------------------------------------- 3.53s
barbican : include_tasks ------------------------------------------------ 2.50s
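The `'dict object' has no attribute 'os_family'` error means the volume list is rendered while `ansible_facts` for the nodes is empty, i.e. the facts cache is stale or missing. A hedged first step (assumption: the facts play is available as `facts` on this manager; otherwise gather facts with an ad-hoc setup run):

```bash
# Refresh the facts cache for all nodes before retrying the image pull;
# without facts, ansible_facts.os_family is undefined and the volume list
# template above cannot be rendered.
osism apply facts
```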
drivetemp currently cannot be activated on the testbed, even though the integration tests pass:
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
Manager:
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ (( attempt_num++ == max_attempts ))
+ sleep 5
Deploy:
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
+ [[ healthy == \h\e\a\l\t\h\y ]]
+ wait_for_container_healthy 60 kolla-ansible
+ local max_attempts=60
+ local name=kolla-ansible
+ local attempt_num=1
In the end, I always run into the same issue:
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64"], "stdout": "", "stdout_lines": []}
It appears that Ansible does not correctly pick up the role we have cloned to:
/opt/src/osism/ansible-collection-services/ansible-collection-services/roles/hddtemp
even though the contents of that directory on the testbed machines are correct. This needs further investigation (a check of the kernel module itself is sketched below).
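Independent of the role path question, it is worth checking whether the Debian cloud kernel ships the module at all; a hedged check on one node:

```bash
# Is drivetemp available for the running (cloud) kernel at all?
ssh dragon@testbed-node-0.testbed.osism.xyz \
    'uname -r; find /lib/modules/$(uname -r) -name "drivetemp*"; modinfo drivetemp || true'
# If nothing is found, the module is simply not built for the cloud kernel and the
# hddtemp role has to skip it (or a non-cloud kernel image has to be installed).
```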
Still getting issues at the end of the deploy step ...
STILL ALIVE [task 'Wait until service is available' is running] ****************
FAILED - RETRYING: [localhost]: Wait until service is available (5 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (4 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (3 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (2 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 30, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "https://keycloak.testbed.osism.xyz/auth/"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
2024-06-12 11:15:04 | INFO | Play has been completed. There may now be a delay until all logs have been written.
localhost : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
2024-06-12 11:15:04 | INFO | Please wait and do not abort execution.
Wednesday 12 June 2024 11:15:04 +0000 (0:03:00.432) 0:03:03.504 ********
===============================================================================
Wait until service is available --------------------------------------- 180.43s
2024-06-12 11:15:04 | INFO | Task 9b6b792f-e06d-4085-a438-09e1020a3de2 (keycloak-oidc-client-config) was prepared for execution.
2024-06-12 11:15:04 | INFO | It takes a moment until task 9b6b792f-e06d-4085-a438-09e1020a3de2 (keycloak-oidc-client-config) has been started and output is visible here.
PLAY [Configure OIDC client for Keystone] **************************************
TASK [Wait until service is available] *****************************************
Wednesday 12 June 2024 11:15:10 +0000 (0:00:02.695) 0:00:02.695 ********
FAILED - RETRYING: [localhost]: Wait until service is available (30 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (29 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (28 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (27 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (26 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (25 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (24 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (23 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (22 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (21 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (20 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (19 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (18 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (17 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (16 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (15 retries left).
ok: [localhost]
TASK [Log in to Keycloak] ******************************************************
Wednesday 12 June 2024 11:16:47 +0000 (0:01:36.705) 0:01:39.401 ********
ok: [localhost]
TASK [Get available realms] ****************************************************
Wednesday 12 June 2024 11:16:56 +0000 (0:00:08.584) 0:01:47.985 ********
ok: [localhost]
TASK [Filter available realms] *************************************************
Wednesday 12 June 2024 11:17:00 +0000 (0:00:04.717) 0:01:52.702 ********
ok: [localhost]
TASK [Create target realm if it doesn't exist] *********************************
Wednesday 12 June 2024 11:17:02 +0000 (0:00:01.599) 0:01:54.303 ********
changed: [localhost]
TASK [Get available clients in realm] ******************************************
Wednesday 12 June 2024 11:17:08 +0000 (0:00:05.899) 0:02:00.202 ********
ok: [localhost]
TASK [Filter available clients in realm] ***************************************
Wednesday 12 June 2024 11:17:11 +0000 (0:00:03.727) 0:02:03.929 ********
ok: [localhost]
TASK [Create OIDC client configuration] ****************************************
Wednesday 12 June 2024 11:17:13 +0000 (0:00:01.724) 0:02:05.653 ********
changed: [localhost]
TASK [Get internal ID for client keystone] *************************************
Wednesday 12 June 2024 11:17:18 +0000 (0:00:04.691) 0:02:10.345 ********
ok: [localhost]
TASK [Filter internal ID for client keystone] **********************************
Wednesday 12 June 2024 11:17:22 +0000 (0:00:03.786) 0:02:14.132 ********
ok: [localhost]
TASK [Get available mappers for client] ****************************************
Wednesday 12 June 2024 11:17:23 +0000 (0:00:01.142) 0:02:15.275 ********
ok: [localhost]
TASK [Filter available mappers for client] *************************************
Wednesday 12 June 2024 11:17:27 +0000 (0:00:03.936) 0:02:19.211 ********
ok: [localhost]
TASK [Create mappers for client] ***********************************************
Wednesday 12 June 2024 11:17:28 +0000 (0:00:01.378) 0:02:20.590 ********
changed: [localhost] => (item=openstack-user-domain)
changed: [localhost] => (item=openstack-default-project)
TASK [Get available components in realm] ***************************************
Wednesday 12 June 2024 11:17:35 +0000 (0:00:06.404) 0:02:26.995 ********
ok: [localhost]
TASK [Filter available components in realm] ************************************
Wednesday 12 June 2024 11:17:38 +0000 (0:00:03.627) 0:02:30.623 ********
ok: [localhost]
TASK [Add privateKey and certificate to realm] *********************************
Wednesday 12 June 2024 11:17:39 +0000 (0:00:01.311) 0:02:31.935 ********
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
localhost : ok=15 changed=3 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
2024-06-12 11:17:42 | INFO | Play has been completed. There may now be a delay until all logs have been written.
2024-06-12 11:17:42 | INFO | Please wait and do not abort execution.
Wednesday 12 June 2024 11:17:42 +0000 (0:00:02.407) 0:02:34.342 ********
===============================================================================
Wait until service is available ---------------------------------------- 96.71s
Log in to Keycloak ------------------------------------------------------ 8.58s
Create mappers for client ----------------------------------------------- 6.40s
Create target realm if it doesn't exist --------------------------------- 5.90s
Get available realms ---------------------------------------------------- 4.72s
Create OIDC client configuration ---------------------------------------- 4.69s
Get available mappers for client ---------------------------------------- 3.94s
Get internal ID for client keystone ------------------------------------- 3.79s
Get available clients in realm ------------------------------------------ 3.73s
Get available components in realm --------------------------------------- 3.63s
Add privateKey and certificate to realm --------------------------------- 2.41s
Filter available clients in realm --------------------------------------- 1.72s
Filter available realms ------------------------------------------------- 1.60s
Filter available mappers for client ------------------------------------- 1.38s
Filter available components in realm ------------------------------------ 1.31s
Filter internal ID for client keystone ---------------------------------- 1.14s
make[1]: *** [Makefile:115: deploy] Error 2
make[1]: Leaving directory '/home/claris/osism/osism/nobel-testbed/testbed/terraform'
make: *** [Makefile:108: deploy] Error 2
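The failing "Add privateKey and certificate to realm" task is censored by `no_log`, so a hedged next step is to look at the Keycloak side on the manager directly (the container name is an assumption, hence the grep):

```bash
# Find the Keycloak container and scan its recent logs for certificate/key errors.
docker ps --format '{{.Names}}' | grep -i keycloak
docker logs --tail 200 "$(docker ps --format '{{.Names}}' | grep -i keycloak | head -n 1)" 2>&1 \
    | grep -iE 'error|certificate|privatekey|key'
```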
Health check for ceph-ansible after manager deployment failed:
[…]
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ (( attempt_num++ == max_attempts ))
+ sleep 5
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ (( attempt_num++ == max_attempts ))
+ return 1
make[1]: *** [Makefile:125: deploy-manager] Error 1
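The template error appears whenever the inspected container defines no healthcheck (or none has run yet), so `.State.Health` does not exist. A hedged guard for the probe used by the deploy scripts:

```bash
# Fall back to "none" instead of failing the template when the container has no
# healthcheck; the surrounding retry loop then keeps waiting as intended.
status=$(/usr/bin/docker inspect \
    -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' ceph-ansible)
[[ "${status}" == "healthy" ]]
```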
drivetemp currently cannot be activated on the testbed, even though the integration tests pass:
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
Got this error again. It is probably connected to this change: https://github.com/osism/testbed/commit/16a35ad37376125f107c5e842aa5492b8468cbc4 and to the call of deploy-manager.sh at terraform/Makefile:126.
401 error from the Nexus role:
TASK [osism.services.nexus : Deleting script create_repos_from_list] ***********
Monday 08 July 2024 14:08:40 +0000 (0:00:02.768) 0:00:41.744 ***********
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "connection": "close", "content_length": "0", "date": "Mon, 08 Jul 2024 14:08:41 GMT", "elapsed": 0, "msg": "Status code was 401 and not [204, 404]: HTTP Error 401: Unauthorized", "redirected": false, "server": "Nexus/3.69.0-02 (OSS)", "status": 401, "url": "https://nexus.testbed.osism.xyz/service/rest/v1/script/create_repos_from_list", "www_authenticate": "BASIC realm=\"Sonatype Nexus Repository Manager\"", "x_content_type_options": "nosniff"}
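A hedged check of whether the credentials the role uses are still accepted by that endpoint (the password below is a placeholder for whatever the nexus role was configured with):

```bash
# 401 here confirms a credential problem rather than a missing script resource.
curl -sk -o /dev/null -w '%{http_code}\n' \
    -u admin:CHANGEME \
    "https://nexus.testbed.osism.xyz/service/rest/v1/script/create_repos_from_list"
```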
The community.docker Galaxy collection artifact is offline:
https://galaxy.ansible.com/api/v3/plugin/ansible/content/published/collections/artifacts/community-docker-3.10.4.tar.gz
The file no longer exists on AWS.
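A hedged workaround until the artifact situation is sorted out: install a release whose tarball is still available instead of pinning the vanished 3.10.4 (the exact available version is an assumption):

```bash
# Verify the resolved version afterwards with `ansible-galaxy collection list`.
ansible-galaxy collection install 'community.docker:>=3.10.0,<4.0.0' --force
```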
Debian 12 now deploys successfully through all major stages:
- 001-helper-services.sh
- 005-kubernetes.sh
- 006-kubernetes-clusterapi.sh
- 100-ceph-services.sh
- 200-infrastructure-services.sh
- 300-openstack-services.sh
- 310-openstack-services-extended.sh
- 400-monitoring-services.sh
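For reference, a hedged sketch of running these stages in order from the manager (directory taken from the prompts above, script names as listed):

```bash
cd /opt/configuration/scripts/deploy
for stage in 001-helper-services.sh 005-kubernetes.sh 006-kubernetes-clusterapi.sh \
             100-ceph-services.sh 200-infrastructure-services.sh \
             300-openstack-services.sh 310-openstack-services-extended.sh \
             400-monitoring-services.sh; do
    echo ">>> ${stage}"
    bash "./${stage}" || { echo "stage ${stage} failed" >&2; exit 1; }
done
```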