
Testing testbed on Debian 12

Open · lindenb1 opened this issue · 8 comments

Preparing and Testing Testbed for Debian Bookworm Compatibility

  • Related issue: https://github.com/osism/issues/issues/1028 (provides context for this work)
  • Dependency: https://github.com/osism/terraform-base/pull/56 (required changes for Debian Bookworm support)

This task involves updating and testing the testbed to ensure compatibility with Debian Bookworm. We'll be verifying the deployment of various services and components.

Deployment Status

  • [x] Manager

Services

  • [x] Helper Services

  • [x] Kubernetes: successful only after multiple executions (5 runs) of the script; a connectivity check is sketched after the log excerpts below

TASK [Deploy kubernetes-dashboard helm chart] **********************************
Thursday 11 July 2024  15:04:44 +0000 (0:00:03.383)       0:00:03.383 ********* 
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "command": "/usr/sbin/helm get values --output=yaml kubernetes-dashboard", "msg": "Failure when executing Helm command. Exited 1.\nstdout: \nstderr: Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF\n", "stderr": "Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF\n", "stderr_lines": ["Error: query: failed to query with labels: Get \"https://192.168.16.8:6443/api/v1/namespaces/kubernetes-dashboard/secrets?labelSelector=name%3Dkubernetes-dashboard%2Cowner%3Dhelm\": dial tcp 192.168.16.8:6443: connect: connection refused - error from a previous attempt: unexpected EOF"], "stdout": "", "stdout_lines": []}
TASK [Upgrade the CAPI management cluster] *************************************
Monday 15 July 2024  08:47:17 +0000 (0:00:02.218)       0:00:11.585 *********** 
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nexport EXP_CLUSTER_RESOURCE_SET=true\nexport CLUSTER_TOPOLOGY=true\nexport GOPROXY=off\n\nclusterctl upgrade apply  --core cluster-api:v1.6.2  --bootstrap kubeadm:v1.6.2  --control-plane kubeadm:v1.6.2  --infrastructure openstack:v0.9.0;\n", "delta": "0:00:06.962777", "end": "2024-07-15 08:47:25.180257", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:47:18.217480", "stderr": "Error: failed to check Cluster API version: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://192.168.16.8:6443/apis/apiextensions.k8s.io/v1?timeout=30s\": dial tcp 192.168.16.8:6443: connect: connection refused", "stderr_lines": ["Error: failed to check Cluster API version: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://192.168.16.8:6443/apis/apiextensions.k8s.io/v1?timeout=30s\": dial tcp 192.168.16.8:6443: connect: connection refused"], "stdout": "", "stdout_lines": []}
TASK [Get capi-system namespace phase] *****************************************
Monday 15 July 2024  08:47:32 +0000 (0:00:02.014)       0:00:02.014 *********** 
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\n\nkubectl get ns capi-system -o json --ignore-not-found=true | jq .status.phase -r\n", "delta": "0:00:00.138639", "end": "2024-07-15 08:47:32.818659", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:47:32.680020", "stderr": "The connection to the server 192.168.16.8:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 192.168.16.8:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
TASK [Add control-plane label to all hosts in group control] *******************
Monday 15 July 2024  08:59:15 +0000 (0:00:14.503)       0:06:49.987 *********** 
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-0.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-0\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:03.191928", "end": "2024-07-15 08:59:18.837914", "item": "testbed-node-0.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:15.645986", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-1.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-1\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:10.652723", "end": "2024-07-15 08:59:29.782229", "item": "testbed-node-1.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:19.129506", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-2.testbed.osism.xyz) => {"ansible_loop_var": "item", "changed": false, "cmd": "set -o pipefail\n\nexport KUBECONFIG=/share/kubeconfig\nkubectl label node \"testbed-node-2\" node-role.osism.tech/control-plane=true\n", "delta": "0:00:03.195444", "end": "2024-07-15 08:59:33.270679", "item": "testbed-node-2.testbed.osism.xyz", "msg": "non-zero return code", "rc": 1, "start": "2024-07-15 08:59:30.075235", "stderr": "Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host", "stderr_lines": ["Unable to connect to the server: dial tcp 192.168.16.8:6443: connect: no route to host"], "stdout": "", "stdout_lines": []}
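All three failures point at the Kubernetes API endpoint 192.168.16.8:6443 being unreachable from the manager at that moment. A minimal connectivity check, as a sketch only, using the same kubeconfig path the tasks above export:

# Does the API endpoint answer at all?
nc -zv 192.168.16.8 6443 || echo "API endpoint not reachable"

# Query the apiserver health endpoint with the kubeconfig the plays use
export KUBECONFIG=/share/kubeconfig
kubectl get --raw /readyz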
  • [ ] Ceph Services
TASK [Create block VGs] ********************************************************
Tuesday 30 July 2024  16:42:47 +0000 (0:00:01.273)       0:00:28.183 ********** 
failed: [testbed-node-0.testbed.osism.xyz] (item={'data': 'osd-block-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8', 'data_vg': 'ceph-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8'}) => {"ansible_loop_var": "item", "changed": false, "item": {"data": "osd-block-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8", "data_vg": "ceph-9e1f0fc1-dcb5-5324-96ce-f669d42c37f8"}, "msg": "Failed to find required executable \"vgs\" in paths: /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"}
failed: [testbed-node-0.testbed.osism.xyz] (item={'data': 'osd-block-6203bc46-d920-5918-addb-042be3529124', 'data_vg': 'ceph-6203bc46-d920-5918-addb-042be3529124'}) => {"ansible_loop_var": "item", "changed": false, "item": {"data": "osd-block-6203bc46-d920-5918-addb-042be3529124", "data_vg": "ceph-6203bc46-d920-5918-addb-042be3529124"}, "msg": "Failed to find required executable \"vgs\" in paths: /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"}

Should be fixed with https://github.com/osism/ansible-collection-commons/pull/687.
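The missing vgs executable indicates that the LVM userland (lvm2) is not present on the Debian 12 nodes. A quick manual check on a node, as a sketch (the linked PR is the proper fix):

# Is the LVM userland installed?
command -v vgs || echo "vgs not found"

# On Debian the vgs binary is shipped by lvm2
sudo apt-get install -y lvm2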

After the Ceph deployment on Debian 12, the cluster is in the following faulty state:

docker logs ceph-mgr-testbed-node-0 ->

2024-08-13T14:35:49.923+0000 7f530dd8e700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
2024-08-13T14:35:54.923+0000 7f530dd8e700  0 [dashboard INFO root] server: ssl=yes host=:: port=8443
2024-08-13T14:35:55.351+0000 7f530dd8e700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
2024-08-13T14:36:00.351+0000 7f530dd8e700  0 [dashboard INFO root] server: ssl=yes host=:: port=8443
2024-08-13T14:36:00.355+0000 7f530dd8e700  0 [dashboard INFO root] Config not ready to serve, waiting: no certificate configured
::ffff:192.168.16.5 - - [13/Aug/2024:14:36:01] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.48.0"
2024-08-13T14:36:01.215+0000 7f52e7946700  0 [prometheus INFO cherrypy.access.139994101685160] ::ffff:192.168.16.5 - - [13/Aug/2024:14:36:01] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.48.0"
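The mgr dashboard loops on "no certificate configured". As a hedged manual workaround one could either let the dashboard generate a self-signed certificate or disable SSL for it entirely (testbed only); both are standard ceph CLI commands, run from a mon/mgr container shell like the ones below:

# Option 1: generate a self-signed certificate for the dashboard
ceph dashboard create-self-signed-cert

# Option 2: serve the dashboard without SSL, then restart the module
ceph config set mgr mgr/dashboard/ssl false
ceph mgr module disable dashboard
ceph mgr module enable dashboard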
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph -s
  cluster:
    id:     11111111-1111-1111-1111-111111111111
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 8 pgs inactive
            1 slow ops, oldest one blocked for 33 sec, mon.testbed-node-2 has slow ops
            OSD count 0 < osd_pool_default_size 2
 
  services:
    mon: 3 daemons, quorum testbed-node-0,testbed-node-2,testbed-node-1 (age 38s)
    mgr: testbed-node-1(active, since 4h), standbys: testbed-node-0, testbed-node-2
    mds: 1/1 daemons up, 1 standby
    osd: 0 osds: 0 up, 0 in
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 8 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             8 unknown
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph fs status

cephfs - 0 clients
======
RANK   STATE         MDS        ACTIVITY   DNS    INOS   DIRS   CAPS  
 0    creating  testbed-node-1              10     13     12      0   
      POOL         TYPE     USED  AVAIL  
cephfs_metadata  metadata     0      0   
  cephfs_data      data       0      0   
 STANDBY MDS    
testbed-node-0  
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
ceph-mds-testbed-node-0:
[root@testbed-node-0 /]# ceph mds stat

cephfs:1 {0=testbed-node-1=up:creating} 1 up:standby

On Debian 12 we don't even get all the containers up properly:

dragon@testbed-node-2:~$ docker ps | grep -i ceph
a099c76babda   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mgr-testbed-node-2
a8f9dbb9728a   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mon-testbed-node-2
dragon@testbed-manager:/opt/configuration/scripts/deploy$ osism console testbed-node-1
Last login: Mon Aug 12 15:24:16 2024 from 192.168.16.5
dragon@testbed-node-1:~$ docker ps | grep -i ceph
df96c3c0a752   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/usr/bin/ceph-crash"    5 hours ago         Up 5 hours                             ceph-crash-testbed-node-1
e7e957beb7e4   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mds-testbed-node-1
dfb6a68c7e95   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mgr-testbed-node-1
8baaf424ece6   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mon-testbed-node-1
dragon@testbed-manager:/opt/configuration/scripts/deploy$ osism console testbed-node-0
Last login: Tue Aug 13 14:41:59 2024 from 192.168.16.5
dragon@testbed-node-0:~$ docker ps | grep -i ceph
22d403caf1d4   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   32 seconds ago      Up 23 seconds                          ceph-rgw-testbed-node-0-rgw0
afef27a20733   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/usr/bin/ceph-crash"    5 hours ago         Up 5 hours                             ceph-crash-testbed-node-0
c7b680f376e0   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mds-testbed-node-0
93a1bce65f4b   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mgr-testbed-node-0
1c77ccf5c8c9   nexus.testbed.osism.xyz:8192/osism/ceph-daemon:quincy                         "/opt/ceph-container…"   5 hours ago         Up 5 hours                             ceph-mon-testbed-node-0
STILL ALIVE [task 'ceph-osd : wait for all osd to be up' is running] ***********
FAILED - RETRYING: [testbed-node-2.testbed.osism.xyz -> testbed-node-0.testbed.osism.xyz]: wait for all osd to be up (1 retries left).
fatal: [testbed-node-2.testbed.osism.xyz -> testbed-node-0.testbed.osism.xyz(192.168.16.10)]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["docker", "exec", "ceph-mon-testbed-node-0", "ceph", "--cluster", "ceph", "osd", "stat", "-f", "json"], "delta": "0:00:02.660850", "end": "2024-08-06 11:44:38.588419", "msg": "", "rc": 0, "start": "2024-08-06 11:44:35.927569", "stderr": "", "stderr_lines": [], "stdout": "\n{\"epoch\":25,\"num_osds\":0,\"num_up_osds\":0,\"osd_up_since\":0,\"num_in_osds\":0,\"osd_in_since\":0,\"num_remapped_pgs\":0}", "stdout_lines": ["", "{\"epoch\":25,\"num_osds\":0,\"num_up_osds\":0,\"osd_up_since\":0,\"num_in_osds\":0,\"osd_in_since\":0,\"num_remapped_pgs\":0}"]}

?!
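The osd stat output shows num_osds = 0, so no OSD was ever created; that also explains the unknown PGs and the MDS being stuck in up:creating. A few hedged checks on a storage node to see where OSD creation stopped:

# Were any OSD containers created at all?
docker ps -a | grep ceph-osd

# What has ceph-volume prepared? (run in a privileged ceph container on the node,
# assuming it has access to the host devices)
docker exec ceph-mon-testbed-node-0 ceph-volume lvm list

# Are the expected ceph-* volume groups present? (ties back to the failed "Create block VGs" task)
sudo vgs | grep ceph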

  • [x] Infrastructure Services
TASK [opensearch : Create new log retention policy] ****************************
Thursday 11 July 2024  09:44:30 +0000 (0:00:03.097)       0:00:48.164 ********* 
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "changed": false, "content": "", "elapsed": 30, "msg": "Status code was -1 and not [201]: Connection failure: The read operation timed out", "redirected": false, "status": -1, "url": "https://api-int.testbed.osism.xyz:9200/_plugins/_ism/policies/retention"}
TASK [opensearch : Check if a log retention policy exists] *********************
Tuesday 16 July 2024  14:56:57 +0000 (0:01:58.436)       0:02:39.891 ********** 
[WARNING]: Failure using method (v2_runner_on_failed) in callback plugin
(<ansible.plugins.callback.ara_default.CallbackModule object at
0x7f4faff07e90>): '0242ac1f-6512-5500-9271-0000000001b0'
[WARNING]: Failure using method (v2_playbook_on_stats) in callback plugin
(<ansible.plugins.callback.ara_default.CallbackModule object at
0x7f4faff07e90>): 'NoneType' object is not subscriptable
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "changed": false, "connection": "close", "content": "{\"error\":{\"root_cause\":[{\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];\"}],\"type\":\"cluster_block_exception\",\"reason\":\"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];\"},\"status\":503}", "content_length": "271", "content_type": "application/json; charset=UTF-8", "elapsed": 0, "json": {"error": {"reason": "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];", "root_cause": [{"reason": "blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];", "type": "cluster_block_exception"}], "type": "cluster_block_exception"}, "status": 503}, "msg": "Status code was 503 and not [200, 404]: HTTP Error 503: Service Unavailable", "redirected": false, "status": 503, "url": "https://api-int.testbed.osism.xyz:9200/_plugins/_ism/policies/retention"}
TASK [mariadb : Check MariaDB service port liveness] ***************************
Monday 15 July 2024  14:35:42 +0000 (0:00:02.414)       0:00:45.499 *********** 
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.11:3306"}
...ignoring
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.10:3306"}
...ignoring
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 10, "msg": "Timeout when waiting for search string MariaDB in 192.168.16.12:3306"}
...ignoring

[...]

TASK [mariadb : Fail on existing but stopped cluster] **************************
Monday 15 July 2024  14:35:56 +0000 (0:00:02.001)       0:00:59.397 *********** 
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "MariaDB cluster exists but is stopped. Please start it using kolla-ansible mariadb_recovery"}
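The MariaDB failures are consistent: the Galera cluster exists but is stopped, so every later database task times out. The error message itself names the fix; a hedged sketch of the recovery and a follow-up check (the osism wrapper command name is an assumption; the underlying kolla-ansible action is mariadb_recovery, as the message says):

# Recover the stopped Galera cluster (wrapper name assumed)
osism apply mariadb_recovery

# Afterwards, check the cluster size from any control node (kolla container name: mariadb)
docker exec -it mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"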
  • [ ] OpenStack Services
TASK [nova-cell : Check nova keyring file] *************************************
Thursday 11 July 2024  11:14:08 +0000 (0:00:01.698)       0:04:44.085 ********* 
fatal: [testbed-node-0.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
fatal: [testbed-node-1.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
fatal: [testbed-node-2.testbed.osism.xyz -> localhost]: FAILED! => {"msg": "No file was found when using first_found."}
TASK [keystone : Creating keystone database] ***********************************
Monday 15 July 2024  14:45:16 +0000 (0:00:02.363)       0:01:37.158 *********** 
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "mysql_db", "changed": false, "msg": "unable to find /var/lib/ansible/.my.cnf. Exception message: (2013, 'Lost connection to MySQL server during query')"}
  • [ ] Rook Services

  • [ ] Monitoring Services

RUNNING HANDLER [grafana : Waiting for grafana to start on first node] *********
Thursday 11 July 2024  12:22:38 +0000 (0:00:09.114)       0:02:29.330 ********* 
skipping: [testbed-node-1.testbed.osism.xyz]
skipping: [testbed-node-2.testbed.osism.xyz]
FAILED - RETRYING: [testbed-node-0.testbed.osism.xyz]: Waiting for grafana to start on first node (12 retries left).
[...]
STILL ALIVE [task 'grafana : Waiting for grafana to start on first node' is running] ***
FAILED - RETRYING: [testbed-node-0.testbed.osism.xyz]: Waiting for grafana to start on first node (1 retries left).
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"action": "uri", "attempts": 12, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://192.168.16.10:3000/login"}

TASK [grafana : Wait for grafana application ready] ****************************
Thursday 11 July 2024  12:25:34 +0000 (0:00:28.014)       0:05:25.374 ********* 
FAILED - RETRYING: [testbed-node-1.testbed.osism.xyz]: Wait for grafana application ready (30 retries left).
[...]
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"action": "uri", "attempts": 30, "cache_control": "no-cache", "changed": false, "connection": "close", "content_length": "107", "content_type": "text/html", "elapsed": 0, "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable", "redirected": false, "status": 503, "url": "https://api-int.testbed.osism.xyz:3000/login"}

PLAY RECAP *********************************************************************
2024-07-11 12:27:38 | INFO     | Play has been completed. There may now be a delay until all logs have been written.
2024-07-11 12:27:38 | INFO     | Please wait and do not abort execution.
testbed-node-0.testbed.osism.xyz : ok=20   changed=3    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0   
testbed-node-1.testbed.osism.xyz : ok=15   changed=4    unreachable=0    failed=1    skipped=5    rescued=0    ignored=0   
testbed-node-2.testbed.osism.xyz : ok=15   changed=3    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0   
TASK [prometheus : Creating prometheus database user and setting permissions] ***
Monday 15 July 2024  14:55:41 +0000 (0:00:10.391)       0:02:30.204 *********** 
failed: [testbed-manager.testbed.osism.xyz] (item=testbed-node-0.testbed.osism.xyz) => {"action": "mysql_user", "ansible_loop_var": "item", "changed": false, "item": {"key": "0", "value": {"hosts": ["testbed-node-0.testbed.osism.xyz", "testbed-node-1.testbed.osism.xyz", "testbed-node-2.testbed.osism.xyz"]}}, "msg": "unable to connect to database, check login_user and login_******** are correct or /var/lib/ansible/.my.cnf has the credentials. Exception message: (2013, 'Lost connection to MySQL server during query')"}
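Grafana never answers on port 3000 and the prometheus database user cannot be created, both of which look like downstream effects of the MariaDB problems above. A hedged manual check on the first node:

# Is anything listening on the Grafana port, and what does the container log say?
ss -tlnp | grep 3000
docker logs --tail 50 grafana

# Probe the same URL the handler polls
curl -sf http://192.168.16.10:3000/login -o /dev/null && echo up || echo down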

lindenb1 commented on May 28, 2024

Netbox - Manage Ankh-Morpork location:


++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' netbox-netbox-1
+ [[ healthy == \h\e\a\l\t\h\y ]]
+ osism netbox import
2024-05-29 09:32:17 | INFO     | Task 157a9974-9b47-4469-9b4a-95b5a544807a is running. Wait. No more output.
+ osism netbox init
2024-05-29 09:32:21 | INFO     | Task ccb3a63d-a9db-4937-bfcb-c19c77c3bc55 was prepared for execution.
2024-05-29 09:32:21 | INFO     | It takes a moment until task ccb3a63d-a9db-4937-bfcb-c19c77c3bc55 has been started and output is visible here.

PLAY [Wait for netbox service] *************************************************

TASK [Wait for netbox service] *************************************************
ok: [localhost]

PLAY [Manage sites and locations] **********************************************

TASK [Manage Discworld site] ***************************************************
changed: [localhost]

TASK [Manage Ankh-Morpork location] ********************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "\n<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n    <title>Server Error</title>\n    <link rel=\"stylesheet\" href=\"/static/netbox-light.css\" />\n    <meta charset=\"UTF-8\">\n</head>\n\n<body>\n    <div class=\"container-fluid\">\n        <div class=\"row\">\n            <div class=\"col col-md-6 offset-md-3\">\n                <div class=\"card border-danger mt-5\">\n                    <h5 class=\"card-header\">\n                        <i class=\"mdi mdi-alert\"></i> Server Error\n                    </h5>\n                    <div class=\"card-body\">\n                        \n                            <p>\n                                There was a problem with your request. Please contact an administrator.\n                            </p>\n                        \n                        <hr />\n                        <p>\n                            The complete exception is provided below:\n                        </p>\n<pre class=\"block\"><strong>&lt;class &#x27;dcim.models.sites.Site.MultipleObjectsReturned&#x27;&gt;</strong><br />\nget() returned more than one Site -- it returned 2!\n\nPython version: 3.10.6\nNetBox version: 3.4.8</pre>\n                        <p>\n                            If further assistance is required, please post to the <a href=\"https://github.com/netbox-community/netbox/discussions\">NetBox discussion forum</a> on GitHub.\n                        </p>\n                        <div class=\"text-end\">\n                            <a href=\"/\" class=\"btn btn-primary\">Home Page</a>\n                        </div>\n                    </div>\n                </div>\n            </div>\n        </div>\n    </div>\n</body>\n\n</html>\n"}
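The traceback says get() returned two Site objects, i.e. the "Discworld" site exists twice, most likely from the repeated import/init runs. A hedged way to confirm the duplicate via the NetBox REST API (host placeholder and token variable are assumptions):

# Two results for the same name confirm the duplicate site
curl -s -H "Authorization: Token $NETBOX_TOKEN" \
  "https://<netbox-host>/api/dcim/sites/?name=Discworld" | jq '.count, .results[].id'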

The OSISM tooling can't pull images:

TASK [service-images-pull : barbican | Pull images] ****************************
Wednesday 29 May 2024  09:50:39 +0000 (0:00:02.499)       0:00:13.337 ********* 
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => 
  msg: '[''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode
    | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python'' ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family''. [''{{ node_config_directory }}/barbican-api/:{{ container_config_directory }}/:ro'', ''/etc/localtime:/etc/localtime:ro'', "{{ ''/etc/timezone:/etc/timezone:ro'' if ansible_facts.os_family == ''Debian'' else '''' }}", ''barbican:/var/lib/barbican/'', ''kolla_logs:/var/log/kolla/'', "{{ kolla_dev_repos_directory ~ ''/barbican/barbican:/var/lib/kolla/venv/lib/python''
    ~ distro_python_version ~ ''/site-packages/barbican'' if barbican_dev_mode | bool else '''' }}"]: ''dict object'' has no attribute ''os_family''. ''dict object'' has no attribute ''os_family'''


PLAY RECAP *********************************************************************
2024-05-29 09:50:43 | INFO     | Play has been completed. There may now be a delay until all logs have been written.
2024-05-29 09:50:43 | INFO     | Please wait and do not abort execution.
testbed-node-0.testbed.osism.xyz : ok=3    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
testbed-node-1.testbed.osism.xyz : ok=3    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
testbed-node-2.testbed.osism.xyz : ok=3    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

Wednesday 29 May 2024  09:50:43 +0000 (0:00:04.277)       0:00:17.615 ********* 
=============================================================================== 
service-images-pull : barbican | Pull images ---------------------------- 4.28s
Group hosts based on enabled services ----------------------------------- 3.68s
Group hosts based on Kolla action --------------------------------------- 3.53s
barbican : include_tasks ------------------------------------------------ 2.50s
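The 'dict object' has no attribute 'os_family' error means the ansible_facts of the target hosts were empty when the barbican volume list was templated, i.e. facts had not been gathered (or the fact cache was stale) before the pull play ran. A hedged ad-hoc check from the manager (inventory path is a placeholder):

# Verify that facts can be gathered for a node and contain os_family
ansible -i <inventory> testbed-node-0.testbed.osism.xyz -m setup -a 'filter=ansible_os_family'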

lindenb1 commented on May 29, 2024

drivetemp currently cannot be activated on the testbed, even though the integration tests are working fine:

fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
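The Debian cloud kernel (6.1.0-18-cloud-amd64) simply does not ship the drivetemp module, so modprobe cannot succeed. A quick hedged check on a node:

# Does the running kernel provide drivetemp at all?
modinfo drivetemp || echo "drivetemp not available for $(uname -r)"

# The full (non-cloud) Debian kernel most likely ships it; otherwise the task
# has to be skipped on cloud kernels.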

lindenb1 commented on Jun 5, 2024

Manager:

++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ ((  attempt_num++ == max_attempts  ))
+ sleep 5

Deploy:

++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
+ [[ healthy == \h\e\a\l\t\h\y ]]
+ wait_for_container_healthy 60 kolla-ansible
+ local max_attempts=60
+ local name=kolla-ansible
+ local attempt_num=1

But in the end I always run into the same issue:

fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-21-cloud-amd64"], "stdout": "", "stdout_lines": []}

It appears that Ansible does not correctly run the task from the role we have cloned to:

/opt/src/osism/ansible-collection-services/ansible-collection-services/roles/hddtemp

even though the contents of that folder on the testbed machines are correct. We still have to investigate this.

lindenb1 commented on Jun 6, 2024

Still getting issues at the end of the deploy step ...

STILL ALIVE [task 'Wait until service is available' is running] ****************
FAILED - RETRYING: [localhost]: Wait until service is available (5 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (4 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (3 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (2 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 30, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "https://keycloak.testbed.osism.xyz/auth/"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
2024-06-12 11:15:04 | INFO     | Play has been completed. There may now be a delay until all logs have been written.
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   
2024-06-12 11:15:04 | INFO     | Please wait and do not abort execution.

Wednesday 12 June 2024  11:15:04 +0000 (0:03:00.432)       0:03:03.504 ******** 
=============================================================================== 
Wait until service is available --------------------------------------- 180.43s
2024-06-12 11:15:04 | INFO     | Task 9b6b792f-e06d-4085-a438-09e1020a3de2 (keycloak-oidc-client-config) was prepared for execution.
2024-06-12 11:15:04 | INFO     | It takes a moment until task 9b6b792f-e06d-4085-a438-09e1020a3de2 (keycloak-oidc-client-config) has been started and output is visible here.

PLAY [Configure OIDC client for Keystone] **************************************

TASK [Wait until service is available] *****************************************
Wednesday 12 June 2024  11:15:10 +0000 (0:00:02.695)       0:00:02.695 ******** 
FAILED - RETRYING: [localhost]: Wait until service is available (30 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (29 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (28 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (27 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (26 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (25 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (24 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (23 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (22 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (21 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (20 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (19 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (18 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (17 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (16 retries left).
FAILED - RETRYING: [localhost]: Wait until service is available (15 retries left).
ok: [localhost]

TASK [Log in to Keycloak] ******************************************************
Wednesday 12 June 2024  11:16:47 +0000 (0:01:36.705)       0:01:39.401 ******** 
ok: [localhost]

TASK [Get available realms] ****************************************************
Wednesday 12 June 2024  11:16:56 +0000 (0:00:08.584)       0:01:47.985 ******** 
ok: [localhost]

TASK [Filter available realms] *************************************************
Wednesday 12 June 2024  11:17:00 +0000 (0:00:04.717)       0:01:52.702 ******** 
ok: [localhost]

TASK [Create target realm if it doesn't exist] *********************************
Wednesday 12 June 2024  11:17:02 +0000 (0:00:01.599)       0:01:54.303 ******** 
changed: [localhost]

TASK [Get available clients in realm] ******************************************
Wednesday 12 June 2024  11:17:08 +0000 (0:00:05.899)       0:02:00.202 ******** 
ok: [localhost]

TASK [Filter available clients in realm] ***************************************
Wednesday 12 June 2024  11:17:11 +0000 (0:00:03.727)       0:02:03.929 ******** 
ok: [localhost]

TASK [Create OIDC client configuration] ****************************************
Wednesday 12 June 2024  11:17:13 +0000 (0:00:01.724)       0:02:05.653 ******** 
changed: [localhost]

TASK [Get internal ID for client keystone] *************************************
Wednesday 12 June 2024  11:17:18 +0000 (0:00:04.691)       0:02:10.345 ******** 
ok: [localhost]

TASK [Filter internal ID for client keystone] **********************************
Wednesday 12 June 2024  11:17:22 +0000 (0:00:03.786)       0:02:14.132 ******** 
ok: [localhost]

TASK [Get available mappers for client] ****************************************
Wednesday 12 June 2024  11:17:23 +0000 (0:00:01.142)       0:02:15.275 ******** 
ok: [localhost]

TASK [Filter available mappers for client] *************************************
Wednesday 12 June 2024  11:17:27 +0000 (0:00:03.936)       0:02:19.211 ******** 
ok: [localhost]

TASK [Create mappers for client] ***********************************************
Wednesday 12 June 2024  11:17:28 +0000 (0:00:01.378)       0:02:20.590 ******** 
changed: [localhost] => (item=openstack-user-domain)
changed: [localhost] => (item=openstack-default-project)

TASK [Get available components in realm] ***************************************
Wednesday 12 June 2024  11:17:35 +0000 (0:00:06.404)       0:02:26.995 ******** 
ok: [localhost]

TASK [Filter available components in realm] ************************************
Wednesday 12 June 2024  11:17:38 +0000 (0:00:03.627)       0:02:30.623 ******** 
ok: [localhost]

TASK [Add privateKey and certificate to realm] *********************************
Wednesday 12 June 2024  11:17:39 +0000 (0:00:01.311)       0:02:31.935 ******** 
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost                  : ok=15   changed=3    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

2024-06-12 11:17:42 | INFO     | Play has been completed. There may now be a delay until all logs have been written.
2024-06-12 11:17:42 | INFO     | Please wait and do not abort execution.
Wednesday 12 June 2024  11:17:42 +0000 (0:00:02.407)       0:02:34.342 ******** 
=============================================================================== 
Wait until service is available ---------------------------------------- 96.71s
Log in to Keycloak ------------------------------------------------------ 8.58s
Create mappers for client ----------------------------------------------- 6.40s
Create target realm if it doesn't exist --------------------------------- 5.90s
Get available realms ---------------------------------------------------- 4.72s
Create OIDC client configuration ---------------------------------------- 4.69s
Get available mappers for client ---------------------------------------- 3.94s
Get internal ID for client keystone ------------------------------------- 3.79s
Get available clients in realm ------------------------------------------ 3.73s
Get available components in realm --------------------------------------- 3.63s
Add privateKey and certificate to realm --------------------------------- 2.41s
Filter available clients in realm --------------------------------------- 1.72s
Filter available realms ------------------------------------------------- 1.60s
Filter available mappers for client ------------------------------------- 1.38s
Filter available components in realm ------------------------------------ 1.31s
Filter internal ID for client keystone ---------------------------------- 1.14s
make[1]: *** [Makefile:115: deploy] Error 2
make[1]: Leaving directory '/home/claris/osism/osism/nobel-testbed/testbed/terraform'
make: *** [Makefile:108: deploy] Error 2
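The deploy step keeps tripping over Keycloak: the service-availability wait times out in one run, and in the next run the "Add privateKey and certificate to realm" task fails with its output censored by no_log. A hedged manual probe of the endpoint the wait task polls:

# Probe the same URL the "Wait until service is available" task uses
curl -skf https://keycloak.testbed.osism.xyz/auth/ -o /dev/null \
  && echo "keycloak reachable" || echo "keycloak not reachable"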

lindenb1 commented on Jun 12, 2024

Health check for ceph-ansible after manager deployment failed:

[…]
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ ((  attempt_num++ == max_attempts  ))
+ sleep 5
++ /usr/bin/docker inspect -f '{{.State.Health.Status}}' ceph-ansible
template parsing error: template: :1:8: executing "" at <.State.Health.Status>: map has no entry for key "Health"
+ [[ '' == \h\e\a\l\t\h\y ]]
+ ((  attempt_num++ == max_attempts  ))
+ return 1
make[1]: *** [Makefile:125: deploy-manager] Error 1
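The template parsing error appears because the ceph-ansible container has no healthcheck defined in this state, so .State.Health does not exist and the format string errors out instead of returning an empty status. A hedged, more defensive variant of the probe used by the wait loop:

# Tolerate containers without a healthcheck instead of raising a template error
/usr/bin/docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' ceph-ansible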

sbstnnmnn commented on Jun 25, 2024

drivetemp currently cannot be activated on the testbed, even though the integration tests are working fine:

fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "msg": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "name": "drivetemp", "params": "", "rc": 1, "state": "present", "stderr": "modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64\n", "stderr_lines": ["modprobe: FATAL: Module drivetemp not found in directory /lib/modules/6.1.0-18-cloud-amd64"], "stdout": "", "stdout_lines": []}

Got this error again. It is probably connected to this change: https://github.com/osism/testbed/commit/16a35ad37376125f107c5e842aa5492b8468cbc4 and to the call of deploy-manager.sh at terraform/Makefile:126.

sbstnnmnn commented on Jun 25, 2024

401 error in the Nexus role:

TASK [osism.services.nexus : Deleting script create_repos_from_list] ***********
Monday 08 July 2024  14:08:40 +0000 (0:00:02.768)       0:00:41.744 *********** 
fatal: [testbed-manager.testbed.osism.xyz]: FAILED! => {"changed": false, "connection": "close", "content_length": "0", "date": "Mon, 08 Jul 2024 14:08:41 GMT", "elapsed": 0, "msg": "Status code was 401 and not [204, 404]: HTTP Error 401: Unauthorized", "redirected": false, "server": "Nexus/3.69.0-02 (OSS)", "status": 401, "url": "https://nexus.testbed.osism.xyz/service/rest/v1/script/create_repos_from_list", "www_authenticate": "BASIC realm=\"Sonatype Nexus Repository Manager\"", "x_content_type_options": "nosniff"}
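The 401 against the Nexus script API suggests the role is using wrong (or not yet initialized) admin credentials, or that the script API is disabled. A hedged manual check against the same endpoint (credentials are placeholders):

# Expect 200/404 with valid credentials, 401 with invalid ones
curl -s -o /dev/null -w '%{http_code}\n' -u admin:<password> \
  https://nexus.testbed.osism.xyz/service/rest/v1/script/create_repos_from_list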

sbstnnmnn commented on Jul 8, 2024

The community.docker Ansible Galaxy collection artifact is offline:

https://galaxy.ansible.com/api/v3/plugin/ansible/content/published/collections/artifacts/community-docker-3.10.4.tar.gz

The file no longer exists on AWS.
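A hedged check that the artifact is really gone, plus the obvious workaround of installing a release that Galaxy still serves instead of the pinned 3.10.4:

# Expect a non-200 status for the missing artifact
curl -s -o /dev/null -w '%{http_code}\n' \
  https://galaxy.ansible.com/api/v3/plugin/ansible/content/published/collections/artifacts/community-docker-3.10.4.tar.gz

# Install a different release of the collection
ansible-galaxy collection install 'community.docker:>=3.10.0'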

sbstnnmnn commented on Jul 8, 2024

Debian 12 now deploys through all major stages (a run sketch follows the list):

  • 001-helper-services.sh
  • 005-kubernetes.sh
  • 006-kubernetes-clusterapi.sh
  • 100-ceph-services.sh
  • 200-infrastructure-services.sh
  • 300-openstack-services.sh
  • 310-openstack-services-extended.sh
  • 400-monitoring-services.sh
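For reference, a sketch of running these stages in order from the manager (directory as shown in the console prompts above):

cd /opt/configuration/scripts/deploy
for stage in 001-helper-services.sh 005-kubernetes.sh 006-kubernetes-clusterapi.sh \
             100-ceph-services.sh 200-infrastructure-services.sh \
             300-openstack-services.sh 310-openstack-services-extended.sh \
             400-monitoring-services.sh; do
    bash "$stage" || break   # stop at the first failing stage
done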

lindenb1 commented on Aug 26, 2024