
AAP 2.0 Linux WS in RHPDS: Additional checks required so failed deployments don't report as successful

Open benblasco opened this issue 3 years ago • 4 comments

Problem Summary

Note: This issue relates to the checks done as the workshop deploys, which are probably agnostic to the deployment method. Issue created after discussion with @abenokraitis.

I have classified this as a bug because I believe we are effectively generating false positives for workshop deployments. In my case, the failure forced me to postpone a workshop by a week.

My workshop deployed "successfully" in RHPDS for 35 students, but at least 50% of the Ansible Controller instances were unreachable due to a failure in an external dependency as covered by issue 1427. Unreachable = not responding to ping, and not responding to HTTP/HTTPS requests.

Example error message from this workshop:

There are all sorts of errors in the deployer log for c558
RPM's being inaccessible
fatal: [c558-student27-ansible-1]: FAILED! => {"changed": false, "dest": "/tmp/code-server.rpm", "elapsed": 10, "msg": "Connection failure: The read operation timed out", "url": "https://github.com/cdr/code-server/releases/download/v3.10.2/code-server-3.10.2-amd64.rpm"}
TASK [ansible.workshops.code_server : Download code-server 3 rpm] **************
fatal: [c558-student20-ansible-1]: FAILED! => {"changed": false, "dest": "/tmp/code-server.rpm", "elapsed": 10, "msg": "Connection failure: The read operation timed out", "url": "https://github.com/cdr/code-server/releases/download/v3.10.2/code-server-3.10.2-amd64.rpm"}

The issue is not with the specific failure, but with the lack of checks to see whether the deployment was successful. Can we add checks that everything deployed is reachable before declaring success? For example: the Ansible Controller, the GitLab instance, the managed hosts (to automate on), etc.

Issue Type

Bug

Extra vars file

N/A (ie whatever is in RHPDS)

Ansible Playbook Output

N/A (ie whatever is in RHPDS)

Ansible Version

N/A (ie whatever is in RHPDS)

Ansible Configuration

N/A (ie whatever is in RHPDS)

Ansible Execution Node

Ansible Controller (previously known as Ansible Tower)

Operating System

N/A (ie whatever is in RHPDS)

benblasco avatar Nov 21 '21 22:11 benblasco

I used to use https://github.com/ffirg/workshop_checks for checking workshops after deployment. It was written before the AAP 2 revisions, but it can get you out of a hole when things go pop in the public cloud.

ffirg avatar Nov 21 '21 22:11 ffirg

Please reopen this, @IPvSean @anshulbehl. Fixing the issue in 1429 does not address the broader problem of having checks to verify that the workshop deployment actually completed successfully.

benblasco avatar Nov 23 '21 00:11 benblasco

I think this is an issue on the RHPDS side. That task should fail and stop provisioning, but for some reason RHPDS is continuing to send it on. I am not sure I understand. @tonykay

IPvSean avatar Nov 23 '21 17:11 IPvSean

Any updates on this one, @IPvSean @tonykay?

benblasco avatar Dec 06 '21 12:12 benblasco