ibm-spectrum-scale-install-infra

Check the configuration for required variables and fail the playbook early on with a good message

Open • whowutwut opened this issue 4 years ago • 4 comments

For required variables, could we check the configuration as the playbooks start running and fail with a clear message if anything is missing?

I do see this comment:

The filesystem parameter is mandatory, servers, and the device parameter is mandatory for each of the file system's disks. All other file system and disk parameters are optional. Hence, a minimal file system configuration would look like this:

I am currently using this repo in multiple internal deployments for automation, where I clone the repo and branch from it. Right now we have no official tags, but things are changing pretty drastically. To keep some stability, I clone against my forked copy, but as I move master up to track upstream, I potentially break things.

I just tested a branch based on a more recent commit, and so I hit code changes that require servers to be defined. The error I hit is:

fatal: [worker4]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'servers'\n\nThe error appears to be in '/root/ibm-spectrum-scale-install-infra/roles/core/cluster/tasks/storage.yml': line 46, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: storage | Find defined NSDs\n ^ here\n"}

While it's OK for me to go into that play and figure out what is missing (I had to add some debug to confirm it was servers), I believe it would drastically improve usability if we had an up-front check for the required values that stops the playbook with an informative message.

And I fully understand that it's documented in the README and we could consider this "user error", but unless we have changelog and release info, most customers are not going to continuously re-read the README and/or diff it to see what has changed from one commit to another.
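
Something like the following minimal sketch is what I have in mind. It assumes the scale_storage / filesystem / disks / device / servers layout documented in the README; the task name and its placement in a precheck-style role are just placeholders:

# Hypothetical early check; only the variable names come from the README.
- name: precheck | Verify mandatory scale_storage parameters
  assert:
    that:
      - item.filesystem is defined
      - item.disks is defined
      - item.disks | default([]) | selectattr('device', 'undefined') | list | length == 0
      - item.disks | default([]) | selectattr('servers', 'undefined') | list | length == 0
    fail_msg: >-
      scale_storage entry '{{ item.filesystem | default("<unnamed>") }}' is missing a
      mandatory parameter: each file system needs 'filesystem', and each of its disks
      needs 'device' and 'servers' (see the README for a minimal configuration).
  loop: "{{ scale_storage | default([]) }}"
  run_once: true

Failing here, before any package is installed or any mm* command runs, would turn the "dict object has no attribute" traceback above into a one-line explanation of what is missing.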

whowutwut avatar Apr 14 '20 01:04 whowutwut

@whowutwut I agree with this one.

I intentionally left the secondary server of an NSD out of the hosts file and ran the playbook. The run later fails on the mmchlicense command because of the missing server, but this could have been detected and corrected much earlier (see the sketch after the play recap below).

# hosts:
[cluster01]
node-vm1 scale_cluster_quorum=true   scale_cluster_manager=true scale_cluster_gui=false
node-vm2 scale_cluster_quorum=true   scale_cluster_manager=true scale_cluster_gui=false
node-vm3 scale_cluster_quorum=true   scale_cluster_manager=true scale_cluster_gui=true
[root@node-vm1 ibm-spectrum-scale-install-infra]# cat group_vars/all.yml
---
scale_storage:
  - filesystem: gpfsscorch
    disks:
      - device: /dev/sdf
        servers: node-vm3,node-vm4    <<< node-vm4 was not added to the hosts file, which eventually led to the failure:

TASK [core/cluster : storage | Accept server license for NSD servers] ********************************************************************************************************
fatal: [node-vm1]: FAILED! => {"changed": true, "cmd": ["/usr/lpp/mmfs/bin/mmchlicense", "server", "--accept", "-N", "node-vm3,node-vm4,node-vm3"], "delta": "0:00:00.665907", "end": "2020-06-25 14:14:55.825218", "msg": "non-zero return code", "rc": 1, "start": "2020-06-25 14:14:55.159311", "stderr": "mmchlicense: Incorrect node node-vm4 specified for command.\nmmchlicense: No nodes were found that matched the input specification.\nmmchlicense: Command failed. Examine previous error messages to determine cause.", "stderr_lines": ["mmchlicense: Incorrect node node-vm4 specified for command.", "mmchlicense: No nodes were found that matched the input specification.", "mmchlicense: Command failed. Examine previous error messages to determine cause."], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ***********************************************************************************************************************************************************

PLAY RECAP *******************************************************************************************************************************************************************
node-vm1                 : ok=90   changed=16   unreachable=0    failed=1    skipped=56   rescued=0    ignored=0
node-vm2                 : ok=61   changed=8    unreachable=0    failed=0    skipped=24   rescued=0    ignored=0
node-vm3                 : ok=61   changed=8    unreachable=0    failed=0    skipped=24   rescued=0    ignored=0
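
A check along these lines would have caught this before mmchlicense ever ran. It is only a sketch: it assumes the scale_storage layout from the README and that servers is a comma-separated list of inventory hostnames, as in the example above:

# Hypothetical cross-check of NSD servers against the hosts in the play.
- name: precheck | Verify all NSD servers are part of the play
  assert:
    that:
      - item.1.servers.split(',') | difference(ansible_play_hosts_all) | length == 0
    fail_msg: >-
      Disk {{ item.1.device }} of file system {{ item.0.filesystem }} references
      server(s) {{ item.1.servers.split(',') | difference(ansible_play_hosts_all) | join(', ') }}
      that are not in the play -- mmchlicense/mmcrnsd would fail on them later.
  loop: "{{ scale_storage | default([]) | subelements('disks', skip_missing=True) }}"
  run_once: true
  when: item.1.servers is defined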

mrolyat avatar Jun 25 '20 21:06 mrolyat

Not 100% related to this, but still an issue about failing out of the entire run. Today I hit a problem with the master node, where the IP address in /etc/hosts did not match the active IP address on that node. I SSH into the master on a different subnet, but the nodes should run the playbook against the 10.x subnet.

Since the /etc/hosts IP address for the master node was incorrect, the play failed immediately with that node unreachable.

2020-07-06 21:33:03,253 p=3062 u=root n=ansible | [WARNING]: running playbook inside collection ibm_spectrum_scale.install_infra

2020-07-06 21:33:03,345 p=3062 u=root n=ansible | [WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details

2020-07-06 21:33:08,288 p=3062 u=root n=ansible | PLAY [scale-access2-x] *********************************************************
2020-07-06 21:33:09,147 p=3062 u=root n=ansible | TASK [Gathering Facts] *********************************************************
2020-07-06 21:33:11,673 p=3062 u=root n=ansible | ok: [scale-access2-x-worker1]
2020-07-06 21:33:11,710 p=3062 u=root n=ansible | ok: [scale-access2-x-worker2]
2020-07-06 21:33:12,245 p=3062 u=root n=ansible | fatal: [scale-access2-x-master]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host scale-access2-x-master port 22: No route to host", "unreachable": true}
...
2020-07-06 21:45:14,526 p=3062 u=root n=ansible | PLAY RECAP *********************************************************************
2020-07-06 21:45:14,526 p=3062 u=root n=ansible | scale-access2-x-master     : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0
2020-07-06 21:45:14,527 p=3062 u=root n=ansible | scale-access2-x-worker1    : ok=217  changed=15   unreachable=0    failed=0    skipped=237  rescued=0    ignored=0
2020-07-06 21:45:14,527 p=3062 u=root n=ansible | scale-access2-x-worker2    : ok=154  changed=8    unreachable=0    failed=0    skipped=92   rescued=0    ignored=0

At the end, the playbook finished running, but Spectrum Scale was not installed correctly. Is there any reason we don't just fail if certain nodes are not reachable? (A sketch of such a pre-flight check follows the command output below.)

[root@scale-access2-x-master install-infra]# mmlscluster
-bash: mmlscluster: command not found
[root@scale-access2-x-master install-infra]# /usr/lpp/mmfs
-bash: /usr/lpp/mmfs: No such file or directory
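
One way to fail the whole run up front would be a separate pre-flight play ahead of the install plays, sketched below. The group name cluster01 is taken from the earlier hosts example; everything else is a placeholder:

# Hypothetical pre-flight play; fails every node if any node is unreachable.
- hosts: cluster01
  gather_facts: false
  tasks:
    - name: precheck | Test SSH connectivity to every node
      wait_for_connection:
        timeout: 30

    # wait_for_connection drops unreachable nodes out of ansible_play_hosts,
    # so this task then fails on every remaining node as well, which keeps the
    # later install plays from running against a partial cluster.
    - name: precheck | Abort everywhere if any node failed the connectivity test
      fail:
        msg: >-
          Unreachable node(s): {{ ansible_play_hosts_all | difference(ansible_play_hosts) | join(', ') }}.
          Fix /etc/hosts or SSH connectivity before re-running.
      when: ansible_play_hosts_all | difference(ansible_play_hosts) | length > 0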

whowutwut avatar Jul 07 '20 05:07 whowutwut

@whowutwut We saw lots of issues in the Chef-based toolkit where /etc/hosts is formatted in different ways on different nodes (often "IP short FQDN" vs. "IP FQDN short"), so some validation here would be a good thing to help with consistency and best practice, and to catch failures early on.
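
As a rough illustration of that kind of validation (only a sketch, since which address is the "right" one, daemon vs. admin network, depends on the environment), each node could check that its short hostname resolves to the address it actually uses:

# Hypothetical /etc/hosts sanity check; relies on gathered facts.
- name: precheck | Resolve the short hostname via /etc/hosts and DNS
  command: getent hosts {{ ansible_hostname }}
  register: scale_name_lookup
  changed_when: false

- name: precheck | Verify the resolved address matches the active address
  assert:
    that:
      - ansible_default_ipv4.address in scale_name_lookup.stdout
    fail_msg: >-
      {{ inventory_hostname }}: '{{ ansible_hostname }}' resolves to
      '{{ scale_name_lookup.stdout }}', but the default interface address is
      {{ ansible_default_ipv4.address }} -- fix /etc/hosts before installing.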

mrolyat avatar Jul 07 '20 18:07 mrolyat

Yeah, this was a config issue in the environment, but because it ran on the workers without the master, mmlscluster did not work from the master. I guess I should have fixed the master and re-run to see if everything is skipped on the workers (idempotent), but if the cluster doesn't function after this playbook is run, should we even let it run?

I guess we should discuss what behavior we want to have: https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#maximum-failure-percentage
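
For reference, this is roughly what those play-level keywords would look like; the group name comes from the earlier hosts example and the role name from the error path above, so treat it as a sketch rather than the repo's actual playbook:

# Hypothetical top-level play showing the two fail-fast options.
- hosts: cluster01
  any_errors_fatal: true       # abort the play as soon as any task fails on any host
  # max_fail_percentage: 0     # alternative: abort once more than 0% of hosts fail
  roles:
    - core/cluster

Note that unreachable hosts are not always treated the same as task failures, depending on the Ansible version, which is why the pre-flight sketch above turns unreachability into an ordinary task failure first.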

whowutwut avatar Jul 08 '20 04:07 whowutwut