ibm-spectrum-scale-install-infra
Commands should not hang indefinitely - add retry and timeout
While re-running ansible.sh to add a new node (RHEL 7.7) to an existing cluster (4 RHEL 8.1 nodes), something prevented the "Start daemons" command from completing. The task hung for over an hour with no failure and no apparent retry or re-issue of the command.
It would be ideal if:
- the user's screen showed which command is being issued (to make it easier to troubleshoot or understand what is hanging)
- the user's screen showed each retry attempt (i.e. "retrying 1 out of 5 times")
- the user's screen showed the retry interval (i.e. "retrying 1 out of 5 times, waiting 30 more seconds")
- the user could configure their own retry count and interval if desired (e.g. 10 retries with a 30-second wait instead of the playbook's defaults); see the sketch after this list
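For illustration, a minimal sketch of what such a retry loop could look like as an Ansible task. This is not the role's actual implementation; the variable names scale_daemon_start_retries and scale_daemon_start_delay are hypothetical, and the check simply reuses mmgetstate (the same command dug out further down in this issue):

# Hypothetical sketch only -- not the role's real task. The two variables are
# made up so that retry count and interval can be overridden from the inventory.
- name: cluster | Wait for GPFS daemon to become active
  command: /usr/lpp/mmfs/bin/mmgetstate -N localhost -Y
  register: daemon_state
  changed_when: false
  # Retry until the machine-readable output contains 'active'. Ansible prints
  # "FAILED - RETRYING: ... (N retries left)" for each attempt, so the user can
  # see the retry counter and knows the play is still alive.
  until: "'active' in daemon_state.stdout"
  retries: "{{ scale_daemon_start_retries | default(5) }}"
  delay: "{{ scale_daemon_start_delay | default(30) }}"

With something like this, a user willing to wait longer could simply set scale_daemon_start_retries: 10 in their inventory, and a genuine hang would end in a clear failure instead of running forever.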
This hang may have been a result of the ssh key exchange issue reported in #52. The key exchange was performed from another login, but the playbook still never timed out or retried the command.
skipping: [shockjaw-vm1]
skipping: [shockjaw-vm3]
skipping: [shockjaw-vm2]
skipping: [shockjaw-vm4]
I also came across this issue, but I am not able to re-create it. It has been stuck on this step for over 5 hours...
At this point, it's not obvious to me what is going on, since the tasks give no output to the user. So if a customer hit this, how would we collect data, or what would we do to figure out what is going on?
So I had to go into the tasks before this one and figure out what command to run, and eventually found this:
# /usr/lpp/mmfs/bin/mmgetstate -N localhost -Y | grep -v HEADER | cut -d ':' -f 9
arbitrating
# /usr/lpp/mmfs/bin/mmgetstate -N localhost
Node number Node name GPFS state
-------------------------------------------
1 autogen-centos76-dev-x-master arbitrating
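If the playbook registered and printed this state while waiting, a stuck node would at least be visible to the user. A hedged sketch of what that could look like (not taken from the role; it just mirrors the manual pipeline above):

# Hedged sketch only -- mirrors the manual check above, not the role's tasks.
- name: cluster | Check local GPFS daemon state
  shell: "/usr/lpp/mmfs/bin/mmgetstate -N localhost -Y | grep -v HEADER | cut -d ':' -f 9"
  register: gpfs_state
  changed_when: false

# Print the state so it shows up in the Ansible output instead of a silent wait.
- name: cluster | Show local GPFS daemon state
  debug:
    msg: "GPFS state on {{ inventory_hostname }}: {{ gpfs_state.stdout }}"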
In the logs, I only see these kinds of messages:
2020-03-24_10:15:17.335-0700: [D] Failed to get a response while probing quorum node 172.16.241.226 (autogen-centos76-dev-x-worker2): error 233
2020-03-24_10:20:22.362-0700: [D] Failed to get a response while probing quorum node 172.16.241.225 (autogen-centos76-dev-x-worker1): error 233
2020-03-24_10:20:22.363-0700: [D] Failed to get a response while probing quorum node 172.16.241.226 (autogen-centos76-dev-x-worker2): error 233
2020-03-24_10:24:26.392-0700: [D] Failed to join cluster gpfs1.local. Make sure enough quorum nodes are up and there are no network issues.
Both workers are in the down state:
# ssh autogen-centos76-dev-x-worker1 "mmgetstate"
Node number Node name GPFS state
-------------------------------------------
2 autogen-centos76-dev-x-worker1 down
# ssh autogen-centos76-dev-x-worker2 "mmgetstate"
Node number Node name GPFS state
-------------------------------------------
3 autogen-centos76-dev-x-worker2 down
So to get around this, I ran mmshutdown and then re-ran the playbook, and it got past this issue. So I'm not sure...
To provide an update on my comment above: I've hit this on every deploy from scratch in the last day, and we finally figured out the root cause (at least for my issue). The ssh default config (/etc/ssh/ssh_config) did not override the default behavior for StrictHostKeyChecking, so ssh still prompts before accepting a new host key.
This causes the task TASK [core/cluster : cluster | Start daemons] to effectively complete only on the master node, as we can see from the changed: response appearing only for the master, before it hangs. We never see it succeed for the workers in the inventory, so it never reaches the wait-daemon-active handler.
I'm not sure how we would be able to get past this, because it's sitting on the prompt for approving the ssh key: "Are you sure you want to continue connecting (yes/no)?". One possible minor improvement I could suggest to help users figure this out (if they hit it) is to change:
core/cluster: cluster | Start Daemons
to
core/cluster: cluster | Start Daemons (Hint: ensure that 'StrictHostKeyChecking' is 'no' on all nodes)
If nothing else, it would reduce the debugging time by suggesting a potential solution.
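For reference, a hedged sketch of how the host key prompt could be avoided up front with a small play run before the installer. This is only one option and not part of the role; the use of /root/.ssh/config and a blanket Host * scope are assumptions, and relaxing StrictHostKeyChecking has security trade-offs that need to be weighed per environment:

# Hedged example only: pre-seed root's ssh client config on every node so that
# node-to-node GPFS commands (e.g. the master starting daemons on the workers)
# are not blocked by the "Are you sure you want to continue connecting" prompt.
- name: Relax StrictHostKeyChecking for root's ssh client on all nodes
  hosts: all
  become: true
  tasks:
    - name: Ensure /root/.ssh exists
      file:
        path: /root/.ssh
        state: directory
        mode: "0700"

    - name: Set StrictHostKeyChecking no for all hosts
      blockinfile:
        path: /root/.ssh/config
        create: true
        mode: "0600"
        block: |
          Host *
            StrictHostKeyChecking no

An alternative with the same effect, without loosening the check, would be to pre-populate /root/.ssh/known_hosts with the keys of all cluster nodes before the playbook runs.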