
Improve validation around CM agent install/liveness to prevent opaque failures in later steps

Open sdairs opened this issue 4 years ago • 3 comments

Currently, a failure to install or start the CM agent does not stop the playbook. This causes problems in later steps, because the host name is missing from the list of names returned by the All Hosts page, which results in this common error:

"changed": false, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute '<hostname>'" }

This error gives no indication of what failed or why.

We should probably add some extra validation around this & produce more meaningful errors & failures

Such as

  • Validate that the CM agent is installed
  • Validate that it is heartbeating
  • Validate the All Hosts list against the cluster:children inventory
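As a rough illustration of the last check, here is a minimal Python sketch of comparing the inventory against the hosts CM knows about and failing with a readable message. This is not code from the collection: the function names `find_missing_hosts` and `validate_hosts` are hypothetical, both host lists are assumed to have been fetched already, and comparing host names case-insensitively is my assumption.

```python
def find_missing_hosts(inventory_hosts, cm_hosts):
    """Return the inventory hosts that are absent from the CM host list, sorted.

    Host names are compared case-insensitively (an assumption of this sketch).
    """
    known = {name.lower() for name in cm_hosts}
    return sorted(name for name in inventory_hosts if name.lower() not in known)


def validate_hosts(inventory_hosts, cm_hosts):
    """Raise a descriptive error if any inventory host is unknown to CM."""
    missing = find_missing_hosts(inventory_hosts, cm_hosts)
    if missing:
        raise RuntimeError(
            "Hosts in the cluster inventory are not registered with Cloudera Manager "
            "(is the CM agent installed and heartbeating?): " + ", ".join(missing)
        )
```

Failing at this point would name the affected hosts directly, instead of surfacing later as an undefined-variable error.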

Thoughts?

sdairs avatar Jun 10 '21 12:06 sdairs

Yes, 100% agree. There is a task that waits for hosts to heartbeat in, but if the play on that host has already failed, you could still end up in this scenario. Would you like to pick this up, @sdairs?

tmgstevens avatar Jun 10 '21 12:06 tmgstevens

I will take a look at it, but I won't be upset if someone else wants to jump on it. It'll take me a moment to understand what needs doing.

sdairs avatar Jun 10 '21 15:06 sdairs

@tmgstevens, I don't know if this is related to the scenario described above, but in the attached failure you can see that on one of the cluster machines the Ansible installation couldn't find the CM agent UUID file and stopped. At this point, if I stop and start the agent on that machine, the UUID file appears, but if I then run the installation again I get the same error.

image

Please let me know if I should open a dedicated issue for this.

ghost avatar May 10 '22 17:05 ghost