Stability of script for HDP platform
This isn't meant as a criticism, as I realise there are 1,000 possible things that could be going wrong, but this script successfully deploys a cluster in only about 1 in 5 attempts.
The error seems to be different each time, but the common ones are:

At upload of config scripts:
Uploading ...20150811-000113-Hq6/install-ambari-components.sh: 3.9 KiB/3.9 KiB
CommandException: 1 files/objects could not be transferred.
When running deploy scripts on master / workers:
Mon, Aug 10, 2015 11:55:27 PM: Exited 1 : gcloud --project=yyyy --quiet --verbosity=info compute ssh hadoop-w-1 --command=sudo su -l -c "cd ${PWD} && ./ambari-setup.sh" 2>>ambari-setup_deploy.stderr 1>>ambari-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=europe-west1-b
Mon, Aug 10, 2015 11:55:28 PM: Fetching on-VM logs from hadoop-w-1
Warning: Permanently added 'x.y.z.m' (RSA) to the list of known hosts.
...Mon, Aug 10, 2015 11:57:43 PM: Command failed: wait ${SUBPROC} on line 326.
During the ambari-components install:
Mon, Aug 10, 2015 11:43:54 PM: Step 'deploy-client-nfs-setup,deploy-client-nfs-setup' done...
Mon, Aug 10, 2015 11:43:54 PM: Invoking on master: ./install-ambari-components.sh
../bdutil: line 318: 10548 Segmentation fault sleep '0.5'
These failures are by nature hard to reproduce: I am running the same script each time, yet the error varies.
Thanks, every report helps :)
The "Segmentation fault" error is something we've never seen before; do you happen to know if the errors you're hitting are specific to ambari_env.sh, or do they also happen when you try to deploy default bdutil clusters?
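If it helps to compare the two cases, here is a minimal sketch for counting how many of N attempts succeed with a given command. The function name `count_successes` and the wrapper scripts in the usage comment are hypothetical, not part of bdutil; each wrapper is assumed to deploy a cluster, delete it again, and exit non-zero on failure.

```shell
#!/usr/bin/env bash
# Sketch only: count_successes is a hypothetical helper, not part of bdutil.
# Usage: count_successes ATTEMPTS CMD [ARGS...]
# Runs CMD the given number of times and prints "succeeded/attempts".
count_successes() {
  local attempts=$1
  shift
  local ok=0
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Send the command's own stdout to stderr so that only the
    # final tally appears on this function's stdout.
    if "$@" 1>&2; then
      ok=$((ok + 1))
    fi
  done
  echo "${ok}/${attempts}"
}

# Example (hypothetical wrapper scripts, assumed to deploy and then delete a cluster):
#   count_successes 5 ./deploy-default.sh
#   count_successes 5 ./deploy-ambari.sh
```

Running the same number of attempts for the default deployment and for the ambari_env.sh deployment would show whether the failure rate is specific to the HDP platform scripts.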