HPCCloud
EC2 Clusters getting stuck in states
What are the cluster states if you call the REST endpoint directly?
The same as in the screenshot there.
On Fri, May 20, 2016 at 1:54 PM, Chris Harris [email protected] wrote:
What are the cluster states if you call the REST endpoint directly?
- Tristan Wright
R&D Engineer Kitware Inc.
And what about their logs?
The UI matches what the endpoint gives. Here's the end of one (0099900), formatted:
[15:33:34.335] INFO: TASK: exec : Install NFS client [starting] [15:33:34.338] INFO: TASK: exec : Install NFS client [finished] { "_ansible_no_log": false, "cache_update_time": 0, "cache_updated": false, "changed": false, "module_name": "apt" } [15:33:34.340] INFO: TASK: exec : Install NFS client [finished] { "_ansible_no_log": false, "cache_update_time": 0, "cache_updated": false, "changed": true, "module_name": "apt", "stderr": "\nCreating config file /etc/idmapd.conf with new version\n\nCreating config file /etc/default/nfs-common with new version\n", "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nThe following extra packages will be installed:\n keyutils libgssglue1 libnfsidmap2 libtirpc1 rpcbind\nSuggested packages:\n open-iscsi watchdog\nThe following NEW packages will be installed:\n keyutils libgssglue1 libnfsidmap2 libtirpc1 nfs-common rpcbind\n0 upgraded, 6 newly installed, 0 to remove and 29 not upgraded.\nNeed to get 375 kB of archives.\nAfter this operation, 1524 kB of additional disk space will be used.\nGet:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libgssglue1 amd64 0.4-2ubuntu1 [19.7 kB]\nGet:2 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libnfsidmap2 amd64 0.25-5 [32.2 kB]\nGet:3 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libtirpc1 amd64 0.2.2-5ubuntu2 [71.3 kB]\nGet:4 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main keyutils amd64 1.5.6-1 [33.6 kB]\nGet:5 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main rpcbind amd64 0.2.1-2ubuntu2.2 [37.1 kB]\nGet:6 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main nfs-common amd64 1:1.2.8-6ubuntu1.2 [181 kB]\nFetched 375 kB in 0s (9351 kB/s)\nSelecting previously unselected package libgssglue1:amd64.\n(Reading database ... 
74563 files and directories currently installed.)\nPreparing to unpack .../libgssglue1_0.4-2ubuntu1_amd64.deb ...\nUnpacking libgssglue1:amd64 (0.4-2ubuntu1) ...\nSelecting previously unselected package libnfsidmap2:amd64.\nPreparing to unpack .../libnfsidmap2_0.25-5_amd64.deb ...\nUnpacking libnfsidmap2:amd64 (0.25-5) ...\nSelecting previously unselected package libtirpc1:amd64.\nPreparing to unpack .../libtirpc1_0.2.2-5ubuntu2_amd64.deb ...\nUnpacking libtirpc1:amd64 (0.2.2-5ubuntu2) ...\nSelecting previously unselected package keyutils.\nPreparing to unpack .../keyutils_1.5.6-1_amd64.deb ...\nUnpacking keyutils (1.5.6-1) ...\nSelecting previously unselected package rpcbind.\nPreparing to unpack .../rpcbind_0.2.1-2ubuntu2.2_amd64.deb ...\nUnpacking rpcbind (0.2.1-2ubuntu2.2) ...\nSelecting previously unselected package nfs-common.\nPreparing to unpack .../nfs-common_1%3a1.2.8-6ubuntu1.2_amd64.deb ...\nUnpacking nfs-common (1:1.2.8-6ubuntu1.2) ...\nProcessing triggers for man-db (2.6.7.1-1ubuntu1) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\nSetting up libgssglue1:amd64 (0.4-2ubuntu1) ...\nSetting up libnfsidmap2:amd64 (0.25-5) ...\nSetting up libtirpc1:amd64 (0.2.2-5ubuntu2) ...\nSetting up keyutils (1.5.6-1) ...\nSetting up rpcbind (0.2.1-2ubuntu2.2) ...\n Removing any system startup links for /etc/init.d/rpcbind ...\nrpcbind start/running, process 2837\nProcessing triggers for ureadahead (0.100.0-16) ...\nSetting up nfs-common (1:1.2.8-6ubuntu1.2) ...\nAdding system user `statd' (UID 110) ...\nAdding new user `statd' (UID 110) with group `nogroup' ...\nNot creating home directory `/var/lib/nfs'.\nstatd start/running, process 3062\ngssd stop/pre-start, process 3096\nidmapd start/running, process 3146\nProcessing triggers for libc-bin (2.19-0ubuntu6.7) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\n", "stdout_lines": [ "Reading package lists...", "Building dependency tree...", "Reading state information...", "The following extra packages 
will be installed:", " keyutils libgssglue1 libnfsidmap2 libtirpc1 rpcbind", "Suggested packages:", " open-iscsi watchdog", "The following NEW packages will be installed:", " keyutils libgssglue1 libnfsidmap2 libtirpc1 nfs-common rpcbind", "0 upgraded, 6 newly installed, 0 to remove and 29 not upgraded.", "Need to get 375 kB of archives.", "After this operation, 1524 kB of additional disk space will be used.", "Get:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libgssglue1 amd64 0.4-2ubuntu1 [19.7 kB]", "Get:2 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libnfsidmap2 amd64 0.25-5 [32.2 kB]", "Get:3 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libtirpc1 amd64 0.2.2-5ubuntu2 [71.3 kB]", "Get:4 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main keyutils amd64 1.5.6-1 [33.6 kB]", "Get:5 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main rpcbind amd64 0.2.1-2ubuntu2.2 [37.1 kB]", "Get:6 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main nfs-common amd64 1:1.2.8-6ubuntu1.2 [181 kB]", "Fetched 375 kB in 0s (9351 kB/s)", "Selecting previously unselected package libgssglue1:amd64.", "(Reading database ... 
74563 files and directories currently installed.)", "Preparing to unpack .../libgssglue1_0.4-2ubuntu1_amd64.deb ...", "Unpacking libgssglue1:amd64 (0.4-2ubuntu1) ...", "Selecting previously unselected package libnfsidmap2:amd64.", "Preparing to unpack .../libnfsidmap2_0.25-5_amd64.deb ...", "Unpacking libnfsidmap2:amd64 (0.25-5) ...", "Selecting previously unselected package libtirpc1:amd64.", "Preparing to unpack .../libtirpc1_0.2.2-5ubuntu2_amd64.deb ...", "Unpacking libtirpc1:amd64 (0.2.2-5ubuntu2) ...", "Selecting previously unselected package keyutils.", "Preparing to unpack .../keyutils_1.5.6-1_amd64.deb ...", "Unpacking keyutils (1.5.6-1) ...", "Selecting previously unselected package rpcbind.", "Preparing to unpack .../rpcbind_0.2.1-2ubuntu2.2_amd64.deb ...", "Unpacking rpcbind (0.2.1-2ubuntu2.2) ...", "Selecting previously unselected package nfs-common.", "Preparing to unpack .../nfs-common_1%3a1.2.8-6ubuntu1.2_amd64.deb ...", "Unpacking nfs-common (1:1.2.8-6ubuntu1.2) ...", "Processing triggers for man-db (2.6.7.1-1ubuntu1) ...", "Processing triggers for ureadahead (0.100.0-16) ...", "Setting up libgssglue1:amd64 (0.4-2ubuntu1) ...", "Setting up libnfsidmap2:amd64 (0.25-5) ...", "Setting up libtirpc1:amd64 (0.2.2-5ubuntu2) ...", "Setting up keyutils (1.5.6-1) ...", "Setting up rpcbind (0.2.1-2ubuntu2.2) ...", " Removing any system startup links for /etc/init.d/rpcbind ...", "rpcbind start/running, process 2837", "Processing triggers for ureadahead (0.100.0-16) ...", "Setting up nfs-common (1:1.2.8-6ubuntu1.2) ...", "Adding system user `statd' (UID 110) ...", "Adding new user `statd' (UID 110) with group `nogroup' ...", "Not creating home directory `/var/lib/nfs'.", "statd start/running, process 3062", "gssd stop/pre-start, process 3096", "idmapd start/running, process 3146", "Processing triggers for libc-bin (2.19-0ubuntu6.7) ...", "Processing triggers for ureadahead (0.100.0-16) ..." ] }
[15:33:34.340] INFO: TASK: exec : Mounting /home from master [starting] [15:33:34.340] INFO: TASK: exec : Mounting /home from master [skipped] { "host": "52.38.206.199" } [15:33:34.341] INFO: TASK: exec : Mounting /home from master [finished] { "_ansible_no_log": false, "changed": true, "fstab": "/etc/fstab", "fstype": "nfs", "module_name": "mount", "name": "/home", "src": "172.31.31.74:/home" }
So are the EC2 instances still running? Can you look for errors in /var/log/celery/command.log?
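A quick way to look for errors in that log is to scan it for suspicious phrases. The following is a minimal sketch, assuming the log is plain text; the marker list and the helper names `find_suspect_lines` and `scan_log` are illustrative and not part of cumulus.

```python
import re

# Hypothetical markers; adjust them to the failures you are hunting for.
SUSPECT = re.compile(r"critical error|daemonize error|traceback", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that match a suspect marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


def scan_log(path):
    """Print the line number and first 200 characters of each suspect line in a log file."""
    with open(path, errors="replace") as log:
        for number, text in find_suspect_lines(log):
            print(f"{number}: {text[:200]}")
```

Calling `scan_log("/var/log/celery/command.log")` on the machine running the Celery workers would print the matching lines. Entries in this log can be very long, hence the truncation.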
Instances aren't on the AWS dashboard anymore...
This one in celery/command.log looks a little suspect:
[2016-05-04 20:51:34,616: INFO/Worker-5] cumulus.ansible.tasks.cluster.provision_cluster[None]: changed: [52.39.154.233] => (item=[u'gridengine-client', u'gridengine-master', u'gridengine-exec']) => {"cache_update_time": 0, "cache_updated": false, "changed": true, "invocation": {"module_args": {"cache_valid_time": null, "deb": null, "default_release": null, "dpkg_options": "force-confdef,force-confold", "force": false, "install_recommends": null, "name": ["gridengine-client", "gridengine-master", "gridengine-exec"], "package": ["gridengine-client", "gridengine-master", "gridengine-exec"], "purge": false, "state": "present", "update_cache": false, "upgrade": null}, "module_name": "apt"}, "item": ["gridengine-client", "gridengine-master", "gridengine-exec"], "stderr": "..........................critical error: abort qmaster registration due to communication errors\n\ndaemonize error: child exited before sending daemonize state\nInitializing cluster with the following parameters:\n => SGE_ROOT: /var/lib/gridengine\n => SGE_CELL: default\n => Spool directory: /var/spool/gridengine/spooldb\n => Initial manager user: sgeadmin\nInitializing spool (/var/spool/gridengine/spooldb)\nInitializing global configuration based on /usr/share/gridengine/default-configuration\nInitializing complexes based on /usr/share/gridengine/centry\nInitializing usersets based on /usr/share/gridengine/usersets\nAdding user sgeadmin as a manager\nCluster creation complete\n", "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nSuggested packages:\n gridengine-qmon\nThe following NEW packages will be installed:\n gridengine-client gridengine-exec gridengine-master\n0 upgraded, 3 newly installed, 0 to remove and 4 not upgraded.\nNeed to get 2412 kB/6669 kB of archives.\nAfter this operation, 43.5 MB of additional disk space will be used.\nGet:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/universe gridengine-master amd64 6.2u5-7.3 [2412 
kB]\nPreconfiguring packages ...\nFetched 2412 kB in 0s (16.6 MB/s)\nSelecting previously unselected package gridengine-client.\n(Reading database ... 58612 files and directories currently installed.)\nPreparing to unpack .../gridengine-client_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-client (6.2u5-7.3) ...\nSelecting previously unselected package gridengine-exec.\nPreparing to unpack .../gridengine-exec_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-exec (6.2u5-7.3) ...\nSelecting previously unselected package gridengine-master.\nPreparing to unpack .../gridengine-master_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-master (6.2u5-7.3) ...\nProcessing triggers for man-db (2.6.7.1-1ubuntu1) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\nSetting up gridengine-client (6.2u5-7.3) ...\nSetting up gridengine-exec (6.2u5-7.3) ...\nSetting up gridengine-master (6.2u5-7.3) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Suggested packages:", " gridengine-qmon", "The following NEW packages will be installed:", " gridengine-client gridengine-exec gridengine-master", "0 upgraded, 3 newly installed, 0 to remove and 4 not upgraded.", "Need to get 2412 kB/6669 kB of archives.", "After this operation, 43.5 MB of additional disk space will be used.", "Get:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/universe gridengine-master amd64 6.2u5-7.3 [2412 kB]", "Preconfiguring packages ...", "Fetched 2412 kB in 0s (16.6 MB/s)", "Selecting previously unselected package gridengine-client.", "(Reading database ... 
58612 files and directories currently installed.)", "Preparing to unpack .../gridengine-client_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-client (6.2u5-7.3) ...", "Selecting previously unselected package gridengine-exec.", "Preparing to unpack .../gridengine-exec_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-exec (6.2u5-7.3) ...", "Selecting previously unselected package gridengine-master.", "Preparing to unpack .../gridengine-master_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-master (6.2u5-7.3) ...", "Processing triggers for man-db (2.6.7.1-1ubuntu1) ...", "Processing triggers for ureadahead (0.100.0-16) ...", "Setting up gridengine-client (6.2u5-7.3) ...", "Setting up gridengine-exec (6.2u5-7.3) ...", "Setting up gridengine-master (6.2u5-7.3) ...", "Processing triggers for ureadahead (0.100.0-16) ..."]}
Yep, that looks like the root cause, but I would have expected the cluster to have moved into the error state...
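Note that the result in the log reports `"changed": true`, with the failure visible only in `stderr`. As a rough illustration of how such a result could be flagged, here is a sketch of a check over a result dictionary. The helper `result_looks_failed` and the marker list are hypothetical; this is not how cumulus or Ansible decide that a task failed.

```python
# Hypothetical markers taken from the stderr text in the log above.
FAILURE_MARKERS = ("critical error", "daemonize error")


def result_looks_failed(result):
    """Return True if the result is marked failed or its stderr holds a known marker."""
    if result.get("failed"):
        return True
    stderr = result.get("stderr", "").lower()
    return any(marker in stderr for marker in FAILURE_MARKERS)


# Trimmed copy of the apt result from the log above.
apt_result = {
    "changed": True,
    "stderr": (
        "..........................critical error: abort qmaster registration "
        "due to communication errors\n\n"
        "daemonize error: child exited before sending daemonize state\n"
    ),
}

print(result_looks_failed(apt_result))
```

String matching on `stderr` is brittle, so a check like this is better suited to triage than to driving state changes.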
After fixing the issue that prevented the error state from transitioning to anything else, I haven't seen this. Closeable?
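For readers unfamiliar with the kind of bug described here, the sketch below shows a state transition table in which the error state still has a way out. The state names `launching`, `provisioning`, `terminating`, and `error` come from this thread; `running`, `terminated`, and every allowed transition are assumptions made for illustration, not the actual cumulus rules.

```python
# Hypothetical transition table; only some state names come from this thread.
ALLOWED_TRANSITIONS = {
    "launching": {"provisioning", "error", "terminating"},
    "provisioning": {"running", "error", "terminating"},
    "running": {"error", "terminating"},
    # If this set were empty, a cluster in the error state could never leave it.
    "error": {"terminating"},
    "terminating": {"terminated", "error"},
    "terminated": set(),
}


def can_transition(current, new):
    """Return True if moving from `current` to `new` is allowed by the table."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

With this table, `can_transition("error", "terminating")` is allowed, while nothing is allowed out of `terminated`.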
I'm seeing this again. I have a few stuck in terminating, provisioning, or launching states.
In the case of "my new 52", the last one in that list, I can see the instances are still running in the console.