
EC2 Clusters getting stuck in states

Open TristanWright opened this issue 8 years ago • 9 comments

screen shot 2016-05-20 at 13 27 16

TristanWright avatar May 20 '16 19:05 TristanWright

What are the cluster states if you call the REST endpoint directly?

cjh1 avatar May 20 '16 19:05 cjh1

The same as in the screenshot there.

TristanWright avatar May 20 '16 19:05 TristanWright

And what about their logs?

cjh1 avatar May 20 '16 19:05 cjh1

The UI matches what the endpoint gives. Here's the end of one (0099900), formatted:

[15:33:34.335] INFO: TASK: exec : Install NFS client [starting]
[15:33:34.338] INFO: TASK: exec : Install NFS client [finished]
{
  "_ansible_no_log": false,
  "cache_update_time": 0,
  "cache_updated": false,
  "changed": false,
  "module_name": "apt"
}
[15:33:34.340] INFO: TASK: exec : Install NFS client [finished]
{
  "_ansible_no_log": false,
  "cache_update_time": 0,
  "cache_updated": false,
  "changed": true,
  "module_name": "apt",
  "stderr": "\nCreating config file /etc/idmapd.conf with new
version\n\nCreating config file /etc/default/nfs-common with new
version\n",
  "stdout": "Reading package lists...\nBuilding dependency
tree...\nReading state information...\nThe following extra packages
will be installed:\n  keyutils libgssglue1 libnfsidmap2 libtirpc1
rpcbind\nSuggested packages:\n  open-iscsi watchdog\nThe following NEW
packages will be installed:\n  keyutils libgssglue1 libnfsidmap2
libtirpc1 nfs-common rpcbind\n0 upgraded, 6 newly installed, 0 to
remove and 29 not upgraded.\nNeed to get 375 kB of archives.\nAfter
this operation, 1524 kB of additional disk space will be used.\nGet:1
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
libgssglue1 amd64 0.4-2ubuntu1 [19.7 kB]\nGet:2
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
libnfsidmap2 amd64 0.25-5 [32.2 kB]\nGet:3
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main libtirpc1
amd64 0.2.2-5ubuntu2 [71.3 kB]\nGet:4
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main keyutils
amd64 1.5.6-1 [33.6 kB]\nGet:5
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main
rpcbind amd64 0.2.1-2ubuntu2.2 [37.1 kB]\nGet:6
http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty-updates/main
nfs-common amd64 1:1.2.8-6ubuntu1.2 [181 kB]\nFetched 375 kB in 0s
(9351 kB/s)\nSelecting previously unselected package
libgssglue1:amd64.\n(Reading database ... 74563 files and directories
currently installed.)\nPreparing to unpack
.../libgssglue1_0.4-2ubuntu1_amd64.deb ...\nUnpacking
libgssglue1:amd64 (0.4-2ubuntu1) ...\nSelecting previously unselected
package libnfsidmap2:amd64.\nPreparing to unpack
.../libnfsidmap2_0.25-5_amd64.deb ...\nUnpacking libnfsidmap2:amd64
(0.25-5) ...\nSelecting previously unselected package
libtirpc1:amd64.\nPreparing to unpack
.../libtirpc1_0.2.2-5ubuntu2_amd64.deb ...\nUnpacking libtirpc1:amd64
(0.2.2-5ubuntu2) ...\nSelecting previously unselected package
keyutils.\nPreparing to unpack .../keyutils_1.5.6-1_amd64.deb
...\nUnpacking keyutils (1.5.6-1) ...\nSelecting previously unselected
package rpcbind.\nPreparing to unpack
.../rpcbind_0.2.1-2ubuntu2.2_amd64.deb ...\nUnpacking rpcbind
(0.2.1-2ubuntu2.2) ...\nSelecting previously unselected package
nfs-common.\nPreparing to unpack
.../nfs-common_1%3a1.2.8-6ubuntu1.2_amd64.deb ...\nUnpacking
nfs-common (1:1.2.8-6ubuntu1.2) ...\nProcessing triggers for man-db
(2.6.7.1-1ubuntu1) ...\nProcessing triggers for ureadahead
(0.100.0-16) ...\nSetting up libgssglue1:amd64 (0.4-2ubuntu1)
...\nSetting up libnfsidmap2:amd64 (0.25-5) ...\nSetting up
libtirpc1:amd64 (0.2.2-5ubuntu2) ...\nSetting up keyutils (1.5.6-1)
...\nSetting up rpcbind (0.2.1-2ubuntu2.2) ...\n Removing any system
startup links for /etc/init.d/rpcbind ...\nrpcbind start/running,
process 2837\nProcessing triggers for ureadahead (0.100.0-16)
...\nSetting up nfs-common (1:1.2.8-6ubuntu1.2) ...\nAdding system
user `statd' (UID 110) ...\nAdding new user `statd' (UID 110) with
group `nogroup' ...\nNot creating home directory
`/var/lib/nfs'.\nstatd start/running, process 3062\ngssd
stop/pre-start, process 3096\nidmapd start/running, process
3146\nProcessing triggers for libc-bin (2.19-0ubuntu6.7)
...\nProcessing triggers for ureadahead (0.100.0-16) ...\n",
  "stdout_lines": [
    "Reading package lists...",
    "Building dependency tree...",
    "Reading state information...",
    "The following extra packages will be installed:",
    "  keyutils libgssglue1 libnfsidmap2 libtirpc1 rpcbind",
    "Suggested packages:",
    "  open-iscsi watchdog",
    "The following NEW packages will be installed:",
    "  keyutils libgssglue1 libnfsidmap2 libtirpc1 nfs-common rpcbind",
    "0 upgraded, 6 newly installed, 0 to remove and 29 not upgraded.",
    "Need to get 375 kB of archives.",
    "After this operation, 1524 kB of additional disk space will be used.",
    "Get:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
libgssglue1 amd64 0.4-2ubuntu1 [19.7 kB]",
    "Get:2 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
libnfsidmap2 amd64 0.25-5 [32.2 kB]",
    "Get:3 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
libtirpc1 amd64 0.2.2-5ubuntu2 [71.3 kB]",
    "Get:4 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/main
keyutils amd64 1.5.6-1 [33.6 kB]",
    "Get:5 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/
trusty-updates/main rpcbind amd64 0.2.1-2ubuntu2.2 [37.1 kB]",
    "Get:6 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/
trusty-updates/main nfs-common amd64 1:1.2.8-6ubuntu1.2 [181 kB]",
    "Fetched 375 kB in 0s (9351 kB/s)",
    "Selecting previously unselected package libgssglue1:amd64.",
    "(Reading database ... 74563 files and directories currently installed.)",
    "Preparing to unpack .../libgssglue1_0.4-2ubuntu1_amd64.deb ...",
    "Unpacking libgssglue1:amd64 (0.4-2ubuntu1) ...",
    "Selecting previously unselected package libnfsidmap2:amd64.",
    "Preparing to unpack .../libnfsidmap2_0.25-5_amd64.deb ...",
    "Unpacking libnfsidmap2:amd64 (0.25-5) ...",
    "Selecting previously unselected package libtirpc1:amd64.",
    "Preparing to unpack .../libtirpc1_0.2.2-5ubuntu2_amd64.deb ...",
    "Unpacking libtirpc1:amd64 (0.2.2-5ubuntu2) ...",
    "Selecting previously unselected package keyutils.",
    "Preparing to unpack .../keyutils_1.5.6-1_amd64.deb ...",
    "Unpacking keyutils (1.5.6-1) ...",
    "Selecting previously unselected package rpcbind.",
    "Preparing to unpack .../rpcbind_0.2.1-2ubuntu2.2_amd64.deb ...",
    "Unpacking rpcbind (0.2.1-2ubuntu2.2) ...",
    "Selecting previously unselected package nfs-common.",
    "Preparing to unpack .../nfs-common_1%3a1.2.8-6ubuntu1.2_amd64.deb ...",
    "Unpacking nfs-common (1:1.2.8-6ubuntu1.2) ...",
    "Processing triggers for man-db (2.6.7.1-1ubuntu1) ...",
    "Processing triggers for ureadahead (0.100.0-16) ...",
    "Setting up libgssglue1:amd64 (0.4-2ubuntu1) ...",
    "Setting up libnfsidmap2:amd64 (0.25-5) ...",
    "Setting up libtirpc1:amd64 (0.2.2-5ubuntu2) ...",
    "Setting up keyutils (1.5.6-1) ...",
    "Setting up rpcbind (0.2.1-2ubuntu2.2) ...",
    " Removing any system startup links for /etc/init.d/rpcbind ...",
    "rpcbind start/running, process 2837",
    "Processing triggers for ureadahead (0.100.0-16) ...",
    "Setting up nfs-common (1:1.2.8-6ubuntu1.2) ...",
    "Adding system user `statd' (UID 110) ...",
    "Adding new user `statd' (UID 110) with group `nogroup' ...",
    "Not creating home directory `/var/lib/nfs'.",
    "statd start/running, process 3062",
    "gssd stop/pre-start, process 3096",
    "idmapd start/running, process 3146",
    "Processing triggers for libc-bin (2.19-0ubuntu6.7) ...",
    "Processing triggers for ureadahead (0.100.0-16) ..."
  ]
}

[15:33:34.340] INFO: TASK: exec : Mounting /home from master [starting]
[15:33:34.340] INFO: TASK: exec : Mounting /home from master [skipped]
{
  "host": "52.38.206.199"
}
[15:33:34.341] INFO: TASK: exec : Mounting /home from master [finished]
{
  "_ansible_no_log": false,
  "changed": true,
  "fstab": "/etc/fstab",
  "fstype": "nfs",
  "module_name": "mount",
  "name": "/home",
  "src": "172.31.31.74:/home"
}

TristanWright avatar May 20 '16 20:05 TristanWright

So are the EC2 instances still running? Can you look for errors in /var/log/celery/command.log?

cjh1 avatar May 20 '16 20:05 cjh1

The instances aren't on the AWS dashboard anymore...

This one in celery/command.log looks a little suspect:

[2016-05-04 20:51:34,616: INFO/Worker-5] cumulus.ansible.tasks.cluster.provision_cluster[None]: changed: [52.39.154.233] => (item=[u'gridengine-client', u'gridengine-master', u'gridengine-exec']) => {"cache_update_time": 0, "cache_updated": false, "changed": true, "invocation": {"module_args": {"cache_valid_time": null, "deb": null, "default_release": null, "dpkg_options": "force-confdef,force-confold", "force": false, "install_recommends": null, "name": ["gridengine-client", "gridengine-master", "gridengine-exec"], "package": ["gridengine-client", "gridengine-master", "gridengine-exec"], "purge": false, "state": "present", "update_cache": false, "upgrade": null}, "module_name": "apt"}, "item": ["gridengine-client", "gridengine-master", "gridengine-exec"], "stderr": "..........................critical error: abort qmaster registration due to communication errors\n\ndaemonize error: child exited before sending daemonize state\nInitializing cluster with the following parameters:\n => SGE_ROOT: /var/lib/gridengine\n => SGE_CELL: default\n => Spool directory: /var/spool/gridengine/spooldb\n => Initial manager user: sgeadmin\nInitializing spool (/var/spool/gridengine/spooldb)\nInitializing global configuration based on /usr/share/gridengine/default-configuration\nInitializing complexes based on /usr/share/gridengine/centry\nInitializing usersets based on /usr/share/gridengine/usersets\nAdding user sgeadmin as a manager\nCluster creation complete\n", "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nSuggested packages:\n gridengine-qmon\nThe following NEW packages will be installed:\n gridengine-client gridengine-exec gridengine-master\n0 upgraded, 3 newly installed, 0 to remove and 4 not upgraded.\nNeed to get 2412 kB/6669 kB of archives.\nAfter this operation, 43.5 MB of additional disk space will be used.\nGet:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/universe gridengine-master amd64 6.2u5-7.3 [2412 
kB]\nPreconfiguring packages ...\nFetched 2412 kB in 0s (16.6 MB/s)\nSelecting previously unselected package gridengine-client.\n(Reading database ... 58612 files and directories currently installed.)\nPreparing to unpack .../gridengine-client_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-client (6.2u5-7.3) ...\nSelecting previously unselected package gridengine-exec.\nPreparing to unpack .../gridengine-exec_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-exec (6.2u5-7.3) ...\nSelecting previously unselected package gridengine-master.\nPreparing to unpack .../gridengine-master_6.2u5-7.3_amd64.deb ...\nUnpacking gridengine-master (6.2u5-7.3) ...\nProcessing triggers for man-db (2.6.7.1-1ubuntu1) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\nSetting up gridengine-client (6.2u5-7.3) ...\nSetting up gridengine-exec (6.2u5-7.3) ...\nSetting up gridengine-master (6.2u5-7.3) ...\nProcessing triggers for ureadahead (0.100.0-16) ...\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Suggested packages:", " gridengine-qmon", "The following NEW packages will be installed:", " gridengine-client gridengine-exec gridengine-master", "0 upgraded, 3 newly installed, 0 to remove and 4 not upgraded.", "Need to get 2412 kB/6669 kB of archives.", "After this operation, 43.5 MB of additional disk space will be used.", "Get:1 http://us-west-2.ec2.archive.ubuntu.com/ubuntu/ trusty/universe gridengine-master amd64 6.2u5-7.3 [2412 kB]", "Preconfiguring packages ...", "Fetched 2412 kB in 0s (16.6 MB/s)", "Selecting previously unselected package gridengine-client.", "(Reading database ... 
58612 files and directories currently installed.)", "Preparing to unpack .../gridengine-client_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-client (6.2u5-7.3) ...", "Selecting previously unselected package gridengine-exec.", "Preparing to unpack .../gridengine-exec_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-exec (6.2u5-7.3) ...", "Selecting previously unselected package gridengine-master.", "Preparing to unpack .../gridengine-master_6.2u5-7.3_amd64.deb ...", "Unpacking gridengine-master (6.2u5-7.3) ...", "Processing triggers for man-db (2.6.7.1-1ubuntu1) ...", "Processing triggers for ureadahead (0.100.0-16) ...", "Setting up gridengine-client (6.2u5-7.3) ...", "Setting up gridengine-exec (6.2u5-7.3) ...", "Setting up gridengine-master (6.2u5-7.3) ...", "Processing triggers for ureadahead (0.100.0-16) ..."]}
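
For anyone else digging through a long command.log, entries like this are easy to miss because the task itself is reported as `changed` rather than failed. A minimal sketch of a scanner follows. The function name and the pattern list are assumptions for illustration, taken from the stderr text above; nothing here is part of cumulus or HPCCloud.

```python
import re

# Substrings taken from the stderr in the excerpt above (an assumption about
# what is worth flagging; extend the tuple for other failure messages).
SUSPECT_PATTERNS = ("critical error", "daemonize error")


def find_suspect_lines(lines, patterns=SUSPECT_PATTERNS):
    """Return (line_number, pattern) pairs for lines containing a pattern.

    Only the first (leftmost) match on each line is reported. Plain substring
    matching is used instead of JSON parsing, so entries that were wrapped or
    truncated when pasted are still caught.
    """
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        match = regex.search(line)
        if match:
            hits.append((number, match.group(0).lower()))
    return hits
```

The function only iterates over lines, so it accepts an open file handle for `/var/log/celery/command.log` as well as a list of strings.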

TristanWright avatar May 20 '16 20:05 TristanWright

Yep, that looks like the root cause, but I would have expected the cluster to have moved into the error state ...

cjh1 avatar May 20 '16 20:05 cjh1

After fixing the issue that prevented the error state from transitioning to anything else, I haven't seen this. Closeable?
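
To illustrate the two behaviours discussed here (a failed step should move the cluster into the error state, and the error state should not be a dead end), here is a rough sketch. It is not the actual cumulus code: the transition table, the `created`, `running` and `terminated` state names, and the `run_step` helper are all hypothetical. Only `launching`, `provisioning`, `terminating` and `error` come from this thread.

```python
# Hypothetical transition table; not taken from the cumulus source.
ALLOWED = {
    "created": {"launching", "error"},
    "launching": {"provisioning", "terminating", "error"},
    "provisioning": {"running", "terminating", "error"},
    "running": {"terminating", "error"},
    "terminating": {"terminated", "error"},
    "error": {"terminating"},  # error must still allow cleanup
    "terminated": set(),
}


class InvalidTransition(Exception):
    """Raised when a state change is not in the ALLOWED table."""


def transition(cluster, new_state):
    """Move the cluster dict to new_state if the table permits it."""
    current = cluster["status"]
    if new_state not in ALLOWED[current]:
        raise InvalidTransition(f"{current} -> {new_state}")
    cluster["status"] = new_state
    return cluster


def run_step(cluster, step, success_state):
    """Run one step; any exception moves the cluster to 'error'."""
    try:
        step(cluster)
    except Exception as exc:
        cluster["error"] = str(exc)
        transition(cluster, "error")
    else:
        transition(cluster, success_state)
    return cluster
```

With a table like this, a provisioning failure such as the qmaster registration error above would leave the cluster in `error`, from which it could still be terminated.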

TristanWright avatar Sep 13 '16 22:09 TristanWright

I'm seeing this again. I have a few stuck in terminating, provisioning, or launching states.

screen shot 2017-03-07 at 11 12 49

In the case of "my new 52", the last one in that list, I can see the instances are still running in the console.
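
A quick way to list the affected clusters without eyeballing the UI might look like the sketch below. It assumes each cluster record carries `name`, `status` and `updated` (a `datetime`) fields, and that 30 minutes is a sensible cutoff. Those field names and the threshold are assumptions, not what the REST endpoint is known to return.

```python
from datetime import datetime, timedelta

# The three states reported as stuck in this thread.
TRANSITIONAL = {"launching", "provisioning", "terminating"}


def find_stuck(clusters, now, max_age=timedelta(minutes=30)):
    """Return names of clusters that sat in a transitional state too long.

    `clusters` is an iterable of dicts with hypothetical keys 'name',
    'status' and 'updated' (time of the last status change).
    """
    return [
        c["name"]
        for c in clusters
        if c["status"] in TRANSITIONAL and now - c["updated"] > max_age
    ]
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test against saved endpoint output.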

TristanWright avatar Mar 07 '17 18:03 TristanWright