StarCluster
eventual consistency race condition bug during addnodes
starcluster addnode acluster2 -n 35
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to [email protected]

Launching node(s): node022, node023, node024, node025, node026, node027, node028, node029, node030, node031, node032, node033, node034, node035, node036, node037, node038, node039, node040, node041, node042, node043, node044, node045, node046, node047, node048, node049, node050, node051, node052, node053, node054, node055, node056
Reservation:r-f3411f17
Waiting for instances to propagate...
35/35 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
!!! ERROR - InvalidInstanceID.NotFound: The instance IDs 'i-10e7c1bd, i-0ae7c1a7, i-01e7c1ac, i-37e7c19a, i-12e7c1bf, i-00e7c1ad, i-13e7c1be, i-36e7c19b, i-35e7c198, i-34e7c199, i-1de7c1b0, i-1ce7c1b1, i-03e7c1ae, i-02e7c1af, i-1fe7c1b2, i-1ee7c1b3, i-30e7c19d, i-23e7c18e, i-0de7c1a0, i-0ce7c1a1, i-22e7c18f, i-19e7c1b4, i-0fe7c1a2, i-0ee7c1a3, i-09e7c1a4, i-08e7c1a5, i-0be7c1a6, i-17e7c1ba, i-14e7c1b9, i-11e7c1bc, i-16e7c1bb, i-1be7c1b6, i-18e7c1b5, i-15e7c1b8, i-1ae7c1b7' do not exist
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cli.py", line 274, in main
    sc.execute(args)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/commands/addnode.py", line 128, in execute
    no_create=self.opts.no_create)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 189, in add_nodes
    no_create=no_create)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 1037, in add_nodes
    self.ec2.wait_for_propagation(instances=resp[0].instances)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 602, in wait_for_propagation
    'instances', max_retries=max_retries, interval=interval)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 562, in _wait_for_propagation
    reqs = fetch_func(filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 746, in get_all_instances
    filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 575, in get_all_instances
    max_results=max_results)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 656, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1141, in get_list
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance IDs 'i-10e7c1bd, i-0ae7c1a7, i-01e7c1ac, i-37e7c19a, i-12e7c1bf, i-00e7c1ad, i-13e7c1be, i-36e7c19b, i-35e7c198, i-34e7c199, i-1de7c1b0, i-1ce7c1b1, i-03e7c1ae, i-02e7c1af, i-1fe7c1b2, i-1ee7c1b3, i-30e7c19d, i-23e7c18e, i-0de7c1a0, i-0ce7c1a1, i-22e7c18f, i-19e7c1b4, i-0fe7c1a2, i-0ee7c1a3, i-09e7c1a4, i-08e7c1a5, i-0be7c1a6, i-17e7c1ba, i-14e7c1b9, i-11e7c1bc, i-16e7c1bb, i-1be7c1b6, i-18e7c1b5, i-15e7c1b8, i-1ae7c1b7' do not exist</Message></Error></Errors><RequestID>208987ad-2c09-4083-8e40-d070abb70508</RequestID></Response>
I've been bitten by this more than once.
Yes, it can be resolved by manually running addnode -x -a node022,node023,...node056 pbcorncluster2, but that's tedious and error-prone.
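To take some of the tedium out of that manual addnode -x -a step, the alias list can be generated instead of typed. This is only a sketch: `node_aliases` is a hypothetical helper (not part of StarCluster), and it assumes the zero-padded nodeNNN naming seen in the log above.

```python
def node_aliases(first, last, width=3, prefix="node"):
    """Build a comma-separated alias list such as node022,node023,node024.

    Hypothetical helper: assumes aliases are `prefix` followed by a
    zero-padded number, as in the addnode output above.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    # zfill pads the number with leading zeros up to `width` digits
    return ",".join(prefix + str(n).zfill(width) for n in range(first, last + 1))
```

For the failed launch above, `node_aliases(22, 56)` gives the 35 aliases from node022 through node056, which can be pasted after -a in the addnode -x command.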
https://github.com/jtriley/StarCluster/pull/463/files
@cariaso This is a pretty heavy PR, but it improves node addition as well as fixing that bug.
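For anyone who wants the gist of the race without reading the whole PR: the instance IDs come back from the launch call before the describe call can see them, so the first lookup can fail with InvalidInstanceID.NotFound even though the instances exist. One common way to tolerate that is to treat the error as "not visible yet" and retry. The sketch below is not taken from the PR or from StarCluster. `NotYetVisible` and `wait_for_visibility` are hypothetical names, and translating boto's EC2ResponseError (with the InvalidInstanceID.NotFound code) into `NotYetVisible` is left to the caller.

```python
import time


class NotYetVisible(Exception):
    """Hypothetical stand-in for an InvalidInstanceID.NotFound style error."""


def wait_for_visibility(fetch, max_retries=60, interval=5, sleep=time.sleep):
    """Call fetch() until it stops raising NotYetVisible, then return its result.

    fetch       -- zero-argument callable that performs the lookup
    max_retries -- number of retries allowed after the first attempt
    interval    -- seconds to wait between attempts
    sleep       -- injectable for testing; defaults to time.sleep

    Re-raises the last NotYetVisible once the retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except NotYetVisible:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            sleep(interval)
```

Any other exception propagates immediately, so real failures (bad credentials, malformed requests) are not hidden behind the retry loop.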