toil
Cluster scaler terminates entirely on a spot request failure
Currently the AWS provisioner will terminate the entire workflow if it hits the spot request limit:
Exception in thread preemptable-scaler:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
self.tryRun( )
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 462, in tryRun
preemptable=self.preemptable)
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/abstractProvisioner.py", line 177, in setNodeCount
preemptable=preemptable)
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 581, in _addNodes
tentative=True)
File "/usr/local/lib/python2.7/dist-packages/cgcloud/lib/ec2.py", line 356, in create_spot_instances
requests = ec2.request_spot_instances( price, image_id, count=num_instances, **spec )
File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1638, in request_spot_instances
verb='POST')
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1186, in get_list
raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>
EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 292, in check
scalerThread.join(timeout=0)
File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
self.tryRun( )
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 462, in tryRun
preemptable=self.preemptable)
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/abstractProvisioner.py", line 177, in setNodeCount
preemptable=preemptable)
File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 581, in _addNodes
tentative=True)
File "/usr/local/lib/python2.7/dist-packages/cgcloud/lib/ec2.py", line 356, in create_spot_instances
requests = ec2.request_spot_instances( price, image_id, count=num_instances, **spec )
File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1638, in request_spot_instances
verb='POST')
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1186, in get_list
raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>MaxSpotInstanceCountExceeded</Code><Message>Max spot instance count exceeded</Message></Error></Errors><RequestID>e0d171f5-f10d-4319-b3c3-6571fb2a9462</RequestID></Response>
Waiting for workers to shutdown
Forcing provisioner to reduce cluster size to zero.
I'd suggest that we instead log a warning, (possibly) decrease the number of requested instances, and keep trying without killing the workflow. Users can easily exceed their limit without realizing it, especially if they share an AWS account or have a new one. Unfortunately I can't submit a patch for this, since I have no way to test it: my spot limit is 0 thanks to the AWS account reshuffle (not that I'm bitter about that :)).
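To make the suggestion concrete, here is a rough, untested sketch of the "warn, request fewer, keep going" behaviour. It is not Toil's actual code: `SpotLimitExceeded`, `request_with_backoff`, and `request_fn` are hypothetical names, and the exception class stands in for an `EC2ResponseError` carrying the `MaxSpotInstanceCountExceeded` code.

```python
import logging

log = logging.getLogger(__name__)


class SpotLimitExceeded(Exception):
    """Hypothetical stand-in for an EC2ResponseError whose error code is
    MaxSpotInstanceCountExceeded."""


def request_with_backoff(request_fn, num_instances, min_instances=1):
    """Call request_fn(count), halving count each time the spot limit is hit.

    Returns the number of instances actually requested, or 0 if even the
    smallest allowed request was rejected. Never raises on a limit error,
    so the scaler thread (and the workflow) stays alive.
    """
    floor = max(1, min_instances)  # guard against looping forever at zero
    count = num_instances
    while count >= floor:
        try:
            request_fn(count)
            return count
        except SpotLimitExceeded:
            log.warning("Spot instance limit hit while requesting %d nodes; "
                        "retrying with fewer.", count)
            count //= 2
    log.warning("Could not request any spot instances; "
                "leaving it to the next scaling pass.")
    return 0
```

Halving is an arbitrary choice here; the point is only that a limit error gets logged and the request shrinks instead of the exception escaping the scaler thread.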
Issue is synchronized with this Jira Story. Issue Number: TOIL-169
When this error occurs, Toil does not stop the running cluster nodes (see #2196). That makes this bug extremely dangerous.
➤ Melaina Legaspi commented:
Marking this ticket as low priority; we haven't addressed this in many years.
➤ Melaina Legaspi commented:
Adam Novak: "This needs to be reproduced and the best approach would be to mock the spot market."
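One possible shape for such a mock, as a minimal sketch only: it uses `unittest.mock` from Python 3 (on the Python 2.7 setup in the traceback, the separate `mock` package would be needed). `FakeSpotLimitError`, `add_nodes_tolerantly`, and the AMI ID are hypothetical; only the `request_spot_instances(price, image_id, count=...)` call shape is taken from the traceback above.

```python
from unittest import mock


class FakeSpotLimitError(Exception):
    """Hypothetical stand-in for boto's EC2ResponseError."""
    error_code = "MaxSpotInstanceCountExceeded"


def add_nodes_tolerantly(ec2, price, image_id, num_instances):
    """Hypothetical scaler step: return the spot requests that were made,
    or an empty list if the spot limit was hit."""
    try:
        return ec2.request_spot_instances(price, image_id, count=num_instances)
    except FakeSpotLimitError:
        return []


# A mocked EC2 connection whose spot market rejects every request.
fake_ec2 = mock.Mock()
fake_ec2.request_spot_instances.side_effect = FakeSpotLimitError()

# The call should come back empty instead of raising.
result = add_nodes_tolerantly(fake_ec2, "0.10", "ami-hypothetical", 5)
```

A real test would presumably patch the connection used by the provisioner rather than pass one in, but the same `side_effect` trick should reproduce the failure without needing an account at its spot limit.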