toil
incorrect ordering with nodeTypes and minNodes or maxNodes
I noticed that when I passed multiple node types to --nodeTypes together with --minNodes and --maxNodes, the provisioner reversed the ordering of the counts. I ran:
toil-marginphase run --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge,m4.2xlarge --nodeStorage 500 --maxNodes 10,3 --minNodes 10,3 aws:us-west-2:marginphase-oct17-6
...and found that 10 m4.2xlarge and 3 m4.4xlarge nodes were launched, whereas I expected the reverse based on the ordering in the --nodeTypes parameter.
I got the desired 10 4xl and 3 2xl instances when I ran this:
toil-marginphase run --restart --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge,m4.2xlarge --nodeStorage 500 --maxNodes 3,10 --minNodes 3,10 aws:us-west-2:marginphase-oct17-6
I suspect that it has to do with the ordering of the minNodes parameter in the code (as the log said "raising minimum number of instances to 10"), but it might be in the maxNodes (or it might be something else).
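For reference, here is a minimal sketch of the pairing I expected, where the i-th count applies to the i-th node type. This is not Toil's parsing code; `pair_counts_with_types` is a hypothetical helper written only to illustrate the expected behavior.

```python
def pair_counts_with_types(node_types, min_nodes, max_nodes):
    """Pair comma-separated counts with node types by position.

    Hypothetical helper, not part of Toil: it only illustrates the
    mapping the command-line ordering suggests.
    """
    types = node_types.split(",")
    mins = [int(n) for n in min_nodes.split(",")]
    maxes = [int(n) for n in max_nodes.split(",")]
    if not (len(types) == len(mins) == len(maxes)):
        raise ValueError("nodeTypes, minNodes and maxNodes must have the same length")
    # The i-th node type gets the i-th (min, max) pair.
    return {t: (lo, hi) for t, lo, hi in zip(types, mins, maxes)}


# Values taken from the first command above.
expected = pair_counts_with_types("m4.4xlarge,m4.2xlarge", "10,3", "10,3")
print(expected)
```

With the first command, this pairing gives 10 m4.4xlarge and 3 m4.2xlarge nodes, which is the opposite of what was actually launched.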
Sorry I don't have time to investigate this more fully. If I do get a chance to, I'll update this Issue.
Issue is synchronized with this Jira Story. Issue Number: TOIL-211
@adderan, it looks like minNodes and maxNodes should be keyed by nodeTypes rather than nodeShapes in clusterScaler.py. Does that seem right?
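To illustrate the suspicion, here is a rough sketch of how keying the counts by shape instead of by node type could swap them. It is an assumption about the failure mode, not a copy of clusterScaler.py: the `Shape` tuple, the `SHAPES` table, and the sorting step are all hypothetical. The core and memory figures are the published m4 instance sizes.

```python
from collections import namedtuple

# Hypothetical stand-in for a node shape; not Toil's actual class.
Shape = namedtuple("Shape", ["cores", "memory_gib"])

SHAPES = {
    "m4.4xlarge": Shape(16, 64),
    "m4.2xlarge": Shape(8, 32),
}

node_types = ["m4.4xlarge", "m4.2xlarge"]  # order given on the command line
max_nodes = [10, 3]                        # same order as node_types

# Keyed by node type: counts follow the command-line order.
by_type = dict(zip(node_types, max_nodes))

# Keyed by shape, assuming the shapes get sorted smallest-first before
# being paired with the counts: the smaller m4.2xlarge now comes first
# and picks up the count meant for m4.4xlarge.
sorted_shapes = sorted(SHAPES[t] for t in node_types)
by_shape = dict(zip(sorted_shapes, max_nodes))

shape_to_type = {shape: name for name, shape in SHAPES.items()}
observed = {shape_to_type[s]: n for s, n in by_shape.items()}

print(by_type)
print(observed)
```

Under these assumptions, `observed` matches what was reported (10 m4.2xlarge, 3 m4.4xlarge), while `by_type` matches what was requested.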
➤ Adam Novak commented:
It could be that the parsing code is getting this wrong, or it could be that the Toil-integrated autoscaler is getting the right inputs but making the wrong scaling decisions and launching the wrong nodes.
➤ Adam Novak commented:
We don’t think this will affect launch-cluster since it takes a -w option with number ranges instead and hands those to the provisioner.
➤ Adam Novak commented:
We’re hoping to deprecate the Toil-integrated autoscaler in favor of the Kubernetes-based one we can build straight into the cluster, so we might not ever actually get around to solving this.