
incorrect ordering with nodeTypes and minNodes or maxNodes

Open tpesout opened this issue 8 years ago • 4 comments

I noticed that when I used multiple node types (--nodeTypes) together with --minNodes and --maxNodes, the provisioner paired the counts with the node types in reverse order. I ran:

toil-marginphase run --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge,m4.2xlarge --nodeStorage 500 --maxNodes 10,3 --minNodes 10,3 aws:us-west-2:marginphase-oct17-6

...and found that 10 m4.2xlarge and 3 m4.4xlarge nodes were launched, whereas I expected the reverse based on the ordering in the --nodeTypes parameter.

I got the desired 10 4xl and 3 2xl instances when I ran this:

toil-marginphase run --restart --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge,m4.2xlarge --nodeStorage 500 --maxNodes 3,10 --minNodes 3,10 aws:us-west-2:marginphase-oct17-6

I suspect that it has to do with the ordering of the minNodes parameter in the code (as the log said "raising minimum number of instances to 10"), but it might be in the maxNodes (or it might be something else).

Sorry I don't have time to investigate this more fully. If I do get a chance to, I'll update this Issue.

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-211

tpesout, Oct 19 '17 23:10

@adderan, it looks like minNodes and maxNodes should be keyed by nodeTypes rather than nodeShapes in clusterScaler.py. Does that seem right?

ejacox, Oct 24 '17 13:10
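To make the suspected failure mode concrete, here is a minimal sketch of how keying the counts by a reordered list of shapes, rather than by the node types in the order given, could swap the numbers. This is not Toil's actual code: the function names, the `MEMORY_GIB` table, and the smallest-first sort are all assumptions made for illustration.

```python
# Hypothetical illustration only; this is NOT Toil's clusterScaler.py.
from typing import Dict, List

# Memory sizes (GiB), used here only to give the two types a sort order.
MEMORY_GIB = {"m4.4xlarge": 64, "m4.2xlarge": 32}


def parse_list(value: str) -> List[str]:
    """Split a comma-separated option value into its items."""
    return [item.strip() for item in value.split(",")]


def pair_by_type(node_types: str, counts: str) -> Dict[str, int]:
    """Pair each count with the node type at the same position."""
    types = parse_list(node_types)
    numbers = [int(n) for n in parse_list(counts)]
    if len(types) != len(numbers):
        raise ValueError("need exactly one count per node type")
    return dict(zip(types, numbers))


def pair_by_sorted_shape(node_types: str, counts: str) -> Dict[str, int]:
    """Assumed buggy variant: the shapes are re-sorted, the counts are not."""
    shapes = sorted(parse_list(node_types), key=MEMORY_GIB.__getitem__)
    numbers = [int(n) for n in parse_list(counts)]
    return dict(zip(shapes, numbers))


expected = pair_by_type("m4.4xlarge,m4.2xlarge", "10,3")
observed = pair_by_sorted_shape("m4.4xlarge,m4.2xlarge", "10,3")
print(expected)  # {'m4.4xlarge': 10, 'm4.2xlarge': 3}
print(observed)  # {'m4.2xlarge': 10, 'm4.4xlarge': 3}
```

Under these assumptions the second mapping matches what was reported above (10 m4.2xlarge and 3 m4.4xlarge). Whether the real code reorders shapes this way would need to be confirmed against clusterScaler.py.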

➤ Adam Novak commented:

It could be that the parsing code is getting this wrong, or it could be that the Toil-integrated autoscaler is getting the right inputs but making the wrong scaling decisions and launching the wrong nodes.

unito-bot, Jun 04 '24 17:06

➤ Adam Novak commented:

We don’t think this will affect launch-cluster since it takes a -w option with number ranges instead and hands those to the provisioner.

unito-bot, Jun 04 '24 17:06

➤ Adam Novak commented:

We’re hoping to deprecate the Toil-integrated autoscaler in favor of the Kubernetes-based one we can build straight into the cluster, so we might not ever actually get around to solving this.

unito-bot, Jun 04 '24 17:06