
Replace Mesos-based automatically-deployed clusters and Toil-integrated scaling with Kubernetes-based automatically-deployed clusters running some kind of stock autoscaler

Open • adamnovak opened this issue 5 years ago • 1 comment

We've been kind of stuck on https://github.com/DataBiosphere/toil/issues/2459, https://github.com/DataBiosphere/toil/issues/2460, and https://github.com/DataBiosphere/toil/issues/2461, and we had to revert https://github.com/DataBiosphere/toil/pull/2715, all because of problems getting Mesos to work correctly on Ubuntu 18.04 for the appliance.

With Ubuntu 20.04 out this month, we're getting worryingly behind. With Apache only shipping source tarballs and RPMs, with their last release in September, and the semi-official Mesosphere Dockers pretty far behind Apache (and also Mesosphere pivoting to become "d2iq"), it might be time to jump ship from Mesos to the wildly popular (and now passably supported by Toil) Kubernetes scheduler instead.

To complete the migration, we would have to adapt Toil's cluster launcher to provision and tear down Kubernetes clusters instead of Mesos ones. A fixed-size cluster would be easy, but we also want the cluster to autoscale, ideally with a stock Kubernetes horizontal autoscaler so we don't have to maintain our own anymore. Said autoscaler would need to know how to bid on the AWS spot market, and to pass a preemptible flag through to some kind of node label that the Kubernetes batch system can key on when scheduling preemptible/nonpreemptible Toil jobs.
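To make the node label idea concrete, here is a rough sketch of how a batch system might turn a job's preemptible flag into Kubernetes scheduling constraints. It builds plain dictionaries shaped like pod spec fragments rather than calling any client library. The label/taint key `example.org/preemptible` and the function name `scheduling_constraints` are made up for illustration; the real key would depend on whatever the chosen autoscaler puts on spot nodes, and this assumes spot nodes are both labeled and tainted with that key.

```python
from typing import Any, Dict

# Hypothetical label/taint key. The real key depends on the autoscaler or
# provisioner that creates the spot nodes; this one is only an assumption.
PREEMPTIBLE_KEY = "example.org/preemptible"


def scheduling_constraints(job_preemptible: bool) -> Dict[str, Any]:
    """Return a pod spec fragment for a preemptible or nonpreemptible job.

    Assumes spot nodes carry the label PREEMPTIBLE_KEY=true and a matching
    NoSchedule taint, so that pods land on them only if they opt in.
    """
    if job_preemptible:
        # Preemptible jobs may run anywhere, so they only need to tolerate
        # the taint that keeps other pods off spot nodes.
        return {
            "tolerations": [
                {
                    "key": PREEMPTIBLE_KEY,
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            ]
        }
    # Nonpreemptible jobs must stay off spot nodes. NotIn also matches
    # nodes that do not have the label at all.
    return {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": PREEMPTIBLE_KEY,
                                    "operator": "NotIn",
                                    "values": ["true"],
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
```

The returned fragment would be merged into the pod spec the batch system already builds for each job. Using a taint plus toleration for the preemptible case, and node affinity for the nonpreemptible case, is just one possible design; a simple `nodeSelector` would also work if jobs should be pinned to one kind of node.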

So far, I've looked at Canonical's Charmed Kubernetes, which uses their Juju tool to deploy a Kubernetes cluster on any of a number of cloud providers. I'm not sure how we would bolt autoscaling and spot market support onto that, or how we would make the Snap-only Juju tool installable as a toil[kubernetes] dependency, but it might solve some of our provisioning problem.

┆Issue is synchronized with this Jira Story ┆Epic: jira:ucsc-cgl.atlassian.net:33260 ┆Issue Number: TOIL-542

adamnovak Apr 15 '20 18:04

@adamnovak I agree we're getting behind, with Mesos stuck on Ubuntu 16 and CoreOS getting dropped.

I took a look at Charmed Kubernetes. The interface looks nice. It had a 20-30 minute startup time with Juju, though, which might not be ideal. I'm not sure whether a 20-30 minute spin-up is typical of all Kubernetes cluster startups?

It looks like we would pay Canonical to buy, set up, and maintain a Kubernetes cluster for us in whatever cloud we ask for (or on their premises), and we would just go in with the kubeconfig and run jobs; at least, that is what their information seemed to say. They seem to have restrictions on how things route and on choice of software (I guess they like "flannel"), which might be a positive thing if they're doing things better. I suspect we could achieve the same thing by running Terraform, though. It doesn't have the nice browser interface, but it costs less and is more configurable.

If we switch and leave Mesos in the dust, would there be a good way to launch a hello world Toil job in AWS without the extra 20-30 minute overhead, for users who don't have access to an already-running Kubernetes cluster?

Ubuntu 20.04 is coming out tomorrow now. I really hope Wayland has improved. >.< Guess we'll see when I reformat.

DailyDreaming Apr 22 '20 20:04