Add resize cluster function

Open jafreck opened this issue 8 years ago • 8 comments

It might only be possible to resize up.

jafreck avatar Oct 31 '17 16:10 jafreck

What is the best workaround to resize today?

stefangordon avatar Feb 05 '18 20:02 stefangordon

The current recommendation is to delete the cluster and create a new one. This will change soon, though; this feature is high on our priority list.

There is another option that requires some knowledge and use of Azure Batch. You can navigate to the Batch pool through the Portal or Batch Labs and change the cluster's autoscale formula. If you need more information on how to do this, let me know.
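A minimal sketch of what such a formula could look like, assuming the goal is simply to pin the pool to a new fixed size. `$TargetDedicatedNodes` and `$TargetLowPriorityNodes` are Azure Batch's service-defined autoscale variables; the helper name `build_fixed_size_formula` is hypothetical and not part of AZTK or the Batch SDK. The resulting string is what you would paste into the pool's autoscale formula field.

```python
def build_fixed_size_formula(dedicated: int, low_priority: int = 0) -> str:
    """Return an autoscale formula that pins a pool to a fixed node count.

    Hypothetical helper for illustration only; it just builds the formula text.
    """
    if dedicated < 0 or low_priority < 0:
        raise ValueError("node counts must be non-negative")
    # Each statement in a Batch autoscale formula ends with a semicolon.
    return (
        f"$TargetDedicatedNodes = {dedicated};\n"
        f"$TargetLowPriorityNodes = {low_priority};"
    )


if __name__ == "__main__":
    # Example: grow the cluster to 8 dedicated nodes.
    print(build_fixed_size_formula(8))
```

Only raise the targets this way; lowering them lets Batch choose which nodes to deallocate.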

jafreck avatar Feb 05 '18 21:02 jafreck

Thanks. Happy to resize via Batch; I just wasn't sure if there would be unintended consequences. I'll do some testing.

stefangordon avatar Feb 05 '18 21:02 stefangordon

Hi @stefangordon, I'd be interested in your findings. Were there any unintended consequences to resizing directly using Batch?

lachiemurray avatar Jun 19 '18 13:06 lachiemurray

Worked just fine.

stefangordon avatar Jun 19 '18 13:06 stefangordon

A word of caution here: the workaround of using Batch directly is only safe if you resize the cluster up (that is, add nodes to the cluster). If you resize down, the node running the Spark master process might be deleted, leaving your cluster in a bad state.

jafreck avatar Jun 19 '18 20:06 jafreck

Would the proper solution involve creating 2 Batch pools? One for driver/master and another one for workers/executors? Then they can use different machine sizes and workers can be scaled independently.

shtratos avatar Jun 19 '18 20:06 shtratos

@shtratos That is a cool but likely out-of-scope feature. We have done a POC and got that design to work, although there are a few caveats in that approach:

  1. a VNET becomes a requirement for all AZTK clusters
  2. cluster management becomes harder to debug
  3. cluster provisioning time increases or at least becomes more variable

At this time, we have no plans to implement AZTK clusters with multiple Batch pools.

The solution we are specifically targeting for this issue is a safe single-pool resize. We would not use Batch's autoscale for resizing down, but would instead delete individual nodes that are deemed safe (i.e. not the master, not running a driver).
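The selection step could be sketched roughly as follows. This is an illustration under stated assumptions, not AZTK's implementation: the `Node` type, its `is_master` and `runs_driver` flags, and `pick_nodes_to_remove` are all hypothetical names, and how a node's role is actually detected is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node; the fields are assumptions."""
    node_id: str
    is_master: bool = False    # runs the Spark master process
    runs_driver: bool = False  # currently hosts a Spark driver


def pick_nodes_to_remove(nodes: List[Node], target_size: int) -> List[str]:
    """Return IDs of nodes that are safe to delete when shrinking to target_size.

    Nodes running the master or a driver are never selected, so fewer IDs
    than requested may come back if too few safe nodes exist.
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1 (the master node)")
    excess = len(nodes) - target_size
    if excess <= 0:
        return []  # nothing to remove: the cluster is already small enough
    safe = [n.node_id for n in nodes if not n.is_master and not n.runs_driver]
    return safe[:excess]


if __name__ == "__main__":
    cluster = [
        Node("n0", is_master=True),
        Node("n1", runs_driver=True),
        Node("n2"),
        Node("n3"),
    ]
    print(pick_nodes_to_remove(cluster, target_size=3))
```

The returned IDs would then be handed to Batch's node-removal operation, rather than lowering the autoscale target and letting Batch pick the nodes.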

jafreck avatar Jun 19 '18 21:06 jafreck