Add resize cluster function
might only be possible to resize up
What is the best workaround to resize today?
The current recommendation is to delete the cluster and create a new one. This will change soon, though; this feature is high on our priority list.
There is another option that requires some knowledge and use of Azure Batch. You can navigate to the Batch pool through the Portal or Batch Labs and change the cluster's autoscale formula. If you need more information on how to do this, let me know.
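For reference, a minimal sketch of what that autoscale change might look like. The helper below is hypothetical (not part of AZTK or the Batch SDK); `$TargetDedicatedNodes` is the Batch autoscale variable that controls pool size, and the formula string would be applied through the azure-batch client:

```python
# Sketch: build a Batch autoscale formula that pins the pool at a fixed,
# larger node count. Only scaling *up* is safe for AZTK clusters (see the
# caution below), so this illustrative helper refuses a smaller target.

def scale_up_formula(current_nodes: int, target_nodes: int) -> str:
    """Return a Batch autoscale formula string for a fixed target size."""
    if target_nodes < current_nodes:
        raise ValueError("resizing down can delete the Spark master node")
    return f"$TargetDedicatedNodes = {target_nodes};"

formula = scale_up_formula(current_nodes=4, target_nodes=8)
print(formula)  # $TargetDedicatedNodes = 8;

# With the azure-batch Python SDK, the formula would then be applied
# roughly like this (pool_id and client setup omitted):
#   batch_client.pool.enable_auto_scale(pool_id, auto_scale_formula=formula)
```

You can achieve the same effect interactively by editing the pool's autoscale formula in the Portal or Batch Labs, as described above.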
Thanks - happy to resize via batch just wasn't sure if there would be unintended consequences. I'll do some testing.
Hi @stefangordon, I'd be interested in your findings. Were there any unintended consequences to resizing directly using Batch?
Worked just fine.
A word of caution here: the workaround of using Batch directly is only safe if you resize the cluster up (i.e. add additional nodes to the cluster). If you resize down, the node running the Spark master process might be deleted, leaving your cluster in a bad state.
Would the proper solution involve creating 2 Batch pools? One for driver/master and another one for workers/executors? Then they can use different machine sizes and workers can be scaled independently.
@shtratos That is a cool, but likely out-of-scope, feature. We have done a POC and got that design to work, although there are a few caveats to that approach:
- a VNET becomes a requirement for all AZTK clusters
- cluster management becomes harder to debug
- cluster provisioning time increases or at least becomes more variable
At this time, we have no plans to implement AZTK clusters with multiple Batch pools.
The solution we are specifically targeting for this issue is a safe single-pool resize. We would not use Batch's autoscale for resizing down; instead, we would delete individual nodes that are deemed safe (i.e. not the master and not running a driver).
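A rough sketch of that node-selection logic. The function and field names here are hypothetical, not AZTK's actual implementation; the real code would query Batch for node and task state, and the removal itself would go through the Batch API:

```python
# Sketch of safe scale-down: choose only nodes that are neither the Spark
# master nor currently running a driver. Illustrative only.

def pick_removable_nodes(node_ids, master_id, driver_node_ids, count):
    """Return up to `count` node ids that are safe to delete."""
    protected = {master_id} | set(driver_node_ids)
    safe = [n for n in node_ids if n not in protected]
    if len(safe) < count:
        raise ValueError("not enough idle nodes to shrink by that amount")
    return safe[:count]

nodes = ["tvm-1", "tvm-2", "tvm-3", "tvm-4"]
to_remove = pick_removable_nodes(nodes, master_id="tvm-1",
                                 driver_node_ids=["tvm-3"], count=2)
print(to_remove)  # ['tvm-2', 'tvm-4']

# The actual removal would then use the azure-batch SDK, roughly:
#   batch_client.pool.remove_nodes(
#       pool_id, NodeRemoveParameter(node_list=to_remove))
```

Deleting specific nodes this way avoids the failure mode above, where Batch's own autoscale may pick the master node when shrinking the pool.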