
[ENH] - Add ability to spinup dask workers in a single AZ (AWS)

Open aktech opened this issue 3 years ago • 3 comments

Feature description

Ability to spin up Dask workers in a single Availability Zone (AZ) in AWS.

Value and/or benefit

When running data-intensive tasks on Dask workers, the workers are often spun up across several Availability Zones (AZs). This can cause a lot of data transfer between AZs, which is costly.

Having this ability would make spinning up a large number of Dask workers much more cost-efficient.

Anything else?

No response

aktech • Aug 03 '22 14:08

Is this related to/fixed by:

https://docs.qhub.dev/en/stable/source/admin_guide/faq.html?highlight=availability#on-aws-why-do-user-instances-occasionally-die-30-minutes-after-spinning-up-a-large-dask-cluster

dharhas • Aug 16 '22 13:08

@dharhas I believe that FAQ addresses a different issue. I tried making the suggested change, and new nodes are still split between the two AZs.

From my perspective, there is a short-term workaround and a long-term solution; the latter would likely require changing how we create AWS node-groups.

Short-term solution

Remove all but one of the network subnets from the associated Auto Scaling group.

  • To perform this action, on the AWS console, navigate to EC2 > Auto Scaling Groups and select the appropriate Auto Scaling group.
  • Under Network, remove all but one subnet. This forces all new nodes to spin up in that subnet (and consequently in only one AZ).

This workaround has the drawback that the associated node-group will raise a "Health Issue":

  • AutoScalingGroupInvalidConfiguration - it wants two subnets in separate AZs
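
For anyone who prefers to script the console steps above, here is a minimal sketch using boto3. It assumes boto3 is installed and AWS credentials are configured; the function names and the `keep` and `dry_run` parameters are made up for illustration and are not part of Nebari. It only changes where new nodes are placed, and it triggers the same "Health Issue" described above.

```python
from typing import List


def pick_single_subnet(vpc_zone_identifier: str, keep: str = "") -> str:
    """Return one subnet ID from a comma-separated VPCZoneIdentifier string.

    If `keep` is given it must already be attached to the group;
    otherwise the first listed subnet is used.
    """
    subnets: List[str] = [s.strip() for s in vpc_zone_identifier.split(",") if s.strip()]
    if not subnets:
        raise ValueError("the Auto Scaling group has no subnets configured")
    if keep:
        if keep not in subnets:
            raise ValueError(f"{keep} is not attached to this Auto Scaling group")
        return keep
    return subnets[0]


def restrict_asg_to_one_subnet(asg_name: str, keep: str = "", dry_run: bool = True) -> str:
    """Limit an Auto Scaling group to one subnet (and therefore one AZ).

    With dry_run=True (the default) nothing is modified; the subnet that
    would be kept is only returned.
    """
    import boto3  # imported lazily so pick_single_subnet works without boto3

    client = boto3.client("autoscaling")
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = response["AutoScalingGroups"]
    if not groups:
        raise LookupError(f"no Auto Scaling group named {asg_name!r}")

    subnet = pick_single_subnet(groups[0]["VPCZoneIdentifier"], keep)
    if not dry_run:
        # VPCZoneIdentifier takes a comma-separated list; passing a single
        # ID leaves the group with exactly one subnet.
        client.update_auto_scaling_group(
            AutoScalingGroupName=asg_name,
            VPCZoneIdentifier=subnet,
        )
    return subnet
```

Calling `restrict_asg_to_one_subnet("<your-asg-name>")` first with the default `dry_run=True` shows which subnet would be kept before anything is changed.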

Long-term solution

I believe the long-term solution is to have an option to force the node-group to run in a single subnet (i.e., a single AZ). An initial attempt at this solution can be found on the aws_single_subnet branch.

iameskild • Sep 08 '22 01:09

I tested the "long-term solution" (on branch aws_single_subnet) and, from what I can tell, all of the nodes in the worker node-group spawned in a single AZ (provided that the key single_subnet = true was set in the node-group section). It's probably worth testing this a little more to ensure there are no unintended consequences.
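
One way to repeat this check is to count nodes per zone from the output of `kubectl get nodes -o json`. The sketch below is not part of the branch: the function name is made up for illustration, and it assumes nodes carry the standard `topology.kubernetes.io/zone` label.

```python
from collections import Counter
from typing import Dict, Optional

# Well-known Kubernetes label that records the zone a node runs in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zones_for_nodes(node_list: dict, label_selector: Optional[Dict[str, str]] = None) -> Counter:
    """Count nodes per zone in a parsed `kubectl get nodes -o json` document.

    `label_selector` optionally restricts the count to nodes whose labels
    match every given key/value pair (for example, one node group).
    """
    selector = label_selector or {}
    counts: Counter = Counter()
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        if all(labels.get(key) == value for key, value in selector.items()):
            counts[labels.get(ZONE_LABEL, "unknown")] += 1
    return counts
```

After saving the output with `kubectl get nodes -o json > nodes.json`, load it with `json.load` and pass the result to `zones_for_nodes`. If the worker node-group is confined to one AZ, the returned counter has a single key. The selector depends on how your nodes are labeled; for EKS managed node groups, `{"eks.amazonaws.com/nodegroup": "<node-group-name>"}` is one possibility, which is an assumption about the deployment.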

iameskild avatar Sep 08 '22 03:09 iameskild