dask-cloudprovider
AzureVMCluster workers unable to connect to scheduler
What happened:
I'm just getting started with dask-cloudprovider and am looking to spin up an AzureVMCluster to parallelise my Dask workloads. I followed the documentation to create the required infrastructure (resource group, network security group, vnet), except that, rather than allowing network traffic from the internet to ports 8786-8787, I restricted the source IP to that of my local machine.
Following the minimal example in the documentation:

```python
from dask_cloudprovider.azure import AzureVMCluster

cluster = AzureVMCluster(
    location=LOCATION,
    resource_group=RESOURCE_GROUP,
    vnet=VNET,
    security_group=SECURITY_GROUP,
    n_workers=1,
)
```
I find that I can successfully create the cluster and access the web dashboard (confirming that my network security group rule is working correctly). However, my workers do not seem to be able to connect to the scheduler. The only way I can get them to connect is by creating an additional rule which allows access from the internet, as in the original example (obviously undesirable).
In the network security group, the default rules which allow incoming traffic across the vnet are in place (which should cover the worker/scheduler connection). Adding a specific rule which allows traffic from the private IP range (10.0.0.0/24) doesn't help either.
(Strangely, if I create the "allow internet access" rule so that the workers can be discovered by the scheduler, and then remove the rule, the computation is still able to proceed, so I suspect something strange may be happening with worker discoverability?)
What you expected to happen: The workers should be able to connect to the scheduler without allowing unrestricted access to the vnet.
Environment:
- Dask cloud provider version: 2021.01.1
- Python version: 3.7
- Operating System: macOS 11
- Install method (conda, pip, source): pip
It's possible the workers are trying to communicate with the scheduler using a public IP address rather than an internal IP address. I am not as familiar with Azure, but this is the case for GCP and AWS. There is a `public_ingress` option which probably forces the dask cluster to be in the VPC.
Taking a step back, your desired configuration is a publicly addressable scheduler and dashboard with workers in the VPC, correct?
Hi @quasiben, thanks for taking the time to respond.
It's possible the workers are trying to communicate with the scheduler using a public IP address rather than an internal IP addr.
Yes, I think you might be right with that one.
There is a public_ingress option which probably forces the dask cluster to be in the VPC.
Looking at the documentation:

> public_ingress: bool
> Assign a public IP address to the scheduler. Default True.

Looks like this just creates the public IP – ~~I guess I could set this to False and assign the public IP to the scheduler manually through the Azure portal. I'll give that a try and see how I get on...~~ In hindsight that's not a great idea...
Taking a step back, your desired configuration is a publicly addressable scheduler and dashboard with workers in the VPC, correct ?
Yep – that's right. 👍
We have had similar discussions on the GCP side: https://github.com/dask/dask-cloudprovider/issues/229 https://github.com/dask/dask-cloudprovider/issues/215 .
And we track internal IPs for GCP: https://github.com/dask/dask-cloudprovider/blob/dffd81d3e4ab6feb5378ffa0d9952f3c7da207f7/dask_cloudprovider/gcp/instances.py#L302-L305
My guess is that workers should be connecting to the scheduler with the private IP regardless of the state of the scheduler. Would you be up for exploring the code and Azure and submitting a PR? It would also be good for @jacobtomlinson to weigh in once he is back.
Thanks for the context. I agree that it would be better to communicate via the private IP whenever possible. I'd be happy to take a look and see what I can do.
The `SpecCluster` class already has support for setting an `external_address` option for the scheduler `ProcessInterface`.

https://github.com/dask/distributed/blob/7146449173c0329463ca39b98ba51087278853d5/distributed/deploy/spec.py#L310-L313

The idea is that you set `self.address` to the private IP and `self.external_address` to the public IP on the scheduler object. Workers will connect on the private address, but the `SpecCluster` class (in this case `AzureVMCluster`) will connect on the external address. Any clients which are then created using the cluster object will also pick up the external address.
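A minimal sketch of that split (plain Python, not the actual `distributed` code; the class name and IPs below are made up purely for illustration):

```python
# Sketch of the address/external_address idea described above. This is NOT
# the real distributed.deploy.spec.ProcessInterface; it is a stand-in class
# with placeholder IPs, just to show which address is used where.

class SchedulerInterface:
    def __init__(self, private_ip, public_ip, port=8786):
        # Workers connect in-vnet on the private address ...
        self.address = f"tcp://{private_ip}:{port}"
        # ... while SpecCluster (and clients created from it) use this one.
        self.external_address = f"tcp://{public_ip}:{port}"

sched = SchedulerInterface("10.0.0.4", "52.186.0.1")
print(sched.address)           # tcp://10.0.0.4:8786
print(sched.external_address)  # tcp://52.186.0.1:8786
```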
I seem to have overlooked this functionality when implementing dask-cloudprovider, so perhaps we should review how things are implemented here. This may just resolve things for us. @quasiben did you look at this when implementing `GCPCluster`?
I would like to take a look at this, as I have the use case of a publicly addressable scheduler and dashboard with workers in a vnet.
Would a good way to proceed be to implement AzureVMScheduler and AzureVMWorker classes with start_scheduler and start_worker methods, like the GCP implementation, and make sure the internal IP is used for the workers there, @jacobtomlinson?
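As a rough sketch of what that split might look like (all class, method, and attribute names here are hypothetical, modelled loosely on the GCP implementation rather than taken from the actual dask-cloudprovider codebase; the IPs are placeholders):

```python
# Hypothetical sketch only: the names below mirror the GCP-style
# scheduler/worker split, not the real dask-cloudprovider API.

class AzureVMInstance:
    """Stand-in for shared VM-provisioning logic."""
    async def create_vm(self):
        # The real implementation would call the Azure SDK here.
        return "10.0.0.5", "52.186.0.2"  # (internal_ip, external_ip)

class AzureVMScheduler(AzureVMInstance):
    async def start_scheduler(self):
        internal_ip, external_ip = await self.create_vm()
        # Workers connect in-vnet on the internal IP ...
        self.internal_address = f"tcp://{internal_ip}:8786"
        # ... while the cluster object and clients use the public IP.
        self.external_address = f"tcp://{external_ip}:8786"

class AzureVMWorker(AzureVMInstance):
    def __init__(self, scheduler_internal_address):
        # The worker is handed the scheduler's *private* address.
        self.scheduler = scheduler_internal_address
```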