
Unable to initialize cluster

Open aforadi opened this issue 2 years ago • 2 comments

Hi,

Thank you for this library. We are trying to get this working from the example code in an interactive environment in Azure ML. The Jupyter notebook is a Python 3.8 Azure ML notebook.

from azureml.core import Workspace, Run, Environment
from ray_on_aml.core import Ray_On_AML
ws = Workspace.from_config()
ray_on_aml = Ray_On_AML(ws=ws, compute_cluster='ray-test', additional_pip_packages=['lightgbm_ray', 'sklearn'], maxnode=4)
ray = ray_on_aml.getRay(ci_is_head=False)

The image builds correctly on Azure ML. However, we receive the following error in the notebook.

Cancel active AML runs if any
Shutting down ray if any
Found existing cluster ray-test
Waiting cluster to start and return head node ip
..............................................................................................Cluster startup failed, check detail at run

And the following error inside the experiment:

Traceback (most recent call last):
  File "source_file.py", line 103, in <module>
    startRayMaster()
  File "source_file.py", line 31, in startRayMaster
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -3] Temporary failure in name resolution

This error occurs whether ci_is_head is set to True or False.
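
For context, the failing call in the traceback is socket.gethostbyname(socket.gethostname()), which raises socket.gaierror when the node's own hostname has no DNS or hosts-file entry. The following is a minimal sketch of a fallback, not the library's code: resolve_node_ip and the probe address are hypothetical names and values chosen for illustration. It tries the hostname lookup first and, if that fails, asks the OS which local address it would use for an outbound UDP socket.

```python
import socket


def resolve_node_ip(probe_addr=("10.255.255.255", 1)):
    """Return this node's IPv4 address as a string.

    Hypothetical helper, not part of ray-on-aml. probe_addr is an
    arbitrary address used only to pick an outbound interface.
    """
    try:
        # Same call as in the traceback; fails if the hostname is unresolvable.
        return socket.gethostbyname(socket.gethostname())
    except socket.gaierror:
        pass

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only selects a route.
        sock.connect(probe_addr)
        return sock.getsockname()[0]
    except OSError:
        # No usable route at all: fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


print(resolve_node_ip())
```

If the first branch fails on a cluster node, that points at name resolution inside the VNET (for example a custom DNS server that does not know the node's hostname) rather than at the notebook code.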

All machines are inside the same VNET.

Please let us know whether something is wrong with our setup or whether this is an issue with the library.

Thanks a lot!

aforadi avatar May 19 '22 10:05 aforadi

@james-tn any support would be helpful. Thanks!

aforadi avatar Jun 09 '22 14:06 aforadi

Hi, the library has moved to https://github.com/microsoft/ray-on-aml. On your compute instance, can you run pip install --upgrade ray-on-aml and then restart the kernel? After that, follow the example here: https://github.com/microsoft/ray-on-aml/blob/master/examples/quick_start_examples.ipynb
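
To confirm that the upgrade took effect after the kernel restart, something like the sketch below can be run in a notebook cell. It uses only the standard library (importlib.metadata, available from Python 3.8); installed_version is a hypothetical helper name, and the distribution name "ray-on-aml" is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed
# in the environment the kernel is running in.
print(installed_version("ray-on-aml"))
```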

james-tn avatar Jun 09 '22 15:06 james-tn