
22.08 nightly container does not launch Dask scheduler properly

Open hcho3 opened this issue 3 years ago • 4 comments

The EC2 MNMG notebook currently uses the stable 21.06 container (rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04-py3.8) and is able to launch the EC2 cluster successfully.

However, when I replace it with the latest nightly container (rapidsai/rapidsai-nightly:22.08-cuda11.5-runtime-ubuntu20.04-py3.9), the EC2 cluster fails to launch. For some reason, the container fails to start the Dask scheduler on port 8786. (I waited more than 3 hours and the scheduler still didn't come up on 8786.)

TODO: Investigate why python -m distributed.cli.dask_scheduler fails on the latest nightly container.
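
One way to check whether the scheduler ever starts listening, rather than waiting on the cluster launch itself, is to poll the port directly. The following is a minimal sketch using only the standard library; the helper name `wait_for_port` and its default timeout values are assumptions for illustration, not part of the notebook or of Dask.

```python
import socket
import time


def wait_for_port(host: str, port: int = 8786, timeout: float = 60.0,
                  interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: 8786 is the scheduler port mentioned in this issue;
    the timeout and interval defaults are arbitrary.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry until the deadline.
            time.sleep(interval)
    return False
```

Calling `wait_for_port("<instance-ip>")` against the scheduler instance would distinguish "scheduler never bound the port" from other launch failures, assuming the security group allows the connection.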

hcho3 avatar Aug 10 '22 07:08 hcho3

I wonder if the environment variable DISABLE_JUPYTER needs to be set to true. The RAPIDS Docker image might not be starting Dask at all if it is just blocking on Jupyter as the foreground process.

cluster = EC2Cluster(env_vars={"DISABLE_JUPYTER": "true", **get_aws_credentials()},
                     ...

xref https://github.com/rapidsai/docker/pull/425, but that change was made in January, so I'm surprised we aren't seeing these issues in 22.06 too.

jacobtomlinson avatar Aug 10 '22 12:08 jacobtomlinson

The current notebook uses 21.06. When I switched to 22.06, I got the same issue.

hcho3 avatar Aug 11 '22 12:08 hcho3

Indeed, after setting DISABLE_JUPYTER=true, I observe the Dask scheduler launching successfully. I will incorporate this in my pull request. Thanks!
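
For reference, the workaround can be sketched roughly as follows. This is a sketch under assumptions, not the exact change in the pull request: `build_env_vars` and `launch_cluster` are hypothetical helper names, and the credentials dictionary stands in for whatever `get_aws_credentials()` returns in the notebook.

```python
def build_env_vars(credentials: dict) -> dict:
    """Merge AWS credentials with the flag that stops the image from starting Jupyter."""
    # DISABLE_JUPYTER=true is the setting confirmed in this thread.
    return {"DISABLE_JUPYTER": "true", **credentials}


def launch_cluster(credentials: dict, docker_image: str):
    """Hypothetical wrapper: start an EC2Cluster with Jupyter disabled in the container."""
    # Imported lazily so build_env_vars can be used without dask-cloudprovider installed.
    from dask_cloudprovider.aws import EC2Cluster

    return EC2Cluster(
        docker_image=docker_image,
        env_vars=build_env_vars(credentials),
    )
```

Keeping the environment merge in its own function makes it easy to confirm that the flag is present before any EC2 instance is created.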

hcho3 avatar Aug 11 '22 12:08 hcho3

Ah yup, I misread your initial comment as 22.06, but if we are upgrading from 21.06 that makes a lot of sense.

jacobtomlinson avatar Aug 12 '22 14:08 jacobtomlinson

@hcho3 just going through old issues, can this be closed out now?

jacobtomlinson avatar Jan 13 '23 15:01 jacobtomlinson