
Problems using preload to get scheduler address

Open ntabris opened this issue 2 years ago • 11 comments

Normally we use the Nanny class with a preload that gets the scheduler address (so that we can start EC2 instances for the scheduler and workers at the same time, rather than waiting until the scheduler EC2 instance has an IP address before starting the workers).

There are two issues I'm running into:

  1. dask-cuda requires the scheduler address to be set: https://github.com/rapidsai/dask-cuda/blob/main/dask_cuda/cuda_worker.py#L112-L120
  2. when dask-cuda instantiates Nanny it passes the preload rather than the preload_nanny kwarg. As far as I can tell, the preload kwarg on the Nanny class doesn't do anything: it's stored as self.preload, but then self.preload is set based only on preload_nanny. Relevant code: https://github.com/dask/distributed/blob/main/distributed/nanny.py#L162-L176

I've tested tweaks that don't require the scheduler address and that use the preload_nanny kwarg, and I was able to start a cluster and have the workers successfully connect to the scheduler using the address that the preload fetches. Here's my fork with the tweaks:

https://github.com/rapidsai/dask-cuda/compare/main...ntabris:main?expand=1

ntabris avatar May 09 '22 20:05 ntabris
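
For context, the setup described above typically takes the form of a preload module that the nanny runs at startup. The following is only an illustrative sketch: the metadata URL, the file name, and the scheduler_addr assignment are assumptions, not part of dask-cuda or the reporter's actual setup.

```python
# get_scheduler_address.py -- hypothetical nanny preload, illustrative only.
# The metadata endpoint and the scheduler_addr assignment are assumptions;
# a real deployment would fetch the address however its tooling exposes it.
import urllib.request


def dask_setup(nanny):
    # Look up the scheduler address from an external source instead of
    # requiring it to be known when the worker instance is launched.
    with urllib.request.urlopen("http://metadata.example/scheduler-address") as resp:
        address = resp.read().decode().strip()
    # Attribute name assumed from distributed.Nanny; adjust for the actual API.
    nanny.scheduler_addr = address
```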

As far as I can tell, the preload kwarg on Nanny class doesn't do anything

This may not fully address this issue, but I wanted to point out that the preload_nanny= keyword for the Nanny class is used to specify the preload scripts to run on the nanny during its startup process, while the preload= keyword argument to the Nanny class is used to specify preload scripts to run on the worker process that the nanny launches. Note that when looking at the Nanny source code there is Nanny.preload for the worker preloads and Nanny.preloads (note the s) for the nanny preloads.

jrbourbeau avatar May 10 '22 02:05 jrbourbeau
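
To make the distinction concrete, here is a minimal sketch of how the two keywords map onto the two processes, based on the behavior described above (the script names and scheduler address are placeholders):

```python
from distributed import Nanny

# Sketch only: "worker_setup.py" and "nanny_setup.py" are placeholder scripts.
nanny = Nanny(
    "tcp://scheduler:8786",
    preload=["worker_setup.py"],       # runs in the worker process the nanny launches
    preload_nanny=["nanny_setup.py"],  # runs on the nanny itself during its startup
)
```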

Note when looking at the Nanny source code there is Nanny.preload for the worker preloads and Nanny.preloads (note the s) for the nanny preloads.

Got it, I missed the s! Thanks for the clarification!

I think that means the issue is that dask-cuda passes preloads to the worker and doesn't have a way to pass nanny preloads, which is what (I think) we need. Perhaps the best solution is for dask-cuda to take both preload and preload_nanny kwargs (+ argvs)?

ntabris avatar May 10 '22 02:05 ntabris

Perhaps the best solution is for dask-cuda take both preload and preload_nanny kwargs (+ argvs)?

Yeah, that seems like it would be consistent with what the distributed.Nanny class is already doing upstream. @pentschev @madsbk thoughts on if adding preload_nanny / preload_nanny_argv keywords to CUDAWorker seems like a sensible addition? @ntabris is this something you're interested in working on? (no obligation though)

jrbourbeau avatar May 10 '22 02:05 jrbourbeau
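
A rough sketch of what that addition might look like where CUDAWorker constructs each Nanny; this is simplified and is not the actual dask-cuda implementation:

```python
# Simplified sketch of the proposed change, not the actual dask-cuda code:
# CUDAWorker would accept preload_nanny / preload_nanny_argv and forward them
# unchanged to distributed.Nanny, alongside the existing worker-process preloads.
from distributed import Nanny


def make_nanny(scheduler, preload, preload_argv, preload_nanny, preload_nanny_argv, **kwargs):
    return Nanny(
        scheduler,
        preload=list(preload) if preload else [],                        # worker preloads
        preload_argv=list(preload_argv) if preload_argv else [],
        preload_nanny=list(preload_nanny) if preload_nanny else [],      # nanny preloads
        preload_nanny_argv=list(preload_nanny_argv) if preload_nanny_argv else [],
        **kwargs,
    )
```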

@pentschev @madsbk thoughts on if adding preload_nanny / preload_nanny_argv keywords to CUDAWorker seems like a sensible addition?

Yes, sounds like the right approach.

madsbk avatar May 10 '22 06:05 madsbk

@pentschev @madsbk thoughts on if adding preload_nanny / preload_nanny_argv keywords to CUDAWorker seems like a sensible addition?

Yes, sounds like the right approach.

Agreed, this looks like a reasonable solution. Alternatively, it seems possible to just use the "distributed.nanny.preload" and "distributed.nanny.preload-argv" Dask configs as well.

pentschev avatar May 10 '22 09:05 pentschev
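
A minimal sketch of the config-based alternative mentioned above (the preload script name is a placeholder):

```python
import dask

# Point the nanny preload config keys at whatever script fetches the
# scheduler address; "get_scheduler_address.py" is a placeholder.
dask.config.set(
    {
        "distributed.nanny.preload": ["get_scheduler_address.py"],
        "distributed.nanny.preload-argv": [],
    }
)
```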

@ntabris are you able to move forward?

quasiben avatar May 18 '22 11:05 quasiben

The issues aren't resolved; dask-cuda just hasn't been a priority at the moment, and it sounds like I'll probably need to submit a PR if I want these things changed/fixed.

ntabris avatar May 18 '22 12:05 ntabris

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jun 17 '22 13:06 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Sep 15 '22 13:09 github-actions[bot]

Still an issue

quasiben avatar Sep 21 '22 13:09 quasiben

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Oct 21 '22 14:10 github-actions[bot]