[BUG] Using `Client.wait_for_workers` Does Not Properly Wait for Workers

Open alexbarghi-nv opened this issue 1 year ago • 6 comments

While running benchmarks for the GNN packages in a multinode environment, @jnke2016 and I found that calling `Client.wait_for_workers` was not working properly, causing a hang or crash when running a Dask workflow. Currently, we have a workaround that uses a separate script (`wait_for_workers.py`) to wait for all workers prior to launching a workflow. This workaround should be eliminated in favor of fixing the bug and calling `Client.wait_for_workers` as intended by the Dask API.
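For illustration only, here is a minimal sketch of the kind of polling such a workaround might do. The actual contents of `wait_for_workers.py` are not shown in this issue, so the helper name `wait_for_worker_count` and its parameters are hypothetical. The helper takes any zero-argument callable that reports the current worker count; with a live `dask.distributed.Client`, one candidate is `lambda: len(client.nthreads())`, which is an assumption about how the count might be obtained, not what the benchmark scripts do.

```python
import time
from typing import Callable


def wait_for_worker_count(
    get_worker_count: Callable[[], int],
    n_workers: int,
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> int:
    """Poll until at least `n_workers` workers are reported, or time out.

    Hypothetical helper: `get_worker_count` is any callable returning the
    number of workers currently connected to the scheduler.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = get_worker_count()
        if count >= n_workers:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {count} of {n_workers} workers connected after {timeout}s"
            )
        time.sleep(poll_interval)


# Possible use with a real cluster (assumption, not taken from the issue):
#   from dask.distributed import Client
#   client = Client(scheduler_file="scheduler.json")  # hypothetical path
#   wait_for_worker_count(lambda: len(client.nthreads()), n_workers=16)
#
# The API the issue wants to rely on instead is:
#   client.wait_for_workers(n_workers=16, timeout=60)
```

Polling from the client side avoids depending on the call that is reported as misbehaving here, at the cost of extra round trips to the scheduler.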

alexbarghi-nv avatar Jan 09 '24 21:01 alexbarghi-nv

Possibly related to https://github.com/dask/distributed/pull/8314 ?

wence- avatar Jan 12 '24 16:01 wence-

Could be, I'll definitely test once that PR is merged.

alexbarghi-nv avatar Jan 12 '24 16:01 alexbarghi-nv

Not sure it will be, sorry. The approach I had there was not considered appropriate long term. I'll see if I can dig up the current state of any discussions.

wence- avatar Jan 17 '24 08:01 wence-

> The approach I had there was not considered appropriate long term. I'll see if I can dig up the current state of any discussions

@wence-, did you get any feedback?

jnke2016 avatar Feb 13 '24 18:02 jnke2016