
Cluster wait

Open idorrington92 opened this issue 2 years ago • 12 comments

Closes #6346

  • [x] Tests added / passed
  • [x] Passes pre-commit run --all-files

idorrington92 avatar Jul 09 '22 16:07 idorrington92

Can one of the admins verify this patch?

GPUtester avatar Jul 09 '22 16:07 GPUtester

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0   26 suites ±0   12h 50m 9s :stopwatch: -8m 54s
3 552 tests +1   3 444 :heavy_check_mark: -2   105 :zzz: ±0   2 :x: +2   1 :fire: +1
44 952 runs +13   42 826 :heavy_check_mark: +13   2 121 :zzz: -5   4 :x: +4   1 :fire: +1

For more details on these failures and errors, see this check.

Results for commit c95aafd9. ± Comparison against base commit 1baa5ffb.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 09 '22 17:07 github-actions[bot]

add to allowlist

quasiben avatar Jul 09 '22 17:07 quasiben

add to allowlist

@quasiben Sorry, what does this mean?

Also, I've got 3 failing tests in test_ssh. Is it possible to run these locally? They just fail for me, saying the cluster failed to start. This happens on the main branch as well. I'm not really sure how to debug them otherwise...

idorrington92 avatar Jul 12 '22 21:07 idorrington92

@idorrington92 the comment allows us to run your PR against a GPU CI system: https://docs.dask.org/en/stable/develop.html#gpu-ci

Distributed has been flaky in the past, and recently there has been a concerted effort to stabilize the CI tests: https://dask.github.io/distributed/test_report.html

As you can see in the linked report, test_dashboard_port_zero is a known flaky test.

I would suggest @jacobtomlinson review this PR, but he is currently at SciPy (as are other Dask maintainers), so it may be a little while before you hear back.

quasiben avatar Jul 12 '22 21:07 quasiben

Doesn't it say linting/pre-commit hooks passed OK?

I agree that the test failure is related, but I can't run it locally as I get an error saying the cluster failed to start.

Is there something I need to do in order to run it locally? Otherwise the only thing I've got to go on is the automated tests here...

idorrington92 avatar Jul 25 '22 10:07 idorrington92

Apologies, I was looking at Linting / Unit Test Results (pull_request), which actually has nothing to do with linting. Sorry for the noise.

Can you run ssh localhost without having to enter a password?

jacobtomlinson avatar Jul 25 '22 10:07 jacobtomlinson
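(For context: the test_ssh tests spin up a cluster over SSH to localhost, which is why passwordless `ssh localhost` is needed. A minimal sketch of that kind of setup, assuming the documented SSHCluster constructor and that asyncssh is installed; this is illustrative, not part of the PR diff.)

```python
from dask.distributed import Client, SSHCluster

# The first address runs the scheduler, the rest run workers; everything is on
# localhost here, which is why `ssh localhost` must work without a password prompt.
cluster = SSHCluster(["localhost", "localhost"])
client = Client(cluster)
client.wait_for_workers(1)  # the behaviour this PR touches
client.close()
cluster.close()
```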

Thanks @jacobtomlinson, that fixed the error I was getting. It would have taken me ages to think of that.

I had a little look at the bug that was coming up in the tests, but it's not obvious to me why it's happening. I'll have a proper look later in the week.

idorrington92 avatar Jul 25 '22 21:07 idorrington92

@jacobtomlinson can I get some help please?

I managed to fix those failing tests by calling client.scale in client.wait_for_workers. But when I merged in master, the test I wrote to check the cluster.wait_for_workers method started failing. It waits forever for the workers because they never change their status to running; they just get stuck in 'init'.

Do you have any ideas why this would be happening? No worries if not, I'll keep digging.

idorrington92 avatar Aug 09 '22 22:08 idorrington92

I'm not sure why that is happening, sorry.

jacobtomlinson avatar Aug 10 '22 13:08 jacobtomlinson

I think it's all working properly now. I had to create a function to get the worker status and update the scheduler_info. As it was, scheduler_info was only updated when the workers were added, which was why it looked like they were stuck with status 'init'. The new function updates the scheduler_info with the latest worker status.

One test is failing on Python 3.8, but I think that's nothing to do with my changes.

idorrington92 avatar Aug 14 '22 21:08 idorrington92
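(A hedged sketch, not the actual PR diff, of the kind of status-refresh loop described above: re-fetch the scheduler's identity to refresh scheduler_info, and count only workers whose status has moved past 'init'. The function name and exact fields are assumptions for illustration.)

```python
import asyncio
from time import monotonic


async def _wait_for_workers_sketch(cluster, n_workers, timeout=None):
    """Poll scheduler_info until n_workers report status 'running'."""
    deadline = None if timeout is None else monotonic() + timeout
    while True:
        # Refresh scheduler_info instead of relying on the snapshot taken when
        # workers were first added (where they can appear stuck in 'init').
        cluster.scheduler_info = await cluster.scheduler_comm.identity()
        workers = cluster.scheduler_info.get("workers", {})
        running = [w for w in workers.values() if w.get("status") == "running"]
        if len(running) >= n_workers:
            return
        if deadline is not None and monotonic() > deadline:
            raise TimeoutError(
                f"Only {len(running)}/{n_workers} workers reached 'running'"
            )
        await asyncio.sleep(0.1)
```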

Hi @jacobtomlinson and @fjetter, just pinging you to make sure this isn't forgotten about :)

I've merged the main branch into this branch and copied the new behaviour for when there are no workers.

idorrington92 avatar Sep 08 '22 08:09 idorrington92

Can we ping someone from Coiled to review this? It's been months now, and it would be nice to merge this.

idorrington92 avatar Dec 26 '22 06:12 idorrington92

Hi @idorrington92 sorry that this hasn't gotten much love. I suspect that most folks are out during the holidays. I'm holding down the fort for the moment.

If I were reviewing this initially I probably would have said "the try-except logic around cluster.wait_for_workers seems super simple and good to me 👍". However, the implementation of Cluster.wait_for_workers and the changes to update_scheduler_info also seem probably fine, but like an area that might be error prone. Do we actually want/need this for some reason? If not, I'd be inclined to skip it and stay with the status quo. It seems to me like the upstream issue was mostly asking for downstream clusters to have their wait_for_workers methods respected, not that we implement one for all Cluster classes.

My preference is to drop the implementation (even if this means dropping the test), and just stick with the try-except logic that you have in client.py (which seems super straightforward). However, I appreciate that you've also gone through several rounds of review, and having expectations change on you after so long probably isn't fun.

Do you have any thoughts on the above? Is there a solid reason to implement wait_for_workers and update_scheduler_info that I'm not seeing here?

mrocklin avatar Dec 26 '22 20:12 mrocklin
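(A minimal sketch of the client.py try-except logic being discussed, assuming Client keeps a reference to its cluster in self.cluster and otherwise falls back to its existing client-side wait; illustrative only, not the exact diff.)

```python
def wait_for_workers(self, n_workers, timeout=None):
    try:
        # Prefer the cluster's own implementation when one is attached and
        # defines wait_for_workers (e.g. a downstream deploy library).
        return self.cluster.wait_for_workers(n_workers, timeout=timeout)
    except AttributeError:
        # No cluster, or no cluster-side wait_for_workers: poll the scheduler.
        return self.sync(self._wait_for_workers, n_workers, timeout=timeout)
```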

Said a different way, I'm more than happy to merge the client.py changes immediately. I'd want to look over the other changes more thoroughly, which would take me some time. Or someone like @jacobtomlinson, who has already looked over them, could merge if he's feeling confident.

mrocklin avatar Dec 26 '22 20:12 mrocklin

Hi @mrocklin, thank you for your quick reply :) I see what you mean. I took the issue quite literally and just implemented the changes that were requested. But perhaps a simple try-except in client.py would allow clusters to do everything the issue really needs. I've not used clusters much, so I don't know from a user's perspective whether cluster.wait_for_workers is very useful.

FYI - following the comment from @fjetter above, I've replaced that try-except with an if-else, so that will need to be reverted if we do go down that route.

I'm happy either way, I learned a lot from working on this issue, and would rather remove most of my changes and contribute something useful than add a load of code that'll cause problems later on :)

I think it's better if people with more user knowledge than me decide which route we go down though :)

idorrington92 avatar Jan 19 '23 17:01 idorrington92

Just pinging again to see if we can merge this now?

idorrington92 avatar Apr 05 '23 15:04 idorrington92

I've handled the merge conflicts. I'm getting some failing tests, but I'm seeing these test failures in other PRs as well, so I don't think they're to do with my changes...

idorrington92 avatar Apr 10 '23 15:04 idorrington92

It looks like distributed/deploy/tests/test_cluster.py::test_cluster_wait_for_worker is failing a bunch here so it's certainly related to these changes. Would you mind taking another look?

jacobtomlinson avatar Apr 20 '23 11:04 jacobtomlinson

This was a breaking change for dask-gateway's client, I think, because it assumes that the new method is available instead of opting in to it only if it is.

Could we complement this PR with a conditional check that verifies the new function is available, possibly emitting a warning or similar?

If this is confirmed to be a reasonable call, I can open a PR, but since I'm a novice in this repo's code base it would be good to have a signal that it could make sense at all.

consideRatio avatar Dec 31 '23 13:12 consideRatio
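(A hedged sketch of the opt-in check being proposed here, reusing the same client-side fallback as in the earlier sketch; the warning text and structure are assumptions, not an agreed design.)

```python
import warnings


def wait_for_workers(self, n_workers, timeout=None):
    cluster_wait = getattr(self.cluster, "wait_for_workers", None)
    if callable(cluster_wait):
        # Clusters that define the method opt in to handling the wait themselves.
        return cluster_wait(n_workers, timeout=timeout)
    if self.cluster is not None:
        # e.g. an older dask-gateway client: warn, then use the old code path.
        warnings.warn(
            "Attached cluster does not implement wait_for_workers; "
            "falling back to the client-side wait."
        )
    return self.sync(self._wait_for_workers, n_workers, timeout=timeout)
```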