Cluster wait
Closes #6346
- [x] Tests added / passed
- [x] Passes `pre-commit run --all-files`
Can one of the admins verify this patch?
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
26 files ±0   26 suites ±0   12h 50m 9s :stopwatch: -8m 54s
3 552 tests +1   3 444 :heavy_check_mark: -2   105 :zzz: ±0   2 :x: +2   1 :fire: +1
44 952 runs +13   42 826 :heavy_check_mark: +13   2 121 :zzz: -5   4 :x: +4   1 :fire: +1
For more details on these failures and errors, see this check.
Results for commit c95aafd9. ± Comparison against base commit 1baa5ffb.
:recycle: This comment has been updated with latest results.
add to allowlist
@quasiben Sorry, what does this mean?
Also, I've got 3 failing tests in test_ssh. Is it possible to run these locally? They just fail for me, saying cluster failed to start. This happens on main branch as well. I'm not really sure how to debug them otherwise...
@idorrington92 the comment allows us to run your PR against a GPU CI system: https://docs.dask.org/en/stable/develop.html#gpu-ci
Distributed has been flaky in the past and recently there has been a concerted effort to stabilize the CI tests: https://dask.github.io/distributed/test_report.html
As you can see in the linked report, `test_dashboard_port_zero` is a known flaky test.
I would suggest @jacobtomlinson review this PR but he is currently at SciPy (as are other dask maintainers) so it may be a little bit before you hear back
Doesn't it say the linting/pre-commit hooks passed OK?
I agree that the test failure is related, but I can't run it locally as I get an error saying the cluster failed to start.
Is there something I need to do in order to run it locally? Otherwise the only thing I've got to go on is the automated tests here...
Apologies, I was looking at Linting / Unit Test Results (pull_request), which actually has nothing to do with linting. Sorry for the noise.
Can you run `ssh localhost` without having to enter a password?
Thanks @jacobtomlinson, that fixed the error I was getting. It would have taken me ages to think of that.
I had a little look at the bug that was coming up in the tests, but it's not obvious to me why it's happening. I'll have a proper look later in the week.
@jacobtomlinson can I get some help please?
I managed to fix those failing tests by calling client.scale in client.wait_for_workers. But when I merged in master, the test I wrote to check the cluster.wait_for_workers method started failing. It waits forever for the workers because they never change their status to running; they just get stuck on init.
Do you have any ideas why this would be happening? No worries if not, I'll keep digging.
I'm not sure why that is happening sorry.
I think it's all working properly now. I had to create a function to get the worker status and update the scheduler_info. As it was, scheduler_info was only updated when the workers were added, which was why it looked like they were stuck with status 'init'. The new function updates the scheduler_info with the latest worker status.
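For readers following along, here is a rough, purely illustrative sketch of the approach described above. The function names, signatures, and dictionary keys are assumptions for illustration and not the PR's exact code: refresh `scheduler_info` from the scheduler so worker statuses are current, then poll it until enough workers report `running`.

```python
import asyncio
import time


async def update_scheduler_info(cluster):
    # Ask the scheduler for its current view (which includes per-worker status)
    # rather than relying on the snapshot taken when workers were first added.
    cluster.scheduler_info = await cluster.scheduler_comm.identity()


async def wait_for_workers(cluster, n_workers, timeout=None, poll_interval=0.1):
    # Poll the refreshed scheduler_info until at least n_workers report "running".
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        await update_scheduler_info(cluster)
        running = [
            w
            for w in cluster.scheduler_info.get("workers", {}).values()
            if w.get("status") == "running"
        ]
        if len(running) >= n_workers:
            return
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(
                f"Only {len(running)}/{n_workers} workers running after {timeout}s"
            )
        await asyncio.sleep(poll_interval)
```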
One test is failing on python 3.8, but I think that's nothing to do with my changes.
Hi @jacobtomlinson and @fjetter, just pinging you to make sure this isn't forgotten about :)
I've merged the main branch into this branch and copied the new behaviour for when there are no workers.
Can we ping someone from Coiled to review this? It's been months now and it would be nice to merge this.
Hi @idorrington92 sorry that this hasn't gotten much love. I suspect that most folks are out during the holidays. I'm holding down the fort for the moment.
If I were reviewing this initially I probably would have said "the try-except logic around `cluster.wait_for_workers` seems super simple and good to me 👍". However, the implementation for `Cluster.wait_for_workers` and the changes to `update_scheduler_info` also seem probably good, but like an area that might be error prone. Do we actually want/need this for some reason? If not I'd be inclined to skip this and stay with the status quo. It seems to me like the upstream issue was mostly asking for the ability of downstream clusters to have their `wait_for_workers` methods respected, not that we implement one for all `Cluster` classes.
My preference is to drop the implementation (even if this means dropping the test), and just stick with the try-except logic that you have in `client.py` (which seems super straightforward). However, I appreciate that you've also gone through several rounds of review, and having expectations change on you after so long probably isn't fun.
Do you have any thoughts on the above? Is there a solid reason to implement `wait_for_workers` and `update_scheduler_info` that I'm not seeing here?
Said a different way, I'm more than happy to merge the `client.py` changes immediately. I'd want to look over the other changes more thoroughly, which would take me some time. Or someone like @jacobtomlinson, who has already looked over them, could merge if he's feeling confident.
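For context, a minimal sketch of the kind of client-side fallback being discussed. The class, exception types, and method names here are illustrative assumptions, not the PR's exact diff:

```python
import asyncio


class _ClientSketch:
    """Illustrative stand-in for the relevant part of distributed.Client."""

    def __init__(self, cluster=None):
        self.cluster = cluster

    async def _wait_for_workers(self, n_workers, timeout=None):
        # Placeholder for the client's existing, scheduler-based wait logic.
        await asyncio.sleep(0)

    async def wait_for_workers(self, n_workers, timeout=None):
        # Try the cluster's own implementation first; if the cluster does not
        # provide one (e.g. a base class raising NotImplementedError, or no
        # cluster attached at all), fall back to the client's generic logic.
        try:
            await self.cluster.wait_for_workers(n_workers, timeout=timeout)
        except (AttributeError, NotImplementedError):
            await self._wait_for_workers(n_workers, timeout=timeout)
```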
Hi @mrocklin, thank you for your quick reply :) I see what you mean. I took the issue quite literally and just implemented the changes that were requested, but perhaps a simple try-catch in `client.py` would allow clusters to do everything the issue really needs. I've not used clusters much, so I don't know from a user's perspective whether `cluster.wait_for_workers` is very useful.
FYI - Following the comment from @fjetter above, I've replaced that try-catch with an if-else, so that will need to be reverted if we do go down that route.
I'm happy either way, I learned a lot from working on this issue, and would rather remove most of my changes and contribute something useful than add a load of code that'll cause problems later on :)
I think it's better if people with more user knowledge than me decide which route we go down though :)
Just pinging again to see if we can merge this now?
I've handled the merge conflicts. I'm getting some failing tests, but I'm seeing these test failures in other PRs as well, so I don't think it's to do with my changes...
It looks like `distributed/deploy/tests/test_cluster.py::test_cluster_wait_for_worker` is failing a bunch here, so it's certainly related to these changes. Would you mind taking another look?
This was a breaking change for dask-gateway's client, I think, because it assumes that the method is available instead of opting in to it only if it is.
Can we complement this PR with a conditional check to verify the new function is available, or something similar? Possibly emitting a warning?
If this is confirmed to be a reasonable call, I can open a PR, but since I'm a novice in this repo's code base it would be good to have a signal that it could make sense at all.
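To make the suggestion concrete, here is a hedged sketch of what such a conditional check might look like. The function name, warning text, and fallback call are assumptions for illustration only, not a proposed API:

```python
import warnings


async def wait_for_workers_compat(client, n_workers, timeout=None):
    # Hypothetical guard: only delegate to the cluster when it actually exposes
    # wait_for_workers; third-party cluster managers (e.g. dask-gateway's) may not.
    cluster = getattr(client, "cluster", None)
    if cluster is not None and hasattr(cluster, "wait_for_workers"):
        await cluster.wait_for_workers(n_workers, timeout=timeout)
    else:
        warnings.warn(
            "Cluster does not provide wait_for_workers; "
            "falling back to the client-side wait.",
            stacklevel=2,
        )
        # Hypothetical fallback to whatever client-side wait logic exists.
        await client.wait_for_workers(n_workers, timeout=timeout)
```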