Always return `ws.address` from `_remove_from_processing`
Closes #6874
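For context, a minimal sketch of the shape of the change (reconstructed from the title and the linked issue, not copied from the diff; the pre-PR early-out is an assumption):

```python
# Minimal sketch, assuming the pre-PR shape of the helper: it bailed out
# with None when the worker was no longer known to the scheduler, so
# callers occasionally received None instead of an address (see #6874).
def _remove_from_processing(state, ts):
    ws = ts.processing_on
    ts.processing_on = None
    # ... bookkeeping on ws elided ...
    if ws.address not in state.workers:  # worker may have been removed
        return None  # assumed pre-PR early-out; the PR drops it
    return ws.address  # post-PR: always return the address
```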
- [x] Tests added / passed
- [x] Passes `pre-commit run --all-files`
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0; 6h 17m 38s :stopwatch: (-48m 17s)
3 000 tests ±0: 2 911 :heavy_check_mark: (+3), 89 :zzz: (±0), 0 :x: (-3)
22 246 runs ±0: 21 194 :heavy_check_mark: (+2), 1 052 :zzz: (+1), 0 :x: (-3)
Results for commit 3ffcf711. ± Comparison against base commit 1d0701b5.
:recycle: This comment has been updated with latest results.
I think there are a couple of things we should verify here.
- We should change `stimulus_task_erred` to be a no-op if the `task-erred` message is not coming from the worker the task is actually processing on. We're already verifying that the task is still in processing, so this check needs to be expanded: https://github.com/dask/distributed/blob/1d0701b51c1dc354fb22af9b27b8f722f02daf63/distributed/scheduler.py#L4174-L4175 should really be

  ```python
  if ts is None or ts.state != "processing" or ts.processing_on != worker:
      return {}, {}, {}
  ```

  It's very likely this is not tested.
- Either drop the `worker` argument here, or keep it in the signature and add a validation assert in `transition_processing_erred` to ensure everything is in order. I think dropping it altogether is fine if `stimulus_task_erred` implements 1.) (a sketch of both points follows this list).
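A hedged sketch of how 1.) and 2.) could fit together. The method names come from the points above, but the exact signatures are assumptions; also, if `worker` arrives as an address string while `ts.processing_on` is a `WorkerState`, the comparison would need to go through `.address`:

```python
# Sketch only, not the actual scheduler code.
def stimulus_task_erred(self, key=None, worker=None, **kwargs):
    ts = self.tasks.get(key)
    # 1.) no-op unless the error comes from the worker the task is
    # actually processing on (.address is an assumption, see above)
    if ts is None or ts.state != "processing" or ts.processing_on.address != worker:
        return {}, {}, {}
    return self._transition(key, "erred", worker=worker, **kwargs)


def transition_processing_erred(self, key, worker=None, **kwargs):
    ts = self.tasks[key]
    # 2.) if the worker argument is kept in the signature, validate it
    # against scheduler state so inconsistencies surface early
    assert worker is None or ts.processing_on.address == worker
    ...  # actual transition logic elided
```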
> I think there are a couple of things we should verify here.
Done.
> I think dropping it altogether is fine if `stimulus_task_erred` implements 1.)
I have added the assertion step for now; better safe than sorry. It's odd enough that we ran into the `None` case in the first place.
> We should change `stimulus_task_erred` to be a no-op if the `task-erred` is not coming from the worker the task is actually processing on
Did this happen? I'm not seeing it in the diff?
I also disagree that it should be a no-op. We currently support an unexpected worker successfully completing a task: https://github.com/dask/distributed/blob/c15a10e87ca5d03e62f0ad4f38adb63163522979/distributed/scheduler.py#L1987-L1993
so we should also support an unexpected worker encountering an error with a task. (I do think both cases probably indicate something is going wrong and should eventually be removed; I just don't see why we'd be inconsistent between memory and erred.)
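For reference, a hedged paraphrase of the memory-side behavior being pointed at (the wrapper name here is hypothetical, and the body paraphrases the linked block rather than quoting it):

```python
import logging

logger = logging.getLogger("distributed.scheduler")


# Hypothetical wrapper; a result from an unexpected worker is logged but kept.
def accept_result_from_unexpected_worker(ts, ws, key):
    if ws != ts.processing_on:
        logger.info(
            "Unexpected worker completed task. Expected: %s, Got: %s, Key: %s",
            ts.processing_on, ws, key,
        )
    # ... the result is then recorded in memory on ws regardless ...
```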
> Did this happen? I'm not seeing it in the diff?
It looks like I forgot to push that commit, it's on my local branch.
> I also disagree that it should be a no-op. We currently support an unexpected worker successfully completing a task: [...] so we should also support an unexpected worker encountering an error with a task.
I think being inconsistent between memory and erred is fine. As you said, in both cases something is probably off, so if we were strict about handling here, we should probably ignore the task in both cases or treat these as errors. With memory, it's basically an optimization to say "Sweet, someone already did the work for us even though they were not supposed to, let's just take it". In the error case, however, we already know something is off if a worker that isn't supposed to process a task keeps working on it, so something going wrong with the task could be expected, and this shouldn't count toward the retry limit on healthy workers.
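If the no-op route wins out, the untested case above could be covered along these lines. A hedged sketch using distributed's test utilities; the test name is hypothetical and the exact `stimulus_task_erred` signature is assumed:

```python
import asyncio

from distributed import Event
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_task_erred_from_wrong_worker_is_noop(c, s, a, b):
    ev = Event()
    # Pin the task to worker `a` so we know where it is processing.
    fut = c.submit(lambda ev: ev.wait(), ev, key="x", workers=[a.address])
    while "x" not in s.tasks or s.tasks["x"].state != "processing":
        await asyncio.sleep(0.01)

    # An error reported by worker `b`, which is not processing "x", should
    # be ignored: no recommendations, no messages, no state change, and
    # nothing counted toward the retry limit.
    recs, client_msgs, worker_msgs = s.stimulus_task_erred(
        key="x", worker=b.address, exception=None, traceback=None
    )
    assert (recs, client_msgs, worker_msgs) == ({}, {}, {})
    assert s.tasks["x"].state == "processing"

    await ev.set()
    await fut
```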