distributed Always return `ws.address` from `_remove_from

Closes #6874

[x] Tests added / passed
[x] Passes pre-commit run --all-files

Aug 12 '22 14:08 hendrikmakait

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 6h 17m 38s :stopwatch: - 48m 17s   3 000 tests ±0   2 911 :heavy_check_mark: +3     89 :zzz: ±0 0 :x: - 3 22 246 runs ±0 21 194 :heavy_check_mark: +2 1 052 :zzz: +1 0 :x: - 3

Results for commit 3ffcf711. ± Comparison against base commit 1d0701b5.

:recycle: This comment has been updated with latest results.

Aug 12 '22 15:08 github-actions[bot]

I think there are a couple of things we should verify here.

We should change stimulus_task_erred to be a no-op if the task-erred is not coming from the worker the task is actually processing on. We're already verifying that the task is still in processing so this check needs to be expanded https://github.com/dask/distributed/blob/1d0701b51c1dc354fb22af9b27b8f722f02daf63/distributed/scheduler.py#L4174-L4175

should really be

        if ts is None or ts.state != "processing" or ts.processing_on != worker:
            return {}, {}, {}

it's very likely this is not tested

Either drop the worker argument here or keep the worker, keep the worker in the signature and add an validation assert in transition_processing_erred to ensure everything is in order I think dropping it all is fine if stimulus_task_erred implements 1.)

Aug 17 '22 10:08 fjetter

I think there are a couple of things we should verify here.

Done.

I think dropping it all is fine if stimulus_task_erred implements 1.)

I have added the assertion step for now, better safe than sorry. It's odd enough we ran into the None case in the first place.

Aug 17 '22 11:08 hendrikmakait

We should change stimulus_task_erred to be a no-op if the task-erred is not coming from the worker the task is actually processing on

Did this happen? I'm not seeing it in the diff?

I also disagree that it should be a no-op. We currently support an unexpected worker successfully competing a task: https://github.com/dask/distributed/blob/c15a10e87ca5d03e62f0ad4f38adb63163522979/distributed/scheduler.py#L1987-L1993

so we should also support an unexpected worker encountering an error with a task. (I do think both cases probably indicate something is going wrong and should eventually be removed, I just don't see why we'd be inconsistent about the memory vs erred.)

Aug 23 '22 20:08 gjoseph92

Did this happen? I'm not seeing it in the diff?

It looks like I forgot to push that commit, it's on my local branch.

I also disagree that it should be a no-op. We currently support an unexpected worker successfully competing a task: [...] so we should also support an unexpected worker encountering an error with a task.

I think being inconsistent between memory and erred is fine. As you said, in both cases something is probably off. So if we were strict about handling here, we should probably ignore the task in both cases or treat these as errors. With memory, it's basically an optimization to say "Sweet, someone already did the work for us even though they were not supposed to, let's just take it". In the error case,, however, we already know something is off if a worker who isn't supposed to process a task keeps working on it, so something going wrong with the task could be expected and this shouldn't count toward the retry-limit on healthy workers.

Aug 24 '22 05:08 hendrikmakait

distributed
distributed copied to clipboard

Always return `ws.address` from `_remove_from_processing`

Unit Test Results

distributed distributed copied to clipboard

Always return `ws.address` from `_remove_from_processing`

Unit Test Results

distributed
distributed copied to clipboard