distributed icon indicating copy to clipboard operation
distributed copied to clipboard

`AssertionError` in `WorkerState._transition_cancelled_error`

Open hendrikmakait opened this issue 3 years ago • 3 comments
trafficstars

Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Exception in callback Worker._handle_stimulus_from_task(<Task finishe... to SSL")\n')>) at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py:1844
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: handle: <Handle Worker._handle_stimulus_from_task(<Task finishe... to SSL")\n')>) at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py:1844>
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/events.py", line 80, in _run
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     self._context.run(self._callback, *self._args)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 193, in wrapper
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return method(self, *args, **kwargs)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1854, in _handle_stimulus_from_task
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     super()._handle_stimulus_from_task(task)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 3341, in _handle_stimulus_from_task
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     self.handle_stimulus(stim)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 193, in wrapper
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return method(self, *args, **kwargs)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1866, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     super().handle_stimulus(*stims)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 3354, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     instructions = self.state.handle_stimulus(*stims)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 1279, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     instructions += self._transitions(recs, stimulus_id=stim.stimulus_id)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2531, in _transitions
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     process_recs(recommendations.copy())
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2525, in process_recs
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     a_recs, a_instructions = self._transition(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2443, in _transition
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     recs, instructions = func(self, ts, *args, stimulus_id=stimulus_id)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 1921, in _transition_cancelled_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     assert ts.previous in ("executing", "long-running")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: AssertionError

(https://cloud.coiled.io/dask-engineering/clusters/48635/details on hendrik-debug-worker-057e4c65bd)

Notably, the following log messages directly preceded the error:

Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,215 - distributed.core - INFO - Event loop was unresponsive in Worker for 45.99s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,225 - distributed.worker - INFO - Stopping worker at tls://10.0.3.173:42801
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,240 - distributed.core - ERROR - Exception while handling op get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     frames_nbytes = await stream.read_bytes(fmt_size)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: tornado.iostream.StreamClosedError: Stream is closed
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 769, in _handle_comm
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     result = await result
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await comm.read(deserializers=serializers)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     convert_stream_closed_error(self, e)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     if "UNKNOWN_CA" in exc.reason:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: TypeError: argument of type 'NoneType' is not iterable
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,268 - distributed.worker - ERROR - IOStream is not idle; cannot convert to SSL
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1362, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await done.wait()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/locks.py", line 214, in wait
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await fut
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: asyncio.exceptions.CancelledError
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await get_data_from_worker(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await retry_operation(_get_data, operation="get_data_from_worker")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await retry(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await coro()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await rpc.connect(worker)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1367, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await connect_attempt
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await connect(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/core.py", line 291, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await asyncio.wait_for(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return fut.result()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 449, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     stream = await self.client.connect(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/tornado/tcpclient.py", line 288, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     stream = await stream.start_tls(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/tornado/iostream.py", line 1270, in start_tls
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     raise ValueError("IOStream is not idle; cannot convert to SSL")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: ValueError: IOStream is not idle; cannot convert to SSL
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Task exception was never retrieved
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: future: <Task finished name='Task-338' coro=<Server._handle_comm() done, defined at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py:675> exception=TypeError("argument of type 'NoneType' is not iterable")>
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     frames_nbytes = await stream.read_bytes(fmt_size)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: tornado.iostream.StreamClosedError: Stream is closed
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 769, in _handle_comm
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     result = await result
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await comm.read(deserializers=serializers)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     convert_stream_closed_error(self, e)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     if "UNKNOWN_CA" in exc.reason:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: TypeError: argument of type 'NoneType' is not iterable

hendrikmakait avatar Aug 11 '22 14:08 hendrikmakait

cc @crusaderky

hendrikmakait avatar Aug 11 '22 14:08 hendrikmakait

ts.previous == "flight". This is tightly related to #6705: The network stack raised, which caused gather_dep to return GatherDepFailureEvent, which is quite uncommon in real life and virtually untested.

crusaderky avatar Aug 11 '22 18:08 crusaderky

Note that the scope of this issue is specifically about the state machine. The issue in the network stack is tracked separately, in #6875.

crusaderky avatar Aug 11 '22 18:08 crusaderky