[bugfix] avoid NIXL_ERR_REMOTE_DISCONNECT in nixl_connector when Prefill dies
Purpose
Avoid a NIXL_ERR_REMOTE_DISCONNECT crash in NixlConnectorWorker._pop_done_transfers(...) on a Decode instance when a Prefill instance dies. This is the stack trace we observed:
File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2517, in execute_model
self.maybe_get_kv_connector_output(scheduler_output) as
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/uv/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 121, in _get_kv_connector_output
kv_connector.get_finished(scheduler_output.finished_req_ids))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 236, in get_finished
return self.connector_worker.get_finished()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1131, in get_finished
done_recving = self._pop_done_transfers(self._recving_transfers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1207, in _pop_done_transfers
xfer_state = self.nixl_wrapper.check_xfer_state(handle)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/nixl/_api.py", line 613, in check_xfer_state
status = self.agent.getXferStatus(handle._handle)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nixl._bindings.nixlRemoteDisconnectError: NIXL_ERR_REMOTE_DISCONNECT
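For illustration only, here is a minimal sketch of the kind of handling this change aims for: poll each in-flight transfer and treat an exception from the status check as a failed transfer, rather than letting it propagate up through execute_model. Every name below (`RemoteDisconnectError`, `FakeWrapper`, `pop_done_transfers`) is a hypothetical stand-in, not the actual vLLM or NIXL code, and the state strings "DONE" and "PROC" are assumptions made for the example.

```python
class RemoteDisconnectError(Exception):
    """Hypothetical stand-in for nixl._bindings.nixlRemoteDisconnectError."""


class FakeWrapper:
    """Hypothetical stand-in for the NIXL wrapper: maps handle -> state or exception."""

    def __init__(self, states):
        self.states = states

    def check_xfer_state(self, handle):
        state = self.states[handle]
        if isinstance(state, Exception):
            raise state
        return state


def pop_done_transfers(wrapper, transfers):
    """Return (done, failed) request IDs and remove them from `transfers`.

    `transfers` maps req_id -> list of handles that are still in flight.
    """
    done, failed = set(), set()
    for req_id, handles in list(transfers.items()):
        in_progress = []
        broken = False
        for handle in handles:
            try:
                state = wrapper.check_xfer_state(handle)
            except RemoteDisconnectError:
                # The remote side died: mark the request failed instead of crashing.
                broken = True
                break
            if state == "PROC":
                in_progress.append(handle)
            elif state != "DONE":
                # Any other state is treated as an error in this sketch.
                broken = True
                break
        if broken:
            failed.add(req_id)
            del transfers[req_id]
        elif not in_progress:
            done.add(req_id)
            del transfers[req_id]
        else:
            transfers[req_id] = in_progress
    return done, failed
```

In this sketch a request with one dead handle is reported as failed and dropped, while requests that are still in progress stay in the map for the next poll. What the caller does with failed requests (retry, recompute, abort) is outside the scope of the example.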
Test Plan
Tests that simulate UCX/NIXL failures are being run in this PR: https://github.com/vllm-project/vllm/pull/27481
💡 Codex Review
https://github.com/vllm-project/vllm/blob/3c271bc9a40690c77978411cd487bfa2b40b43fd/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py#L1613-L1630
Initialize xfer_state before exception logging
The new exception path intends to swallow transport errors, but when check_xfer_state(handle) itself raises, the except block logs xfer_state even though that variable was never assigned. This will raise UnboundLocalError and propagate an unexpected exception instead of handling the remote disconnect, undoing the goal of the change. Initialize xfer_state before the try or log a constant so the handler can run to completion.
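The pitfall described in this review comment can be reproduced in isolation. The sketch below is not the PR's code: `check_state`, `buggy_poll`, and `fixed_poll` are hypothetical names, and `ConnectionError` stands in for the NIXL disconnect exception. It shows that reading a variable in an `except` block fails when the assignment inside `try` never completed, and that binding the variable before the `try` lets the handler finish.

```python
def check_state(handle):
    """Hypothetical status check that always fails, like a dead remote peer."""
    raise ConnectionError("remote disconnected")


def buggy_poll(handle):
    try:
        xfer_state = check_state(handle)
    except ConnectionError:
        # xfer_state was never assigned, so reading it raises UnboundLocalError.
        return f"transfer failed with state {xfer_state}"
    return xfer_state


def fixed_poll(handle):
    xfer_state = None  # bound before the try, so the handler can always read it
    try:
        xfer_state = check_state(handle)
    except ConnectionError as exc:
        return f"transfer failed with state {xfer_state}: {exc}"
    return xfer_state
```

Calling `buggy_poll` raises `UnboundLocalError` from inside the handler, which replaces the handled disconnect with a new, unexpected exception. `fixed_poll` returns a message instead. Logging the caught exception rather than the state variable is an equally valid way to avoid the problem.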
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @hasB4K.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork