[bugfix] avoid NIXL_ERR_REMOTE_DISCONNECT in nixl_connector when Prefill dies

Open hasB4K opened this issue 3 weeks ago • 2 comments

Purpose

Prevent NIXL_ERR_REMOTE_DISCONNECT from propagating out of NixlConnectorWorker._pop_done_transfers(...) on a Decode instance when a Prefill instance dies. This is the stack trace we observed:

  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2517, in execute_model
    self.maybe_get_kv_connector_output(scheduler_output) as
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/uv/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 121, in _get_kv_connector_output
    kv_connector.get_finished(scheduler_output.finished_req_ids))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 236, in get_finished
    return self.connector_worker.get_finished()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1131, in get_finished
    done_recving = self._pop_done_transfers(self._recving_transfers)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1207, in _pop_done_transfers
    xfer_state = self.nixl_wrapper.check_xfer_state(handle)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/nixl/_api.py", line 613, in check_xfer_state
    status = self.agent.getXferStatus(handle._handle)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nixl._bindings.nixlRemoteDisconnectError: NIXL_ERR_REMOTE_DISCONNECT
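
One way to keep the polling loop alive is to catch the disconnect per handle and report the affected request as failed instead of letting the exception escape. The following is a minimal sketch under stated assumptions, not the actual patch: `RemoteDisconnectError` is a hypothetical stand-in for `nixl._bindings.nixlRemoteDisconnectError`, `check_xfer_state` is passed in as a plain callable, and the `"DONE"` / `"PROC"` state strings are assumed.

```python
import logging

logger = logging.getLogger(__name__)


class RemoteDisconnectError(Exception):
    """Hypothetical stand-in for nixl._bindings.nixlRemoteDisconnectError."""


def pop_done_transfers(transfers, check_xfer_state):
    """Poll every pending transfer and split finished requests from failed ones.

    transfers: dict mapping req_id -> list of transfer handles (mutated in place).
    check_xfer_state: callable returning "DONE" or "PROC" (assumed state names),
        or raising RemoteDisconnectError when the remote peer is gone.
    Returns (done_req_ids, failed_req_ids).
    """
    done, failed = set(), set()
    for req_id, handles in list(transfers.items()):
        in_progress = []
        broken = False
        for handle in handles:
            try:
                state = check_xfer_state(handle)
            except RemoteDisconnectError as exc:
                # Do not reference `state` here: it is unbound if the call raised.
                logger.warning("Transfer for request %s lost its peer: %s", req_id, exc)
                broken = True
                break
            if state == "PROC":
                in_progress.append(handle)
            elif state != "DONE":
                broken = True
                break
        if broken:
            failed.add(req_id)
            del transfers[req_id]
        elif not in_progress:
            done.add(req_id)
            del transfers[req_id]
        else:
            # Keep only the handles that are still running.
            transfers[req_id] = in_progress
    return done, failed
```

Returning the failed request IDs separately lets the caller decide what to do with them (for example, abort or reschedule), which this sketch deliberately leaves open.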

Test Plan

The tests, which simulate UCX/NIXL failures, are being run in this PR: https://github.com/vllm-project/vllm/pull/27481

hasB4K, Nov 05 '25 10:11

💡 Codex Review

https://github.com/vllm-project/vllm/blob/3c271bc9a40690c77978411cd487bfa2b40b43fd/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py#L1613-L1630

P1: Initialize xfer_state before exception logging

The new exception path intends to swallow transport errors, but when check_xfer_state(handle) itself raises, the except block logs xfer_state even though that variable was never assigned. This will raise UnboundLocalError and propagate an unexpected exception instead of handling the remote disconnect, undoing the goal of the change. Initialize xfer_state before the try or log a constant so the handler can run to completion.
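
The pitfall described above can be reproduced in isolation. This is an illustrative sketch only: `buggy_poll`, `fixed_poll`, and `always_disconnects` are hypothetical names, and `RuntimeError` stands in for the NIXL exception type.

```python
def buggy_poll(check_xfer_state, handle):
    try:
        xfer_state = check_xfer_state(handle)
    except RuntimeError:
        # xfer_state was never bound because the call above raised,
        # so this line raises UnboundLocalError inside the handler.
        return f"transfer failed with state {xfer_state}"
    return xfer_state


def fixed_poll(check_xfer_state, handle):
    xfer_state = None  # bind the name before the try block
    try:
        xfer_state = check_xfer_state(handle)
    except RuntimeError as exc:
        # Safe: xfer_state is always bound, and the exception carries the detail.
        return f"transfer failed (last state {xfer_state}): {exc}"
    return xfer_state


def always_disconnects(handle):
    """Hypothetical checker that simulates a dead remote peer."""
    raise RuntimeError("NIXL_ERR_REMOTE_DISCONNECT")
```

Calling `buggy_poll(always_disconnects, "h")` raises `UnboundLocalError`, while `fixed_poll(always_disconnects, "h")` runs the handler to completion. Logging the caught exception instead of the state variable would work equally well.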

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @hasB4K.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot], Nov 14 '25 19:11