
[rollout, vllm] fix: add compatibility shim for vllm process_weights_after_loading

Open · Kidand opened this pull request 1 month ago · 1 comment

What does this PR do?

This PR fixes a compatibility issue between VERL’s vLLM rollout and certain vLLM versions (e.g. vllm==0.8.5.post1) where importing:

from vllm.model_executor.model_loader.utils import process_weights_after_loading

raises:

ImportError: cannot import name 'process_weights_after_loading'

The function process_weights_after_loading is not present in some vLLM releases, causing VERL’s vllm_rollout_spmd.py to fail during initialization.
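A one-line reproducer on an affected version (illustrative; exits with the ImportError above on vllm==0.8.5.post1):

python -c "from vllm.model_executor.model_loader.utils import process_weights_after_loading"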

This PR adds a compatibility shim:

  • Try importing the official function when available.
  • Otherwise fall back to a safe no-op implementation, ensuring backward compatibility.

This unblocks running FSDP + vLLM rollout on versions such as vllm==0.8.5.post1.

Fixes #4202.


Test

Environment:

  • torch==2.6.0+cu124
  • vllm==0.8.5.post1
  • VERL rollout using FSDP + vLLM (actor_rollout_ref.rollout.name=vllm_rollout, mode=spmd)

1. Sanity check

python - << 'PY'
import torch, vllm
from vllm.platforms import current_platform
print("torch:", torch.__version__)
print("vllm:", vllm.__version__)
print("current_platform:", type(current_platform).__name__)
PY

Expected output:

torch: 2.6.0+cu124
vllm: 0.8.5.post1
current_platform: NvmlCudaPlatform

2. PPO training test

Before patch:

ImportError: cannot import name 'process_weights_after_loading'

After this PR:

  • init_model() and update_weights() succeed.
  • vLLM engines initialize normally across Ray workers.
  • PPO training proceeds without ImportError.
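For reference, the run was launched with a command of roughly this shape (illustrative sketch using verl's main_ppo entry point with Hydra-style overrides; the dataset and model paths are placeholders):

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm_rollout \
    actor_rollout_ref.rollout.mode=spmd \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.model.path=/path/to/actor \
    data.train_files=/path/to/train.parquet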

API and Usage Example

This PR does not modify any public API, CLI, or user-facing config. All configs continue to work as before.

Example rollout configuration:

actor_rollout_ref:
  rollout:
    name: vllm_rollout
    mode: spmd
    tensor_model_parallel_size: 2
    max_model_len: 4096
    engine_kwargs:
      vllm:
        gpu_memory_utilization: 0.9

No changes are required for users.
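For orientation, the engine settings above map onto vLLM's public LLM constructor roughly as follows (illustrative sketch; the model path is a placeholder):

from vllm import LLM

# How the rollout config translates to vLLM engine arguments:
llm = LLM(
    model="/path/to/actor",        # placeholder actor checkpoint path
    tensor_parallel_size=2,        # tensor_model_parallel_size
    max_model_len=4096,            # max_model_len
    gpu_memory_utilization=0.9,    # engine_kwargs.vllm.gpu_memory_utilization
)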


Design & Code Changes

vLLM recently refactored weight-loading utilities, and some versions do not expose:

vllm.model_executor.model_loader.utils.process_weights_after_loading

This caused VERL’s vLLM rollout to crash during:

model.load_weights(...)
process_weights_after_loading(model, vllm_config, device)

The fix:

try:
    # Newer vLLM releases expose the helper here; use it when available.
    from vllm.model_executor.model_loader.utils import process_weights_after_loading
except ImportError:
    def process_weights_after_loading(*args, **kwargs):
        """Compatibility shim (no-op) for vLLM versions without this helper."""
        return
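
Whichever branch is taken, the name is bound to a callable, so vllm_rollout_spmd.py can import and invoke it unconditionally. A trivial check (illustrative):

# On the fallback path the call is simply a no-op.
assert callable(process_weights_after_loading)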

Benefits:

  • Safe fallback for older vLLM versions.
  • Uses official implementation automatically when available.
  • No behavior changes for users who rely on newer vLLM versions.
  • No API changes; fully backward-compatible.

Checklist Before Submitting

  • [x] Read the Contribute Guide: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md

  • [x] Run pre-commit: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always

  • [x] Add / update documentation (N/A — internal compatibility fix).

  • [x] Add or explain tests:

    • This is a compatibility shim around vLLM internals; multi-version CI matrix not currently available.
    • Verified manually via PPO + vLLM rollout training runs.

  • [x] When ready, request CI in Slack (ci-request channel) or Feishu.

Kidand · Nov 26 '25 07:11

Hi @chenhaiq @wuxibin89 @PeterSH6

Gentle ping on this PR. 👋

This change fixes a compatibility issue with vllm==0.8.5.post1 (the ImportError on process_weights_after_loading) which currently breaks the rollout initialization.

It introduces a simple shim to ensure backward compatibility without changing existing behavior. Could you please take a look when you have a moment? Thanks!

Kidand · Dec 03 '25 01:12