[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
Since v2.6.0, torch respects ROCR_VISIBLE_DEVICES during AMD GPU device discovery (https://github.com/pytorch/pytorch/pull/144026), so we no longer need to always set RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES. However, we now need to pop the ROCR_VISIBLE_DEVICES that Ray sets when we initialize RayWorkerWrapper: vLLM still uses CUDA_VISIBLE_DEVICES on ROCm, and setting both at the same time causes a conflict.
In https://github.com/ray-project/ray/commit/3b9e729f6a669ffd85190f901f5e262af79771b0, Ray replaced its AMD device env var with HIP_VISIBLE_DEVICES, so we now also need to pop HIP_VISIBLE_DEVICES.
We also need to delay evaluating USE_ROWWISE_TORCH_SCALED_MM until the first time it is needed: at import time, CUDA_VISIBLE_DEVICES may not yet be set properly.
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.
🚀
#15244 has already been merged; is this still relevant?
This PR also covers more of the env var management.
Could you describe how you test this PR and the results?
The PR was tested via https://github.com/OpenRLHF/OpenRLHF/blob/fd81b5ed1d4212e9a34f545aae4955ab1c5a17e2/openrlhf/trainer/ray/vllm_engine.py, with RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 unset, distributed_executor_backend set to ray, and tensor_parallel_size = 2. Everything works correctly with this PR.
Without this PR, if we don't set RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1, it produces the following error:
Error executing method 'init_worker'. This might cause deadlock in distributed execution.
Traceback (most recent call last):
File "vllm/vllm/worker/worker_base.py", line 612, in execute_method
return run_method(self, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "vllm/vllm/utils.py", line 2221, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
return method(self, *_args, **_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "vllm/vllm/worker/worker_base.py", line 594, in init_worker
self.worker = worker_class(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "vllm/vllm/worker/worker.py", line 82, in __init__
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
^^^^^^^^^^^^^^^^^
File "vllm/vllm/worker/model_runner.py", line 1062, in __init__
self.attn_backend = get_attn_backend(
^^^^^^^^^^^^^^^^^
File "vllm/vllm/attention/selector.py", line 95, in get_attn_backend
return _cached_get_attn_backend(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "vllm/vllm/attention/selector.py", line 154, in _cached_get_attn_backend
return resolve_obj_by_qualname(attention_cls)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "vllm/vllm/utils.py", line 1904, in resolve_obj_by_qualname
module = importlib.import_module(module_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "vllm/vllm/attention/backends/rocm_flash_attn.py", line 27, in <module>
_GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/torch/cuda/__init__.py", line 576, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: No HIP GPUs are available
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @HollowMan6.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
cc @mgoin for discussions under https://github.com/vllm-project/vllm/pull/12695#discussion_r2478925809
/gemini review
@codex review
Codex Review: Didn't find any major issues. :tada:
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".