
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES

Open HollowMan6 opened this issue 8 months ago • 5 comments

Since v2.6.0, torch respects ROCR_VISIBLE_DEVICES for AMD GPU device discovery: https://github.com/pytorch/pytorch/pull/144026. So we no longer need to always set RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES, but we do need to pop the ROCR_VISIBLE_DEVICES set by Ray when we initialize RayWorkerWrapper: we still use CUDA_VISIBLE_DEVICES for ROCm, and setting both at the same time causes a conflict.
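As a sketch of the idea (the function and constant names here are illustrative, not vLLM's actual helpers), popping the Ray-set AMD device variables before torch initializes could look like:

```python
import os

# Ray may export ROCR_VISIBLE_DEVICES (or, after the commit below,
# HIP_VISIBLE_DEVICES) for the worker. vLLM drives ROCm device selection
# through CUDA_VISIBLE_DEVICES, so the Ray-set variables are removed to
# avoid both being honored at once.
_RAY_AMD_DEVICE_VARS = ("ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES")

def pop_ray_device_env_vars(env=os.environ) -> dict:
    """Remove Ray-set AMD device variables; return what was removed."""
    removed = {}
    for var in _RAY_AMD_DEVICE_VARS:
        if var in env:
            removed[var] = env.pop(var)
    return removed
```

A worker wrapper would call this once during initialization, before any `torch.cuda` call triggers HIP device discovery.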

In https://github.com/ray-project/ray/commit/3b9e729f6a669ffd85190f901f5e262af79771b0, Ray replaced the AMD device env var with HIP_VISIBLE_DEVICES, so now we also need to pop HIP_VISIBLE_DEVICES.

We also need to delay setting USE_ROWWISE_TORCH_SCALED_MM until the first time it is needed, because at import time CUDA_VISIBLE_DEVICES may not yet be set properly.
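The deferred evaluation can be sketched with a cached function instead of a module-level constant. This is a minimal illustration, not vLLM's code: `_detect_gcn_arch` stands in for the real `torch.cuda.get_device_properties(...).gcnArchName` lookup so the example runs without a GPU, and the architecture list is a hypothetical condition.

```python
import functools

def _detect_gcn_arch() -> str:
    # Stand-in for torch.cuda.get_device_properties("cuda").gcnArchName,
    # stubbed here so the sketch runs without a GPU.
    return "gfx942"

@functools.cache
def use_rowwise_scaled_mm() -> bool:
    # Evaluated (and cached) on first call, which happens inside the
    # worker after device env vars are final, rather than at import time
    # when CUDA_VISIBLE_DEVICES may not yet be set.
    return _detect_gcn_arch() in ("gfx940", "gfx941", "gfx942")
```

The same pattern fixes the module-level `_GPU_ARCH = torch.cuda.get_device_properties(...)` query in the traceback below: moving the query into a cached function keeps HIP from initializing during import.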

HollowMan6 avatar Mar 20 '25 19:03 HollowMan6

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Mar 20 '25 19:03 github-actions[bot]

#15244 has been merged already, is this still relevant?

DarkLight1337 avatar Mar 21 '25 04:03 DarkLight1337

This PR also did more on the env var management.

houseroad avatar Mar 21 '25 07:03 houseroad

Could you describe how you test this PR and the results?

The PR is tested via https://github.com/OpenRLHF/OpenRLHF/blob/fd81b5ed1d4212e9a34f545aae4955ab1c5a17e2/openrlhf/trainer/ray/vllm_engine.py, with RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 unset, distributed_executor_backend set to ray, and tensor_parallel_size = 2. Everything works correctly with this PR now.

Without this PR, if we don't set RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1, it produces the following error:

Error executing method 'init_worker'. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "vllm/vllm/worker/worker_base.py", line 612, in execute_method
    return run_method(self, method, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/vllm/utils.py", line 2221, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
    return method(self, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/vllm/worker/worker_base.py", line 594, in init_worker
    self.worker = worker_class(**kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/vllm/worker/worker.py", line 82, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
                                            ^^^^^^^^^^^^^^^^^
  File "vllm/vllm/worker/model_runner.py", line 1062, in __init__
    self.attn_backend = get_attn_backend(
                        ^^^^^^^^^^^^^^^^^
  File "vllm/vllm/attention/selector.py", line 95, in get_attn_backend
    return _cached_get_attn_backend(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/vllm/attention/selector.py", line 154, in _cached_get_attn_backend
    return resolve_obj_by_qualname(attention_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "vllm/vllm/utils.py", line 1904, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "vllm/vllm/attention/backends/rocm_flash_attn.py", line 27, in <module>
    _GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "site-packages/torch/cuda/__init__.py", line 576, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No HIP GPUs are available

HollowMan6 avatar Mar 21 '25 18:03 HollowMan6

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @HollowMan6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 01 '25 08:04 mergify[bot]

cc @mgoin for discussions under https://github.com/vllm-project/vllm/pull/12695#discussion_r2478925809

HollowMan6 avatar Oct 30 '25 17:10 HollowMan6

/gemini review

HollowMan6 avatar Oct 31 '25 10:10 HollowMan6

@codex review

HollowMan6 avatar Oct 31 '25 10:10 HollowMan6

Codex Review: Didn't find any major issues. 🎉

â„šī¸ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".