[V1] Optimized the `determine_available_memory` method for v1
FIX https://github.com/vllm-project/vllm/issues/17979
In the v0 code path, three snapshots — before_create, before_profile, and after_profile — were used to track GPU usage, which enables a more precise calculation of non_torch_memory for the current server.
This PR migrates that approach from v0 to v1.
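For reference, here is a minimal sketch of that staged-snapshot idea. This is not the actual vLLM implementation; the helper names are illustrative only.

```python
# Illustrative sketch of the three-stage measurement described above:
# snapshot GPU memory before worker creation, before profiling, and after
# profiling, then derive non-torch memory from the gap between what the
# driver reports and what torch's caching allocator has reserved.
import torch


def snapshot():
    free, total = torch.cuda.mem_get_info()      # driver-level view
    return {
        "used": total - free,                    # everything on the device
        "torch_reserved": torch.cuda.memory_reserved(),  # torch allocator
    }


def non_torch_memory(before_create, after_profile):
    # Memory that appeared during this process's lifetime but is not tracked
    # by torch (CUDA context, NCCL buffers, cudagraphs, ...).
    used_delta = after_profile["used"] - before_create["used"]
    return used_delta - after_profile["torch_reserved"]
```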
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run the remaining CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Test:
This is my launch.json:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "OpenAPI",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
            "args": [
                "--model=/root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B",
                "--enable-prefix-caching",
                "--trust-remote-code",
                "--gpu-memory-utilization=0.3",
                "--max_model_len=4096"
            ],
            "env": {
                // "VLLM_USE_V1": "0",
            }
        },
        {
            "name": "OpenAPI1",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
            "args": [
                "--model=/root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B",
                "--enable-prefix-caching",
                "--trust-remote-code",
                "--gpu-memory-utilization=0.3",
                "--max_model_len=4096"
            ],
            "env": {
                // "VLLM_USE_V1": "0",
            }
        }
    ]
}
Start up the first server.
GPU info:
root@ubuntu:~# nvidia-smi
Sat May 17 04:51:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10 Off | 00000000:03:00.0 Off | 0 |
| 0% 48C P0 59W / 150W | 7619MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 228669 C /usr/bin/python3 7610MiB |
+-----------------------------------------------------------------------------------------+
The second server:
root@ubuntu:~# nvidia-smi
Sat May 17 04:55:03 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10 Off | 00000000:03:00.0 Off | 0 |
| 0% 52C P0 60W / 150W | 15234MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 228669 C /usr/bin/python3 7610MiB |
| 0 N/A N/A 229345 C /usr/bin/python3 7610MiB |
+-----------------------------------------------------------------------------------------+
Starting two servers on a single GPU works normally.
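For context, here is how the numbers above work out, assuming --gpu-memory-utilization is interpreted in v1 as a cap on the whole instance. The roughly 700 MiB gap per process is the non-torch / CUDA-graph overhead discussed later in this thread.

```python
# Rough arithmetic on the nvidia-smi output above (A10, 23028 MiB total).
total_mib = 23028
budget_mib = 0.3 * total_mib        # --gpu-memory-utilization=0.3 -> ~6908 MiB
observed_mib = 7610                 # per-process usage reported by nvidia-smi
print(f"budget={budget_mib:.0f} MiB, observed={observed_mib} MiB, "
      f"gap={observed_mib - budget_mib:.0f} MiB")
```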
/cc @DarkLight1337 Can you help take a look at this PR? Thanks!
LGTM, thanks for the fix!
@youkaichao Can you help me with this merge?
We are currently pausing non-essential PRs to get CI green for the release
@calvin0327 please merge main to see if ci passes.
@youkaichao @DarkLight1337 The CI is green except for ci/pr. Could you please take a look? Thanks.
@youkaichao @DarkLight1337 This method is not applicable to v1. v1 occupies more memory than v0, and once the model starts, its memory usage exceeds the configured limit. At present, I haven't found where the extra memory is coming from.
@calvin0327 @youkaichao I'll take a detailed look once I'm done reproing an urgent DP/TP bug, but I think this PR looks slightly better than mine, so we should likely go with this one. I added a log message that exits early if not enough memory is available, so it would be good to add that here as well.
@ProExpertProg Okay, I will add the log message in this PR. Several PRs in the community are currently addressing this issue, but they still have some problems; I will also improve this PR based on everyone's fixes.
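For illustration, the kind of early-exit check being discussed might look like the following; the function and parameter names are hypothetical, not vLLM's actual API.

```python
# Hypothetical early-exit check: if profiling leaves no room for the KV cache,
# log a clear message and fail fast instead of crashing later on allocation.
import logging

logger = logging.getLogger(__name__)


def check_available_memory(available_kv_cache_bytes: int) -> None:
    if available_kv_cache_bytes <= 0:
        logger.error(
            "No memory left for the KV cache after loading the model and "
            "profiling (available=%.2f GiB); increase gpu_memory_utilization "
            "or reduce max_model_len / model size.",
            available_kv_cache_bytes / 2**30,
        )
        raise ValueError("Insufficient GPU memory for the KV cache")
```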
Is there a reason you removed the calls to memory_profiling? I liked that utility but this is fine as well
@ProExpertProg @youkaichao There is no significant difference between the memory_profiling approach and the current one. After digging further, I found that the core issue lies not here but in compile_or_warm_up_model / capture_model: that step consumes a substantial amount of GPU memory which we fail to account for, so total GPU memory usage exceeds expectations.
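One way to confirm this is to compare driver-reported free memory around the warm-up step. A hedged sketch under that assumption; the worker call below stands in for the real call site:

```python
# Illustrative measurement of the CUDA-graph capture overhead described above:
# compare driver-reported free memory before and after the warm-up/capture step.
import torch


def measure_capture_overhead(worker):
    torch.cuda.synchronize()
    free_before, _ = torch.cuda.mem_get_info()
    worker.compile_or_warm_up_model()            # includes capture_model()
    torch.cuda.synchronize()
    free_after, _ = torch.cuda.mem_get_info()
    overhead_mib = (free_before - free_after) / 2**20
    print(f"warm-up/capture used ~{overhead_mib:.0f} MiB not accounted for "
          "in the KV-cache sizing")
    return overhead_mib
```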
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @calvin0327.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@calvin0327 #18974 landed and #19312 is adding more cleanup - should we close this PR and add any additional improvements into #19312?
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Superseded by #18974 and #19312