[V1] Optimized the `determine_available_memory` method for v1
FIX https://github.com/vllm-project/vllm/issues/17979
In the v0 code path, three snapshots — before_create, before_profile, and after_profile — were used to track GPU usage, which enables a more precise calculation of non_torch_memory for the current server.
This PR migrates that approach from v0 to v1.
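For reference, here is a minimal sketch of that staged-snapshot idea. This is not the actual vLLM implementation; the helper names are illustrative only.

```python
# Illustrative sketch of the three-stage measurement described above:
# snapshot GPU memory before worker creation, before profiling, and after
# profiling, then derive non-torch memory from the gap between what the
# driver reports and what torch's caching allocator has reserved.
import torch


def snapshot():
    free, total = torch.cuda.mem_get_info()      # driver-level view
    return {
        "used": total - free,                    # everything on the device
        "torch_reserved": torch.cuda.memory_reserved(),  # torch allocator
    }


def non_torch_memory(before_create, after_profile):
    # Memory that appeared during this process's lifetime but is not tracked
    # by torch (CUDA context, NCCL buffers, cudagraphs, ...).
    used_delta = after_profile["used"] - before_create["used"]
    return used_delta - after_profile["torch_reserved"]
```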
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run the remaining CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Test:
This is my launch.json:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "OpenAPI",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
            "args": [
                "--model=/root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B",
                "--enable-prefix-caching",
                "--trust-remote-code",
                "--gpu-memory-utilization=0.3",
                "--max_model_len=4096"
            ],
            "env": {
                // "VLLM_USE_V1": "0",
            }
        },
        {
            "name": "OpenAPI1",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
            "args": [
                "--model=/root/.cache/modelscope/hub/models/Qwen/Qwen3-0.6B",
                "--enable-prefix-caching",
                "--trust-remote-code",
                "--gpu-memory-utilization=0.3",
                "--max_model_len=4096"
            ],
            "env": {
                // "VLLM_USE_V1": "0",
            }
        }
    ]
}
Start up the first server.
GPU info:
root@ubuntu:~# nvidia-smi
Sat May 17 04:51:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10 Off | 00000000:03:00.0 Off | 0 |
| 0% 48C P0 59W / 150W | 7619MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 228669 C /usr/bin/python3 7610MiB |
+-----------------------------------------------------------------------------------------+
The second server:
root@ubuntu:~# nvidia-smi
Sat May 17 04:55:03 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10 Off | 00000000:03:00.0 Off | 0 |
| 0% 52C P0 60W / 150W | 15234MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 228669 C /usr/bin/python3 7610MiB |
| 0 N/A N/A 229345 C /usr/bin/python3 7610MiB |
+-----------------------------------------------------------------------------------------+
Starting two servers on a single GPU works normally.
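For context, here is how the numbers above work out, assuming --gpu-memory-utilization is interpreted in v1 as a cap on the whole instance. The roughly 700 MiB gap per process is the non-torch / CUDA-graph overhead discussed later in this thread.

```python
# Rough arithmetic on the nvidia-smi output above (A10, 23028 MiB total).
total_mib = 23028
budget_mib = 0.3 * total_mib        # --gpu-memory-utilization=0.3 -> ~6908 MiB
observed_mib = 7610                 # per-process usage reported by nvidia-smi
print(f"budget={budget_mib:.0f} MiB, observed={observed_mib} MiB, "
      f"gap={observed_mib - budget_mib:.0f} MiB")
```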
/cc @DarkLight1337 Can you help take a look at this PR? Thanks!
LGTM, thanks for the fix!
@youkaichao Can you help me with this merge?
We are currently pausing non-essential PRs to get CI green for the release
@calvin0327 please merge main to see if ci passes.
@youkaichao @DarkLight1337 The CI is green except for ci/pr. Could you please take a look? Thanks.
@youkaichao @DarkLight1337 This method is not applicable to v1. v1 occupies more memory than v0, and once the model starts, its memory usage exceeds the configured limit. At present, I haven't found where the extra memory is coming from.
@calvin0327 @youkaichao I'll take a detailed look once I'm done reproing an urgent DP/TP bug, but I think this PR looks slightly better than mine, so we should likely go with this one. I added a log message that exits early if not enough memory is available, so it would be good to add that here as well.
@ProExpertProg Okay, I will add the log message in this PR. Several PRs in the community are currently addressing this issue, but they still have some problems; I will also improve this PR based on everyone's fixes.
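For illustration, the kind of early-exit check being discussed might look like the following; the function and parameter names are hypothetical, not vLLM's actual API.

```python
# Hypothetical early-exit check: if profiling leaves no room for the KV cache,
# log a clear message and fail fast instead of crashing later on allocation.
import logging

logger = logging.getLogger(__name__)


def check_available_memory(available_kv_cache_bytes: int) -> None:
    if available_kv_cache_bytes <= 0:
        logger.error(
            "No memory left for the KV cache after loading the model and "
            "profiling (available=%.2f GiB); increase gpu_memory_utilization "
            "or reduce max_model_len / model size.",
            available_kv_cache_bytes / 2**30,
        )
        raise ValueError("Insufficient GPU memory for the KV cache")
```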
Is there a reason you removed the calls to memory_profiling? I liked that utility but this is fine as well
@ProExpertProg @youkaichao There is no significant difference between the memory_profiling approach and the current one. After digging further, I found that the core issue lies not here but in compile_or_warm_up_model / capture_model: that step consumes a substantial amount of GPU memory which we fail to account for, so total GPU memory usage exceeds expectations.
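One way to confirm this is to compare driver-reported free memory around the warm-up step. A hedged sketch under that assumption; the worker call below stands in for the real call site:

```python
# Illustrative measurement of the CUDA-graph capture overhead described above:
# compare driver-reported free memory before and after the warm-up/capture step.
import torch


def measure_capture_overhead(worker):
    torch.cuda.synchronize()
    free_before, _ = torch.cuda.mem_get_info()
    worker.compile_or_warm_up_model()            # includes capture_model()
    torch.cuda.synchronize()
    free_after, _ = torch.cuda.mem_get_info()
    overhead_mib = (free_before - free_after) / 2**20
    print(f"warm-up/capture used ~{overhead_mib:.0f} MiB not accounted for "
          "in the KV-cache sizing")
    return overhead_mib
```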
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @calvin0327.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@calvin0327 #18974 landed and #19312 is adding more cleanup - should we close this PR and add any additional improvements into #19312?
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Superseded by #18974 and #19312