[Usage]:
Your current environment
ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0
How would you like to use vllm
I want to decrease the vRAM usage of vLLM by using the parameter gpu-memory-utilisation, but I notice that it does not work at all. No matter what value I set for gpu-memory-utilisation, it never behaves as expected. For example, my single GPU has 64 GB: when I set --gpu-memory-utilisation 0.2, the actual vRAM usage reaches 46 GB, and strangely, setting --gpu-memory-utilisation 0.8 reduces vRAM usage to 38 GB.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
You misspelled the parameter name; it should be --gpu-memory-utilization, with a "z" in it.
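For reference, a minimal example of the corrected flag (the model path here is just a placeholder):

vllm serve /path/to/your-model --gpu-memory-utilization 0.2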
Yeah, I misspelled it here, but it's correct in my script; otherwise an error would occur. The resources taken still do not change at all.
Can you show the full command you used?
vllm-serve.sh
vllm serve /home/nvidia/zsx/ckpt/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096
PS: my machine is a Jetson AGX Orin 64 GB, so maybe it's an adaptation issue?
--gpu-memory-utilization refers to VRAM (memory) usage, not the computation usage of the card.
If you run nvidia-smi it should show that only 20% of the memory is used
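For example, a quick way to check just the memory figures (standard nvidia-smi query flags):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv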
Unfortunately, the result of nvidia-smi is the same as the jtop one, meaning it always takes 40 GB for gpu-memory-utilization=0.2. I noticed that it does not work on H20 either. Another issue is that I cannot run vLLM across multiple GPUs: I tried setting tensor-parallel-size=8, but it always runs on GPU 0. I also tried export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 without any effect.
my script:
vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    --num-gpus 1 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096
my code:
import requests
import json

# Send a chat completion request to the vLLM OpenAI-compatible server
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {"model": "/path/to/Qwen2.5-VL-7B-Instruct", "max_tokens": 1, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "say something"}]}
response = requests.post(url, headers=headers, data=json.dumps(data))
Can you show how you're launching vLLM? Are you using command line directly or are you launching it from another script?
In my previous comment, I mentioned that I do inference using vllm serve and requests.post
I see you set both --num-gpus and --tensor-parallel-size. But --num-gpus isn't a parameter in vLLM, so an error should result from that.
Edit: Just noticed that you're using an old version of vLLM. Maybe @youkaichao has more context about this then
It's just because --tensor-parallel-size does not work, so I was trying to use this parameter.
Just to check whether CUDA_VISIBLE_DEVICES is working properly, can you try importing vanilla PyTorch and see if idle memory is allocated in the correct GPUs?
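A minimal sketch of such a check (assumes PyTorch with CUDA support; the tensor size is arbitrary):

import torch

# With CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 this should report 8 devices
print(torch.cuda.device_count())

# Allocate a small tensor on each visible GPU and keep a reference,
# then check nvidia-smi to confirm memory appears on the expected devices
tensors = [torch.zeros(1024, 1024, device=f"cuda:{i}")
           for i in range(torch.cuda.device_count())]
input("Check nvidia-smi now, then press Enter to exit")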
Of course, it allocates correctly.
cc @youkaichao
Finally I figured out this issue by using -tp instead of --tensor-parallel-size, which does not work at all!!!
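For reference, the invocation that ended up working looks roughly like this (paths are placeholders):

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    -tp 8 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096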
So is it the same problem as with gpu-memory-utilization?
I suggest updating vLLM to see if the problem goes away
But the vLLM version is already the latest one (0.8.5.post1).
According to your first post, you used ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0
Can you show what your latest setup/code looks like now?
Yeah, at the beginning I was running on the AGX Orin, and then I moved to my server (8*H20, Ubuntu 24.04, vllm==0.8.5.post1).
For the original problem: the Jetson AGX Orin has unified memory, so when a process uses CPU memory, it will also appear in GPU memory monitoring tools, I think.
Hmm, I don't think so, because the H20 shows the same phenomenon.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!