
[Usage]: --gpu-memory-utilization does not seem to take effect

zswodegit opened this issue 5 months ago · 15 comments

Your current environment

ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

How would you like to use vllm

I want to decrease the VRAM usage of vLLM with the gpu-memory-utilisation parameter, but I notice that it does not seem to work at all. No matter what value I set for gpu-memory-utilisation, it never behaves as expected. For example, my single GPU has 64 GB: when I set --gpu-memory-utilisation 0.2, the actual VRAM usage reaches 46 GB, and strangely, setting --gpu-memory-utilisation 0.8 reduces the VRAM usage to 38 GB.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

zswodegit avatar May 25 '25 10:05 zswodegit

You misspelled the parameter name, it should be --gpu-memory-utilization with a "z" in it
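For reference, a minimal invocation with the corrected spelling (model path illustrative):

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    --gpu-memory-utilization 0.2 \
    --max-model-len 4096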

DarkLight1337 avatar May 25 '25 14:05 DarkLight1337

You misspelled the parameter name, it should be --gpu-memory-utilization with a "z" in it

Yeah, I misspelled it here, but it is spelled correctly in my script; otherwise an error would occur. And the memory usage still does not change at all.

zswodegit avatar May 26 '25 02:05 zswodegit

Can you show the full command you used?

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

Can you show the full command you used?

vllm-serve.sh:

vllm serve /home/nvidia/zsx/ckpt/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096

(screenshot attached)

PS: my machine is a Jetson AGX Orin 64 GB, so maybe this is a platform adaptation issue?

zswodegit avatar May 26 '25 03:05 zswodegit

--gpu-memory-utilization refers to VRAM (memory) usage, not the computation usage of the card.
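The same option exists in the offline Python API, where it caps the fraction of each GPU's memory that vLLM reserves for the model weights plus the KV cache; if the weights alone exceed that fraction, startup will typically fail. A minimal sketch (model path illustrative):

from vllm import LLM

llm = LLM(
    model="/path/to/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.2,  # cap vLLM at roughly 20% of the card's memory
    max_model_len=4096,
)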

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

If you run nvidia-smi it should show that only 20% of the memory is used
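One way to watch it over time, assuming nvidia-smi is available on the platform (on a Jetson, jtop serves the same purpose):

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv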

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

If you run nvidia-smi it should show that only 20% of the memory is used

Unfortunately, the result of nvidia-smi is the same as the jtop one, meaning it always takes about 40 GB with gpu-memory-utilization=0.2. I noticed that it does not work on H20 either. Another issue is that I cannot run vLLM across multiple GPUs: I tried setting tensor-parallel-size=8, but it always runs on GPU 0 only. I also tried export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 without any effect.

my script:

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    --num-gpus 1 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096

my code:

import json
import requests

# Query the OpenAI-compatible chat completions endpoint exposed by vllm serve
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "/path/to/Qwen2.5-VL-7B-Instruct",
    "max_tokens": 1,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "say something"},
    ],
}
response = requests.post(url, headers=headers, data=json.dumps(data))

(screenshot attached)

zswodegit avatar May 27 '25 03:05 zswodegit

Can you show how you're launching vLLM? Are you using command line directly or are you launching it from another script?

DarkLight1337 avatar May 27 '25 03:05 DarkLight1337

Can you show how you're launching vLLM? Are you using command line directly or are you launching it from another script?

In my previous comment, I mentioned that I do inference using vllm serve and requests.post

zswodegit avatar May 27 '25 05:05 zswodegit

I see you set both --num-gpus and --tensor-parallel-size. But --num-gpus isn't a parameter in vLLM, so an error should result from that.

Edit: Just noticed that you're using an old version of vLLM. Maybe @youkaichao has more context about this then
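As a quick sanity check, the accepted flags can be listed with the CLI's own help; an unrecognized flag such as --num-gpus should make argument parsing fail at startup:

vllm serve --help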

DarkLight1337 avatar May 27 '25 05:05 DarkLight1337

I see you set both --num-gpus and --tensor-parallel-size. But --num-gpus isn't a parameter in vLLM, so an error should result from that.

Edit: Just noticed that you're using an old version of vLLM. Maybe @youkaichao has more context about this then

It is just because --tensor-parallel-size did not work, so I was trying that parameter instead.

zswodegit avatar May 27 '25 05:05 zswodegit

Just to check whether CUDA_VISIBLE_DEVICES is working properly, can you try importing vanilla PyTorch and see if idle memory is allocated in the correct GPUs?
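A minimal check, assuming CUDA_VISIBLE_DEVICES is exported in the same shell before starting Python:

import os
import torch

# Show which devices PyTorch can see after CUDA_VISIBLE_DEVICES filtering
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())

# Touch every visible GPU so a small CUDA context shows up in nvidia-smi/jtop
for i in range(torch.cuda.device_count()):
    torch.ones(1, device=f"cuda:{i}")

input("Check nvidia-smi in another terminal, then press Enter to exit")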

DarkLight1337 avatar May 27 '25 06:05 DarkLight1337

Just to check whether CUDA_VISIBLE_DEVICES is working properly, can you try importing vanilla PyTorch and see if idle memory is allocated in the correct GPUs?

Of course, it allocates correctly.

(screenshots attached)

zswodegit avatar May 27 '25 06:05 zswodegit

cc @youkaichao

DarkLight1337 avatar May 27 '25 06:05 DarkLight1337

cc @youkaichao

Finally I figured out this issue by using -tp instead of --tensor-parallel-size, which does not work at all!
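For reference, a sketch of the working invocation, based on my earlier script (paths abbreviated; -tp is the short alias for the tensor-parallel flag):

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    -tp 8 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096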

zswodegit avatar May 28 '25 01:05 zswodegit

cc @youkaichao

Finally I figured out this issue by using -tp instead of --tensor-parallel-size, which does not work at all!

So is it the same problem with gpu-memory-utilization?

zswodegit avatar May 28 '25 02:05 zswodegit

I suggest updating vLLM to see if the problem goes away

DarkLight1337 avatar May 28 '25 02:05 DarkLight1337

I suggest updating vLLM to see if the problem goes away

But the vLLM version is already the latest one (0.8.5.post1).

zswodegit avatar May 28 '25 05:05 zswodegit

According to your first post, you used ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

DarkLight1337 avatar May 28 '25 05:05 DarkLight1337

Can you show what your latest setup/code looks like now?

DarkLight1337 avatar May 28 '25 05:05 DarkLight1337

According to your first post, you used ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

Yeah, at the beginning I was running on the AGX Orin, and then I moved to my server (8*H20, Ubuntu 24.04, vllm==0.8.5.post1).

zswodegit avatar May 28 '25 08:05 zswodegit

For the original problem: the Jetson AGX Orin has unified memory, so when a process uses CPU memory, it will also appear in GPU memory monitoring tools, I think.

youkaichao avatar Jun 03 '25 08:06 youkaichao

For the original problem: the Jetson AGX Orin has unified memory, so when a process uses CPU memory, it will also appear in GPU memory monitoring tools, I think.

Hmm, I don't think so, because the H20 shows the same phenomenon.

zswodegit avatar Jun 06 '25 08:06 zswodegit

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Sep 05 '25 02:09 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Oct 06 '25 02:10 github-actions[bot]