
[Usage]: --gpu-memory-utilization does not seem to take effect

zswodegit opened this issue 5 months ago · 15 comments

Your current environment

ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

How would you like to use vllm

I want to decrease the VRAM usage of vLLM with the gpu-memory-utilisation parameter, but I notice that it does not seem to work at all. No matter what value I set for gpu-memory-utilisation, it never behaves as expected. For example, my single GPU has 64 GB: when I set --gpu-memory-utilisation 0.2, the actual VRAM usage reaches 46 GB, and strangely, setting --gpu-memory-utilisation 0.8 reduces the VRAM usage to 38 GB.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

zswodegit avatar May 25 '25 10:05 zswodegit

You misspelled the parameter name, it should be --gpu-memory-utilization with a "z" in it
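For reference, a minimal invocation with the corrected spelling (model path illustrative):

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    --gpu-memory-utilization 0.2 \
    --max-model-len 4096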

DarkLight1337 avatar May 25 '25 14:05 DarkLight1337

You misspelled the parameter name, it should be --gpu-memory-utilization with a "z" in it

Yeah, I misspelled it here, but it is spelled correctly in my script; otherwise an error would occur. And the memory usage still does not change at all.

zswodegit avatar May 26 '25 02:05 zswodegit

Can you show the full command you used?

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

Can you show the full command you used?

vllm-serve.sh:

vllm serve /home/nvidia/zsx/ckpt/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096

(screenshot attached)

PS: my machine is a Jetson AGX Orin 64 GB, so maybe this is a platform adaptation issue?

zswodegit avatar May 26 '25 03:05 zswodegit

--gpu-memory-utilization refers to VRAM (memory) usage, not the computation usage of the card.
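The same option exists in the offline Python API, where it caps the fraction of each GPU's memory that vLLM reserves for the model weights plus the KV cache; if the weights alone exceed that fraction, startup will typically fail. A minimal sketch (model path illustrative):

from vllm import LLM

llm = LLM(
    model="/path/to/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.2,  # cap vLLM at roughly 20% of the card's memory
    max_model_len=4096,
)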

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

If you run nvidia-smi it should show that only 20% of the memory is used
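One way to watch it over time, assuming nvidia-smi is available on the platform (on a Jetson, jtop serves the same purpose):

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv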

DarkLight1337 avatar May 26 '25 03:05 DarkLight1337

If you run nvidia-smi it should show that only 20% of the memory is used

Unfortunately, the result of nvidia-smi is the same as the jtop one, meaning it always takes about 40 GB with gpu-memory-utilization=0.2. I noticed that it does not work on H20 either. Another issue is that I cannot run vLLM across multiple GPUs: I tried setting tensor-parallel-size=8, but it always runs on GPU 0 only. I also tried export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 without any effect.

my script:

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 1 \
    --num-gpus 1 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096

my code:

import json
import requests

# Query the OpenAI-compatible chat completions endpoint exposed by vllm serve
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "/path/to/Qwen2.5-VL-7B-Instruct",
    "max_tokens": 1,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "say something"},
    ],
}
response = requests.post(url, headers=headers, data=json.dumps(data))

(screenshot attached)

zswodegit avatar May 27 '25 03:05 zswodegit

Can you show how you're launching vLLM? Are you using command line directly or are you launching it from another script?

DarkLight1337 avatar May 27 '25 03:05 DarkLight1337

Can you show how you're launching vLLM? Are you using command line directly or are you launching it from another script?

In my previous comment, I mentioned that I do inference using vllm serve and requests.post

zswodegit avatar May 27 '25 05:05 zswodegit

I see you set both --num-gpus and --tensor-parallel-size. But --num-gpus isn't a parameter in vLLM, so an error should result from that.

Edit: Just noticed that you're using an old version of vLLM. Maybe @youkaichao has more context about this then
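As a quick sanity check, the accepted flags can be listed with the CLI's own help; an unrecognized flag such as --num-gpus should make argument parsing fail at startup:

vllm serve --help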

DarkLight1337 avatar May 27 '25 05:05 DarkLight1337

I see you set both --num-gpus and --tensor-parallel-size. But --num-gpus isn't a parameter in vLLM, so an error should result from that.

Edit: Just noticed that you're using an old version of vLLM. Maybe @youkaichao has more context about this then

It is just because --tensor-parallel-size did not work, so I was trying that parameter instead.

zswodegit avatar May 27 '25 05:05 zswodegit

Just to check whether CUDA_VISIBLE_DEVICES is working properly, can you try importing vanilla PyTorch and see if idle memory is allocated in the correct GPUs?
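A minimal check, assuming CUDA_VISIBLE_DEVICES is exported in the same shell before starting Python:

import os
import torch

# Show which devices PyTorch can see after CUDA_VISIBLE_DEVICES filtering
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())

# Touch every visible GPU so a small CUDA context shows up in nvidia-smi/jtop
for i in range(torch.cuda.device_count()):
    torch.ones(1, device=f"cuda:{i}")

input("Check nvidia-smi in another terminal, then press Enter to exit")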

DarkLight1337 avatar May 27 '25 06:05 DarkLight1337

Just to check whether CUDA_VISIBLE_DEVICES is working properly, can you try importing vanilla PyTorch and see if idle memory is allocated in the correct GPUs?

Of course, it allocates correctly.

(screenshots attached)

zswodegit avatar May 27 '25 06:05 zswodegit

cc @youkaichao

DarkLight1337 avatar May 27 '25 06:05 DarkLight1337

cc @youkaichao

Finally I figured out this issue by using -tp instead of --tensor-parallel-size, which does not work at all!
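For reference, a sketch of the working invocation, based on my earlier script (paths abbreviated; -tp is the short alias for the tensor-parallel flag):

vllm serve /path/to/Qwen2.5-VL-7B-Instruct \
    -tp 8 \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096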

zswodegit avatar May 28 '25 01:05 zswodegit

cc @youkaichao

Finally I figured out this issue by using -tp instead of --tensor-parallel-size, which does not work at all!

So is it the same problem with gpu-memory-utilization?

zswodegit avatar May 28 '25 02:05 zswodegit

I suggest updating vLLM to see if the problem goes away

DarkLight1337 avatar May 28 '25 02:05 DarkLight1337

I suggest updating vLLM to see if the problem goes away

But the vLLM version is already the latest one (0.8.5.post1).

zswodegit avatar May 28 '25 05:05 zswodegit

According to your first post, you used ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

DarkLight1337 avatar May 28 '25 05:05 DarkLight1337

Can you show what your latest setup/code looks like now?

DarkLight1337 avatar May 28 '25 05:05 DarkLight1337

According to your first post, you used ubuntu 22.04, jetson agx orin 64 G, vllm==0.7.0

Yeah, at the beginning I was running on the AGX Orin, and then I moved to my server (8*H20, Ubuntu 24.04, vllm==0.8.5.post1).

zswodegit avatar May 28 '25 08:05 zswodegit

For the original problem: the Jetson AGX Orin has unified memory, so when a process uses CPU memory, it will also appear in GPU memory monitoring tools, I think.

youkaichao avatar Jun 03 '25 08:06 youkaichao

For the original problem: the Jetson AGX Orin has unified memory, so when a process uses CPU memory, it will also appear in GPU memory monitoring tools, I think.

Hmm, I don't think so, because the H20 shows the same phenomenon.

zswodegit avatar Jun 06 '25 08:06 zswodegit

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Sep 05 '25 02:09 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Oct 06 '25 02:10 github-actions[bot]