B60: running multiple instances on a single card offloads the model to CPU
I used ZE_AFFINITY_MASK=0 with tp=1 to run multiple instances on a single B60 card. The instances serve the same model on different ports. The problem is that the model gets offloaded to CPU and system memory instead of raising an OOM error. The launch script is below:
```shell
export ZE_AFFINITY_MASK=0
export TORCH_LLM_ALLREDUCE=1
export VLLM_USE_V1=1
export CCL_ZE_IPC_EXCHANGE=pidfd
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python3 -m vllm.entrypoints.openai.api_server \
    --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name DeepSeek-R1-Distill-Qwen-7B \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --disable-sliding-window \
    --gpu-memory-util=0.9 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens=8192 \
    --disable-log-requests \
    --max-model-len=8192 \
    --block-size 64 \
    --tensor-parallel-size 1 \
    --reasoning-parser deepseek_r1
```

(Note: the original command passed both `--tensor-parallel-size 1` and `-tp=1`; the redundant `-tp=1` has been dropped.)
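For reference, a minimal sketch of how a second instance of the same model might be started on the same card. The port (8001) and the reduced `--gpu-memory-util` value are my assumptions, not from the report: with each instance requesting 0.9 of VRAM, two co-located instances cannot both fit, so a common approach is to split the budget between them.

```shell
# Sketch: second instance of the same model on the same physical card.
# ZE_AFFINITY_MASK=0 pins this instance to GPU 0, same as the first one.
# Only --port differs; --gpu-memory-util is lowered so both instances
# can reserve VRAM without exceeding the card (assumed value, tune as needed).
export ZE_AFFINITY_MASK=0
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

python3 -m vllm.entrypoints.openai.api_server \
    --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name DeepSeek-R1-Distill-Qwen-7B \
    --dtype=float16 \
    --enforce-eager \
    --gpu-memory-util=0.45 \
    --tensor-parallel-size 1 \
    --port 8001
```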
The trend of system memory usage (screenshot attached):
The trend of video memory (VRAM) usage (screenshot attached):
The model contains about 8 billion parameters, so its FP16 weights alone should occupy an estimated 16 GB of video memory.
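The estimate above follows from FP16 storing each parameter in 2 bytes; a quick sanity check (the helper name is mine, and the figure covers weights only, not KV cache or activations):

```python
def fp16_weight_footprint_gb(num_params: float) -> float:
    """Rough VRAM needed just for the model weights in FP16 (2 bytes/param)."""
    bytes_per_param = 2  # FP16 = 16 bits
    return num_params * bytes_per_param / 1e9

# An ~8B-parameter model needs roughly 16 GB for weights alone,
# before KV cache and activation memory are added on top.
print(f"{fp16_weight_footprint_gb(8e9):.1f} GB")  # 16.0 GB
```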
Could you help me figure out this problem? It is blocking customer enablement. Thanks!
Please contact me by email or Teams if any further details are needed. Email: [email protected]
Hi @Lucas-cai,
Thank you for your contribution! We appreciate your engagement with this issue. To proceed further, we require some additional information. We have already been in touch with you via direct message on Intel’s internal system to request these details. At this point, we are waiting for your response with the requested information. Please note that once we receive your input, this issue will be further dispatched and handled through our internal processes. We will keep you updated as we make progress.