FastChat
Model loading with vllm_worker hangs
Problem description
When launching on GPUs 2 and 3, the worker hangs and never finishes loading (the pairs 1+2 and 1+3 both start without any problem).
CUDA version: 12.1.0, driver version: 535.54.03, torch: 2.1.2, fschat: 0.2.34, vllm: 0.2.6, ray: 2.8.1
Launch command
CUDA_VISIBLE_DEVICES="2,3" python -m fastchat.serve.vllm_worker \
--model-names="qwen-72b-chat" \
--model-path="/Models/Qwen-72B-Chat" \
--controller-address=${CONTROLLER_ADDRESS} \
--worker-address=${WORKER_ADDRESS} \
--host=${WORKER_HOST} \
--port=${WORKER_PORT} \
--trust-remote-code \
--gpu-memory-utilization=0.98 \
--dtype=bfloat16 \
--tensor-parallel-size=2 \
> z_server_worker.log 2>&1
Log output
2023-12-19 07:10:25,057 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-19 07:10:27 llm_engine.py:73] Initializing an LLM engine with config: model='/Models/Qwen-72B-Chat', tokenizer='/Models/Qwen-72B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
WARNING 12-19 07:10:28 tokenizer.py:62] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
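The worker stops after this warning and never finishes engine initialization. One way to narrow this down (an assumption on my part, not something from the report above): since tensor parallelism relies on NCCL communication between the two cards, it is worth checking whether CUDA peer-to-peer access actually works for the 2/3 pair. A minimal sketch:

# check_p2p.py -- minimal CUDA peer-to-peer check (assumption: the hang is
# caused by a broken P2P/NVLink path between physical GPUs 2 and 3).
import torch

# With CUDA_VISIBLE_DEVICES="2,3" the two cards appear to torch as devices 0 and 1.
dev_a, dev_b = 0, 1
print("visible GPUs:", torch.cuda.device_count())
print(f"peer access {dev_a}->{dev_b}:", torch.cuda.can_device_access_peer(dev_a, dev_b))
print(f"peer access {dev_b}->{dev_a}:", torch.cuda.can_device_access_peer(dev_b, dev_a))

Run it with the same CUDA_VISIBLE_DEVICES="2,3" as the worker. If peer access is reported as False only for this pair, nvidia-smi topo -m will show the link topology between the cards, and relaunching the worker with NCCL_P2P_DISABLE=1 set in the environment (a standard NCCL variable, not specific to FastChat or vllm) is a common workaround to try.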