FastChat
Model loading with vllm_worker hangs
Problem description
When launching on GPUs 2 and 3, the worker hangs and never finishes loading (the pairs 1+2 and 1+3 both start without any problem).
CUDA version: 12.1.0, driver version: 535.54.03, torch: 2.1.2, fschat: 0.2.34, vllm: 0.2.6, ray: 2.8.1
Launch command
CUDA_VISIBLE_DEVICES="2,3" python -m fastchat.serve.vllm_worker \
--model-names="qwen-72b-chat" \
--model-path="/Models/Qwen-72B-Chat" \
--controller-address=${CONTROLLER_ADDRESS} \
--worker-address=${WORKER_ADDRESS} \
--host=${WORKER_HOST} \
--port=${WORKER_PORT} \
--trust-remote-code \
--gpu-memory-utilization=0.98 \
--dtype=bfloat16 \
--tensor-parallel-size=2 \
> z_server_worker.log 2>&1
Log output
2023-12-19 07:10:25,057 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-19 07:10:27 llm_engine.py:73] Initializing an LLM engine with config: model='/Models/Qwen-72B-Chat', tokenizer='/Models/Qwen-72B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
WARNING 12-19 07:10:28 tokenizer.py:62] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
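The worker stops after this warning and never finishes engine initialization. One way to narrow this down (an assumption on my part, not something from the report above): since tensor parallelism relies on NCCL communication between the two cards, it is worth checking whether CUDA peer-to-peer access actually works for the 2/3 pair. A minimal sketch:

# check_p2p.py -- minimal CUDA peer-to-peer check (assumption: the hang is
# caused by a broken P2P/NVLink path between physical GPUs 2 and 3).
import torch

# With CUDA_VISIBLE_DEVICES="2,3" the two cards appear to torch as devices 0 and 1.
dev_a, dev_b = 0, 1
print("visible GPUs:", torch.cuda.device_count())
print(f"peer access {dev_a}->{dev_b}:", torch.cuda.can_device_access_peer(dev_a, dev_b))
print(f"peer access {dev_b}->{dev_a}:", torch.cuda.can_device_access_peer(dev_b, dev_a))

Run it with the same CUDA_VISIBLE_DEVICES="2,3" as the worker. If peer access is reported as False only for this pair, nvidia-smi topo -m will show the link topology between the cards, and relaunching the worker with NCCL_P2P_DISABLE=1 set in the environment (a standard NCCL variable, not specific to FastChat or vllm) is a common workaround to try.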