Add vllm_worker support for lora_modules
Usage

Start the vLLM worker with the LoRA adapters enabled:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m fastchat.serve.vllm_worker \
    --model-path /data/models/Qwen/Qwen2-72B-Instruct \
    --tokenizer /data/models/Qwen/Qwen2-72B-Instruct \
    --enable-lora \
    --lora-modules m1=/data/modules/lora/adapter/m1 m2=/data/modules/lora/adapter/m2 m3=/data/modules/lora/adapter/m3 \
    --model-names qwen2-72b-instruct,m1,m2,m3 \
    --controller http://localhost:21001 \
    --host 0.0.0.0 \
    --num-gpus 8 \
    --port 31034 \
    --limit-worker-concurrency 100 \
    --worker-address http://localhost:31034
```
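Once the worker has registered with the controller, each adapter is exposed as its own model name. As a quick check, you can list the models through the OpenAI-compatible API server (assuming it is running and reachable at the same `fastchat_address:port` placeholder used in the examples below); the response should include `qwen2-72b-instruct`, `m1`, `m2`, and `m3`:

```bash
# List the model names currently served through the OpenAI-compatible API.
curl --location --request GET 'http://fastchat_address:port/v1/models' \
  --header 'Authorization: Bearer sk-xxx'
```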
Post

The adapters and the base model are all addressed through the `model` field of a standard chat completion request.

- Example 1: chat completion against LoRA adapter `m1`
```bash
curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer sk-xxx' \
  --data-raw '{
    "model": "m1",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
```
- Example 2: chat completion against the base model `qwen2-72b-instruct`
```bash
curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer sk-xxx' \
  --data-raw '{
    "model": "qwen2-72b-instruct",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
```
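Streaming works the same way; the sketch below assumes the same deployment and placeholders as the examples above, and only flips `"stream"` to `true` so the response arrives as server-sent events:

```bash
# Same chat completion, here against adapter m2, but streamed; -N turns off
# curl's output buffering so the SSE chunks are printed as they arrive.
curl -N --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer sk-xxx' \
  --data-raw '{
    "model": "m2",
    "stream": true,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
```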
Why are these changes needed?

`fastchat.serve.vllm_worker` previously had no way to pass LoRA adapters through to vLLM. With `--enable-lora` and `--lora-modules`, a single worker can serve the base model together with several LoRA adapters, each addressable by its own model name (see the usage above).
Related issue number (if applicable)
Checks
- [x] I've run `format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed.
- [x] I've made sure the relevant tests are passing (if applicable).