
Add vllm_worker support for lora_modules

Opened by x22x22 · 1 comment

Usage

Start the worker

# "spawn" avoids CUDA re-initialization errors in forked vLLM worker processes
export VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m fastchat.serve.vllm_worker \
    --model-path /data/models/Qwen/Qwen2-72B-Instruct \
    --tokenizer /data/models/Qwen/Qwen2-72B-Instruct \
    --enable-lora \
    --lora-modules m1=/data/modules/lora/adapter/m1 m2=/data/modules/lora/adapter/m2 m3=/data/modules/lora/adapter/m3 \
    --model-names qwen2-72b-instruct,m1,m2,m3 \
    --controller http://localhost:21001 \
    --host 0.0.0.0 \
    --num-gpus 8 \
    --port 31034 \
    --limit-worker-concurrency 100 \
    --worker-address http://localhost:31034
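
With `--lora-modules`, each `name=path` pair registers a LoRA adapter under its own model name. As a rough illustration (a hypothetical sketch, not the PR's actual code; `parse_lora_modules` and `lora_for` are made-up helper names), the worker can translate those pairs into vLLM `LoRARequest` objects, which is the mechanism vLLM uses to select an adapter per request:

```python
from vllm.lora.request import LoRARequest

def parse_lora_modules(pairs):
    """Turn ["m1=/path/m1", ...] into {adapter_name: LoRARequest}."""
    requests = {}
    for idx, pair in enumerate(pairs, start=1):
        name, path = pair.split("=", 1)
        # Each adapter needs a unique positive integer id (lora_int_id).
        requests[name] = LoRARequest(name, idx, path)
    return requests

LORA_REQUESTS = parse_lora_modules([
    "m1=/data/modules/lora/adapter/m1",
    "m2=/data/modules/lora/adapter/m2",
    "m3=/data/modules/lora/adapter/m3",
])

def lora_for(model_name):
    # Returns None for the base model name, so vLLM generates without an adapter.
    return LORA_REQUESTS.get(model_name)
```

At request time, the model name from the API call determines which adapter (if any) is passed to the engine.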

POST requests

  • Example 1: query LoRA adapter m1
curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-xxx' \
--data-raw '{
    "model": "m1",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
  • Example 2: query the base model qwen2-72b-instruct directly (a Python equivalent of both calls follows below)
curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-xxx' \
--data-raw '{
    "model": "qwen2-72b-instruct",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
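
Since FastChat's OpenAI-compatible server handles both calls identically, the same requests can be issued with the openai Python client (v1 API). A minimal sketch; the base URL, port, and API key are the same placeholders used in the curl examples above:

```python
from openai import OpenAI

# "fastchat_address:port" and "sk-xxx" are placeholders, as in the curl examples.
client = OpenAI(base_url="http://fastchat_address:port/v1", api_key="sk-xxx")

resp = client.chat.completions.create(
    model="m1",  # set to "qwen2-72b-instruct" to target the base model instead
    temperature=0.7,
    top_p=0.1,
    max_tokens=4096,
    messages=[{"role": "user", "content": "Hi?"}],
)
print(resp.choices[0].message.content)
```

Selecting "m1", "m2", or "m3" routes generation through the corresponding LoRA adapter, while the base model name bypasses the adapters entirely.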

Why are these changes needed?

vLLM can serve multiple LoRA adapters alongside a base model, but fastchat.serve.vllm_worker previously exposed no way to register them. This change adds --enable-lora and --lora-modules so a single worker can serve the base model plus several name=path adapters, each addressable by its own model name.

Related issue number (if applicable)

Checks

  • [x] I've run format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed.
  • [x] I've made sure the relevant tests are passing (if applicable).

x22x22 · Sep 24, 2024