Qwen2.5 [TGI] Problem deploying a qwen2-7b inference service on V100

Open charosen opened this issue 6 months ago • 1 comments

Can the qwen2-7b model be deployed on a V100 GPU with the TGI or vLLM framework?

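I deployed it directly with the TGI container image and hit an error saying that flash attention is not supported. I then set USE_FLASH_ATTENTION=false, but the TGI deployment still fails with an error.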

MASTER_PORT="$(shuf -n 1 -i 60000-65000)"

docker run -e CUDA_VISIBLE_DEVICES=3 \
    -e USE_FLASH_ATTENTION=false \
    --gpus all \
    --network host \
    --ipc host \
    --shm-size="1G" \
    --privileged \
    --volume=/mnt/user/hj/code/LLaMA-Factory/model:/deploy \
    -it \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --master-port ${MASTER_PORT} \
    --model-id /deploy/TGI/Qwen2-7B-GenModel-V1 \
    --num-shard 1 \
    --port 9401 \
    --router-name tgi-router-qwen2 \
    --max-concurrent-requests 300 \
    --max-top-n-tokens 1 \
    --max-input-length 16000 \
    --max-total-tokens 20000 \
    --waiting-served-ratio 2 \
    --max-batch-prefill-tokens 16000 \
    --max-waiting-tokens 256 \
    --trust-remote-code
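
A quick way to confirm that the GPU itself is the limiting factor is to look at its compute capability. The sketch below assumes that FlashAttention-2 kernels target Ampere or newer GPUs (compute capability 8.0 and above), while a V100 reports 7.0. The helper name supports_flash_attn_v2 is hypothetical, and the commented nvidia-smi query assumes a driver recent enough to expose the compute_cap field.

```shell
#!/bin/sh
# Hypothetical helper (not part of TGI): given a compute capability string
# such as "7.0", print "yes" if FlashAttention-2 can be expected to run,
# assuming it requires compute capability 8.0 or higher.
supports_flash_attn_v2() {
    cap="$1"
    major="${cap%%.*}"   # keep only the part before the first dot
    if [ "$major" -ge 8 ]; then
        echo yes
    else
        echo no
    fi
}

# V100 (compute capability 7.0)
supports_flash_attn_v2 7.0   # prints "no"

# On the host, the value could be read for GPU 3 like this
# (assumption: the installed nvidia-smi supports the compute_cap field):
# cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 3)"
# supports_flash_attn_v2 "$cap"
```

If this prints "no" for the target card, the flash attention error is expected on that hardware regardless of the container settings, and only a non-flash code path can work.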

Error messages

Error when USE_FLASH_ATTENTION is not set:

Error after setting USE_FLASH_ATTENTION=false:

Aug 22 '24 05:08 charosen
