
[Bug]: Segmentation fault (core dumped)

Open · ChuanhongLi opened this issue 8 months ago · 1 comment

Your current environment

(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine         0.5.3        /workspace/home/lich/aphrodite-engine

🐛 Describe the bug

I installed aphrodite-engine from source with the command "pip install -e .":

When I use aphrodite-engine to serve the model quantized with QuIP# (downloaded from https://huggingface.co/keyfan/Qwen1.5-72B-Chat-2bit):

CUDA_VISIBLE_DEVICES=4 aphrodite run /workspace/data3/lich/Qwen1.5-72B-Chat-2bit/ -q quip --max-model-len 2048 --gpu-memory-utilization 0.5 --swap-space 0

a segmentation fault (core dumped) occurs:

(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# CUDA_VISIBLE_DEVICES=4 aphrodite run /workspace/data3/lich/Qwen1.5-72B-Chat-2bit/ -q quip --max-model-len 2048 --gpu-memory-utilization 0.5 --swap-space 0
WARNING:  quip quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = '/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/'
INFO:     Speculative Config = None
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = quip
INFO:     Context Length = 2048
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  The tokenizer's vocabulary size 151646 does not match the model's vocabulary size 152064.
INFO:     Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO:     Using XFormers backend.
INFO:     Model weights loaded. Memory usage: 20.93 GiB x 1 = 20.93 GiB
INFO:     # GPU blocks: 420, # CPU blocks: 0
INFO:     Minimum concurrency: 3.28x
INFO:     Maximum sequence length allowed in the cache: 6720
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Using the default chat template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [14786]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:2242 (Press CTRL+C to quit)
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO:     Received request cmpl-7ac664b9039c4c749e8b2392908bb5f1-0: prompt: '', sampling_params: SamplingParams(mirostat_mode=2, mirostat_tau=6.5, mirostat_eta=0.2), lora_request: None.
Segmentation fault (core dumped)
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all#
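Since the crash happens in native code, the log above ends without a Python traceback. One way to capture one (a debugging sketch, not something tried in the report) is CPython's standard-library faulthandler, which can also be enabled without code changes by exporting PYTHONFAULTHANDLER=1 before running `aphrodite run`:

```python
# Sketch: enable faulthandler before the engine starts so that CPython
# dumps every thread's Python stack to stderr when a fatal signal such
# as SIGSEGV arrives. That dump usually points at the quantization
# kernel call that crashed.
import faulthandler

faulthandler.enable()            # install handlers for SIGSEGV, SIGFPE, ...
assert faulthandler.is_enabled()  # sanity check that the handler is active
```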

The client request is:

curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/",
  "prompt": "Hello, who are you?",
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'
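For reference, the same request can be reproduced from Python using only the standard library; this is a sketch of the curl call above (the endpoint, model path, and mirostat fields mirror the report, and the server must already be listening on localhost:2242):

```python
import json
import urllib.request

# Payload copied from the curl request in the report.
PAYLOAD = {
    "model": "/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/",
    "prompt": "Hello, who are you?",
    "stream": False,
    "mirostat_mode": 2,
    "mirostat_tau": 6.5,
    "mirostat_eta": 0.2,
}

def send_completion(base_url="http://localhost:2242"):
    """POST the payload to the /v1/completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())
```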

Do you have any idea what causes this error?

Thanks!

ChuanhongLi · Jun 03 '24 07:06