aphrodite-engine
[Bug]: Segmentation fault (core dumped)
Your current environment
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine 0.5.3 /workspace/home/lich/aphrodite-engine
🐛 Describe the bug
I installed aphrodite-engine from source with `pip install -e .`:
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine 0.5.3 /workspace/home/lich/aphrodite-engine
When I use aphrodite-engine to serve a model quantized with QuIP# (downloaded from https://huggingface.co/keyfan/Qwen1.5-72B-Chat-2bit):
CUDA_VISIBLE_DEVICES=4 aphrodite run /workspace/data3/lich/Qwen1.5-72B-Chat-2bit/ -q quip --max-model-len 2048 --gpu-memory-utilization 0.5 --swap-space 0
a segmentation fault (core dumped) occurs:
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# CUDA_VISIBLE_DEVICES=4 aphrodite run /workspace/data3/lich/Qwen1.5-72B-Chat-2bit/ -q quip --max-model-len 2048 --gpu-memory-utilization 0.5 --swap-space 0
WARNING: quip quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = '/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = quip
INFO: Context Length = 2048
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: The tokenizer's vocabulary size 151646 does not match the model's vocabulary size 152064.
INFO: Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO: Using XFormers backend.
INFO: Model weights loaded. Memory usage: 20.93 GiB x 1 = 20.93 GiB
INFO: # GPU blocks: 420, # CPU blocks: 0
INFO: Minimum concurrency: 3.28x
INFO: Maximum sequence length allowed in the cache: 6720
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Using the default chat template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [14786]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:2242 (Press CTRL+C to quit)
INFO: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO: Received request cmpl-7ac664b9039c4c749e8b2392908bb5f1-0: prompt: '', sampling_params: SamplingParams(mirostat_mode=2, mirostat_tau=6.5, mirostat_eta=0.2), lora_request: None.
Segmentation fault (core dumped)
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all#
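In case it helps narrow this down: since the crash is a native segfault, one way to get at least the Python-level location is to enable the standard-library faulthandler before launching the server (for a CLI process this can be done by exporting `PYTHONFAULTHANDLER=1` before the `aphrodite run` command). A minimal sketch of what it does:

```python
import faulthandler

# faulthandler makes CPython dump the Python traceback of every thread
# when the process receives a fatal signal such as SIGSEGV. Exporting
# PYTHONFAULTHANDLER=1 enables the same behavior at interpreter startup.
faulthandler.enable()
print(faulthandler.is_enabled())  # → True
```

This will not show the failing CUDA kernel, but it does reveal which Python call (e.g. the QuIP dequantization op) was active when the signal arrived.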
The client request is:
curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/",
"prompt": "Hello, who are you?",
"stream": false,
"mirostat_mode": 2,
"mirostat_tau": 6.5,
"mirostat_eta": 0.2
}'
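For reference, the same request expressed with only the Python standard library (the URL, model path, and sampling parameters are copied verbatim from the curl call above; this is just an equivalent reproduction, nothing new):

```python
import json
import urllib.request

URL = "http://localhost:2242/v1/completions"

def build_payload():
    """Assemble the same JSON body the curl command sends."""
    return {
        "model": "/workspace/data3/lich/Qwen1.5-72B-Chat-2bit/",
        "prompt": "Hello, who are you?",
        "stream": False,
        "mirostat_mode": 2,
        "mirostat_tau": 6.5,
        "mirostat_eta": 0.2,
    }

def send():
    """POST the payload to the running server and return the parsed reply."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send()` against the server above triggers the segfault on the first request, every time.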
Do you have any idea what is causing this error?
Thanks!