[Usage]: Input prompt (2501 tokens) is too long and exceeds limit of 2048
Your current environment
env: 16*H800
model: DeepSeek-R1
version: 0.7.2
start script: python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 80 --max-model-len 128000 --trust-remote-code --pipeline-parallel-size 2 --tensor-parallel-size 8 --gpu-memory-utilization 0.8 --served-model-name deepseek --model /mnt/workspace/models/public-models/llm/DeepSeek-R1
WARNING 02-16 15:37:32 scheduler.py:947] Input prompt (2501 tokens) is too long and exceeds limit of 2048
INFO 02-16 15:37:32 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
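For reference, a minimal sketch of the kind of request that triggers this warning, assuming the server started with the script above is reachable at http://localhost:80/v1 and serves the model under the name "deepseek":

```python
# Minimal repro sketch (assumptions: base_url and served model name match
# the start script above; any prompt longer than 2048 tokens triggers the
# scheduler warning).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:80/v1", api_key="EMPTY")

# Roughly 2500+ tokens of input, enough to exceed a 2048-token limit.
long_prompt = "hello " * 2600

resp = client.completions.create(
    model="deepseek",
    prompt=long_prompt,
    max_tokens=64,
)
print(resp.choices[0].text)
```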
How would you like to use vllm
Which parameter of the startup command can solve this problem?
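One way to narrow this down is to ask the running server what limit it is actually enforcing. The sketch below queries the OpenAI-compatible /v1/models endpoint; depending on the vLLM version, the model card in the response may report the effective max_model_len (an assumption to verify against your build), which helps confirm whether the 2048 limit is coming from the server configuration or from somewhere else.

```python
# Sketch: inspect what the running server reports about its models.
# Assumption: server is reachable at http://localhost:80 as in the start script.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:80/v1/models") as r:
    # Pretty-print the raw response; look for a max_model_len field if present.
    print(json.dumps(json.load(r), indent=2))
```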
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.