Ricardo Lu
Can vLLM achieve performance like FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goals you want to achieve. BTW, vLLM really accelerates...
Adapted from https://github.com/lm-sys/FastChat/blob/v0.2.14/fastchat/serve/openai_api_server.py. Tested on vicuna-7b-v1.3 and WizardCoder.
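For reference, here is a minimal sketch of querying an OpenAI-compatible server like the one adapted from FastChat's openai_api_server.py. The host, port, and served model name are assumptions for illustration, not values from the original report.

```python
# Minimal sketch of calling an OpenAI-compatible /v1/chat/completions endpoint.
# The API base URL and model name below are assumptions for illustration.
import requests

API_BASE = "http://localhost:8000/v1"  # assumed server address

payload = {
    "model": "vicuna-7b-v1.3",  # assumed name under which the model is served
    "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
    "temperature": 0.7,
    "max_tokens": 128,
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```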
Right now vLLM allocates 90% of GPU memory on each accessible GPU card, but when the server is launched with an AWQ model, the behavior becomes unpredictable. I run an AWQ-format...
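As a sketch of the workaround, vLLM exposes a `gpu_memory_utilization` knob (default 0.9, i.e. the 90% mentioned above) alongside the `quantization` option, so the reserved fraction can be lowered when loading an AWQ checkpoint. The model path below is a placeholder.

```python
# Sketch: loading an AWQ-quantized checkpoint with a lower GPU memory fraction.
# The model path is a placeholder; gpu_memory_utilization defaults to 0.9 (90%).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model-awq",      # placeholder path to an AWQ checkpoint
    quantization="awq",              # tell vLLM the weights are AWQ-quantized
    gpu_memory_utilization=0.7,      # reserve less than the default 90%
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```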
### Motivation In the code-llama deploy tutorial, the quantization chapter remains to be done. When will this feature be finished? ### Related resources _No response_ ### Additional context _No response_
## ❓ General Questions In every DecodeStep(), it calls [SampleTokenFromLogits()](https://github.com/mlc-ai/mlc-llm/blob/3d25d9da762aab7cd89bfffb8b310f515b2ddabb/cpp/llm_chat.cc#L1208) to sample from the logits, and it reads the generation config each time, which may become a bottleneck on devices with a weak CPU...
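To illustrate the concern, here is a Python sketch (not mlc-llm's actual C++ code) of hoisting the generation-config parsing out of the per-token decode loop, so the sampler does not re-read it on every step. All names in the sketch are hypothetical.

```python
# Illustrative sketch only: parse the generation config once before the decode
# loop instead of inside every sampling call. Model/sampler APIs are hypothetical.
import json
from dataclasses import dataclass

@dataclass
class SamplingConfig:
    temperature: float = 0.7
    top_p: float = 0.95

def parse_generation_config(raw_json: str) -> SamplingConfig:
    cfg = json.loads(raw_json)
    return SamplingConfig(
        temperature=cfg.get("temperature", 0.7),
        top_p=cfg.get("top_p", 0.95),
    )

def decode(model, prompt_tokens, raw_config_json, max_new_tokens=128):
    cfg = parse_generation_config(raw_config_json)  # parsed once, reused below
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)                        # hypothetical API
        next_tok = model.sample(logits, cfg.temperature, cfg.top_p)
        tokens.append(next_tok)
        if next_tok == model.eos_token_id:
            break
    return tokens
```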
**Description** When inferring with `response = await client.infer()`, it takes a long time for the Triton server to release the output...
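A minimal async repro sketch with Triton's asyncio gRPC client is shown below; the model name, input/output tensor names, shape, and dtype are assumptions for illustration, not taken from the original report.

```python
# Minimal async repro sketch using Triton's asyncio gRPC client.
# Model name, tensor names, shape, and dtype are assumed placeholders.
import asyncio
import numpy as np
from tritonclient.grpc import InferInput, InferRequestedOutput
from tritonclient.grpc.aio import InferenceServerClient

async def main():
    client = InferenceServerClient(url="localhost:8001")

    data = np.zeros((1, 16), dtype=np.float32)        # assumed input shape/dtype
    infer_input = InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    requested = InferRequestedOutput("OUTPUT0")

    response = await client.infer(
        model_name="my_model",                        # assumed model name
        inputs=[infer_input],
        outputs=[requested],
    )
    print(response.as_numpy("OUTPUT0"))
    await client.close()

asyncio.run(main())
```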