Ricardo Lu
Can vLLM achieve performance like FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goals you want to achieve. BTW, vLLM really accelerates...
Adapted from https://github.com/lm-sys/FastChat/blob/v0.2.14/fastchat/serve/openai_api_server.py. Tested on vicuna-7b-v1.3 and WizardCoder.
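For reference, here is a minimal sketch of querying an OpenAI-compatible server like the one adapted from FastChat's openai_api_server.py. The host, port, and served model name are assumptions for illustration, not values from the original report.

```python
# Minimal sketch of calling an OpenAI-compatible /v1/chat/completions endpoint.
# The API base URL and model name below are assumptions for illustration.
import requests

API_BASE = "http://localhost:8000/v1"  # assumed server address

payload = {
    "model": "vicuna-7b-v1.3",  # assumed name under which the model is served
    "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
    "temperature": 0.7,
    "max_tokens": 128,
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```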
Right now vLLM allocates 90% of GPU memory on each accessible GPU card, but when the server is launched with an AWQ model, the behavior becomes unpredictable. I run an AWQ-format...
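As a sketch of the workaround, vLLM exposes a `gpu_memory_utilization` knob (default 0.9, i.e. the 90% mentioned above) alongside the `quantization` option, so the reserved fraction can be lowered when loading an AWQ checkpoint. The model path below is a placeholder.

```python
# Sketch: loading an AWQ-quantized checkpoint with a lower GPU memory fraction.
# The model path is a placeholder; gpu_memory_utilization defaults to 0.9 (90%).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model-awq",      # placeholder path to an AWQ checkpoint
    quantization="awq",              # tell vLLM the weights are AWQ-quantized
    gpu_memory_utilization=0.7,      # reserve less than the default 90%
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```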
### Motivation In the code-llama deploy tutorial, the quantization chapter remains to be done. When will this feature be finished? ### Related resources _No response_ ### Additional context _No response_
## ❓ General Questions In every DecodeStep(), it calls [SampleTokenFromLogits()](https://github.com/mlc-ai/mlc-llm/blob/3d25d9da762aab7cd89bfffb8b310f515b2ddabb/cpp/llm_chat.cc#L1208) to sample from the logits, and it reads the generation config each time, which may become a bottleneck on devices with a weak CPU...
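To illustrate the concern, here is a Python sketch (not mlc-llm's actual C++ code) of hoisting the generation-config parsing out of the per-token decode loop, so the sampler does not re-read it on every step. All names in the sketch are hypothetical.

```python
# Illustrative sketch only: parse the generation config once before the decode
# loop instead of inside every sampling call. Model/sampler APIs are hypothetical.
import json
from dataclasses import dataclass

@dataclass
class SamplingConfig:
    temperature: float = 0.7
    top_p: float = 0.95

def parse_generation_config(raw_json: str) -> SamplingConfig:
    cfg = json.loads(raw_json)
    return SamplingConfig(
        temperature=cfg.get("temperature", 0.7),
        top_p=cfg.get("top_p", 0.95),
    )

def decode(model, prompt_tokens, raw_config_json, max_new_tokens=128):
    cfg = parse_generation_config(raw_config_json)  # parsed once, reused below
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)                        # hypothetical API
        next_tok = model.sample(logits, cfg.temperature, cfg.top_p)
        tokens.append(next_tok)
        if next_tok == model.eos_token_id:
            break
    return tokens
```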
**Description** When inferring with `response = await client.infer()`, it takes a long time for the Triton server to release the output...
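A minimal async repro sketch with Triton's asyncio gRPC client is shown below; the model name, input/output tensor names, shape, and dtype are assumptions for illustration, not taken from the original report.

```python
# Minimal async repro sketch using Triton's asyncio gRPC client.
# Model name, tensor names, shape, and dtype are assumed placeholders.
import asyncio
import numpy as np
from tritonclient.grpc import InferInput, InferRequestedOutput
from tritonclient.grpc.aio import InferenceServerClient

async def main():
    client = InferenceServerClient(url="localhost:8001")

    data = np.zeros((1, 16), dtype=np.float32)        # assumed input shape/dtype
    infer_input = InferInput("INPUT0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    requested = InferRequestedOutput("OUTPUT0")

    response = await client.infer(
        model_name="my_model",                        # assumed model name
        inputs=[infer_input],
        outputs=[requested],
    )
    print(response.as_numpy("OUTPUT0"))
    await client.close()

asyncio.run(main())
```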