[vllm] Output quality degradation when optimizing GPU usage for faster inference
Start Date
9/3/2024
Implementation PR
No response
Reference Issues
No response
Summary
When using vLLM to make optimal use of GPU memory for faster inference and generation, there is a noticeable degradation in output quality compared to the original model. This issue aims to identify the cause of the quality drop and find ways to match the original model's output quality while keeping the speed improvements.
Basic Example
No complete example is available yet.
Drawbacks
- The current optimization leads to decreased output quality.
- Users may have to choose between speed and quality, which is not ideal.
- Balancing speed and quality may increase configuration complexity.
Unresolved questions
- What specific aspects of the optimization are causing the quality degradation?
- Are there any configuration parameters that can be tuned to improve quality without sacrificing speed?
- Is it possible to implement a dynamic system that adjusts optimization based on the specific task or required quality level?
- How can we quantify and measure the quality degradation to better address the issue? (See the comparison sketch after this list.)
- Are there any alternative optimization techniques that could provide better quality-speed balance?
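One way to approach the tuning and measurement questions above is a controlled A/B comparison: run the same prompts through vLLM and through the original Hugging Face model with greedy decoding, so sampling noise is ruled out and any remaining divergence points at the serving path (prompt handling, dtype, quantization) rather than randomness. The sketch below is only illustrative; the model name and prompt are placeholders, not taken from this issue.

```python
# Illustrative sketch (not from this issue): compare vLLM output against the
# original Hugging Face model under greedy decoding, so that any divergence
# is not just sampling noise. Model name and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
prompts = ["Explain in one paragraph what paged attention does."]

# vLLM side: temperature=0 gives greedy decoding; top_p, repetition_penalty
# and max_tokens are the usual knobs to revisit when quality drops.
llm = LLM(model=MODEL)
vllm_outs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))

# Original model side: the same greedy settings via transformers.
tok = AutoTokenizer.from_pretrained(MODEL)
hf = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
for prompt, v_out in zip(prompts, vllm_outs):
    enc = tok(prompt, return_tensors="pt").to(hf.device)
    gen = hf.generate(**enc, do_sample=False, max_new_tokens=256)
    hf_text = tok.decode(gen[0][enc["input_ids"].shape[1]:], skip_special_tokens=True)
    print("vLLM:", v_out.outputs[0].text.strip())
    print("HF  :", hf_text.strip())
```

With temperature pinned to zero, any remaining difference usually traces back to the prompt/chat template, dtype, or quantization settings rather than sampling, which also makes the degradation easier to quantify (e.g. exact-match or similarity scores over a prompt set).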
We are also evaluating this. The output from the vLLM endpoint was indeed worse until we read https://docs.vllm.ai/en/latest/models/vlm.html and found that a specific prompt template needs to be followed on the vLLM side. Not sure if this may be one factor you can check.
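For reference, the page linked above shows that multimodal models served through vLLM expect the caller to apply the model-specific prompt template (including the image placeholder token). A minimal sketch of that pattern follows; the LLaVA-1.5 model name and the `USER: <image>\n... ASSISTANT:` template mirror the example on that page, and the exact call signature may differ between vLLM versions.

```python
# Minimal sketch of the prompt-template requirement described at
# https://docs.vllm.ai/en/latest/models/vlm.html (model name and template
# follow the LLaVA-1.5 example from that page; image path is a placeholder).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The template must match what the model was trained with; sending a bare
# question without it is a common cause of degraded multimodal outputs.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
image = Image.open("example.jpg")  # placeholder image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```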