[vllm] Output quality degradation when optimizing GPU usage for faster inference
Start Date
9/3/2024
Implementation PR
No response
Reference Issues
No response
Summary
When using vLLM to make optimal use of GPU memory for faster inference and generation, there is a noticeable degradation in output quality compared to the original model. This issue aims to identify the cause of the quality drop and find ways to match the original model's output quality while keeping the speed improvements.
Basic Example
No complete example is available yet.
Drawbacks
- The current optimization leads to decreased output quality.
- Users may have to choose between speed and quality, which is not ideal.
- Balancing speed and quality may increase configuration complexity.
Unresolved questions
- What specific aspects of the optimization are causing the quality degradation?
- Are there any configuration parameters that can be tuned to improve quality without sacrificing speed?
- Is it possible to implement a dynamic system that adjusts optimization based on the specific task or required quality level?
- How can we quantify and measure the quality degradation to better address the issue? (See the comparison sketch after this list.)
- Are there any alternative optimization techniques that could provide better quality-speed balance?
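One way to approach the tuning and measurement questions above is a controlled A/B comparison: run the same prompts through vLLM and through the original Hugging Face model with greedy decoding, so sampling noise is ruled out and any remaining divergence points at the serving path (prompt handling, dtype, quantization) rather than randomness. The sketch below is only illustrative; the model name and prompt are placeholders, not taken from this issue.

```python
# Illustrative sketch (not from this issue): compare vLLM output against the
# original Hugging Face model under greedy decoding, so that any divergence
# is not just sampling noise. Model name and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
prompts = ["Explain in one paragraph what paged attention does."]

# vLLM side: temperature=0 gives greedy decoding; top_p, repetition_penalty
# and max_tokens are the usual knobs to revisit when quality drops.
llm = LLM(model=MODEL)
vllm_outs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))

# Original model side: the same greedy settings via transformers.
tok = AutoTokenizer.from_pretrained(MODEL)
hf = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
for prompt, v_out in zip(prompts, vllm_outs):
    enc = tok(prompt, return_tensors="pt").to(hf.device)
    gen = hf.generate(**enc, do_sample=False, max_new_tokens=256)
    hf_text = tok.decode(gen[0][enc["input_ids"].shape[1]:], skip_special_tokens=True)
    print("vLLM:", v_out.outputs[0].text.strip())
    print("HF  :", hf_text.strip())
```

With temperature pinned to zero, any remaining difference usually traces back to the prompt/chat template, dtype, or quantization settings rather than sampling, which also makes the degradation easier to quantify (e.g. exact-match or similarity scores over a prompt set).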
We are also evaluating this. The output from the vLLM endpoint was indeed worse until we read https://docs.vllm.ai/en/latest/models/vlm.html and found that a specific prompt template needs to be followed on the vLLM side. Not sure if this may be one factor you can check.
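For reference, the page linked above shows that multimodal models served through vLLM expect the caller to apply the model-specific prompt template (including the image placeholder token). A minimal sketch of that pattern follows; the LLaVA-1.5 model name and the `USER: <image>\n... ASSISTANT:` template mirror the example on that page, and the exact call signature may differ between vLLM versions.

```python
# Minimal sketch of the prompt-template requirement described at
# https://docs.vllm.ai/en/latest/models/vlm.html (model name and template
# follow the LLaVA-1.5 example from that page; image path is a placeholder).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The template must match what the model was trained with; sending a bare
# question without it is a common cause of degraded multimodal outputs.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
image = Image.open("example.jpg")  # placeholder image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```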