
Comparison and Quantization Possibilities between vLLM and TensorRT for OpenChat 3.5

Open tsathya98 opened this issue 1 year ago • 0 comments

Hello OpenChat Team,

First and foremost, I would like to express my sincere appreciation for your work on OpenChat 3.5. It's been a go-to model for my projects, and I'm truly impressed by its functionality and performance.

I'm reaching out with a couple of queries related to model optimization for OpenChat 3.5, particularly in the context of vLLM and TensorRT. The README.md notes the use of vLLM for optimizing the API server, which sparked my interest in a deeper comparison.

My primary question is:

  • Has there been any detailed performance comparison between vLLM and TensorRT for the OpenChat 3.5 model? I'm keen on understanding their relative efficiencies and capabilities in practical scenarios.
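
To make the question concrete: what I have in mind is a backend-agnostic throughput harness along the lines of the sketch below. This is only an illustration, not a claim about either project's API; `generate` and `count_tokens` are hypothetical callables one would wrap around vLLM's `LLM.generate` or a TensorRT-LLM runner respectively.

```python
import time


def measure_throughput(generate, prompts, count_tokens):
    """Time one batch of generations and return tokens per second.

    `generate` is any callable that takes a list of prompts and returns
    a list of completions (a wrapper around vLLM or TensorRT-LLM);
    `count_tokens` turns a single completion into a token count.
    """
    start = time.perf_counter()
    outputs = generate(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(count_tokens(o) for o in outputs)
    return total_tokens / elapsed


if __name__ == "__main__":
    # Stub backend for illustration only: "generates" by echoing the prompt.
    fake_generate = lambda prompts: [p + " done" for p in prompts]
    tps = measure_throughput(
        fake_generate, ["hello", "world"], lambda o: len(o.split())
    )
    print(f"{tps:.1f} tokens/s")
```

Running the same harness against both backends on identical prompt sets would be one way to get the apples-to-apples numbers I am asking about.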

Additionally, I'm exploring the possibility of model quantization:

  • Is there a method to quantize the OpenChat 3.5 model to FP16 or BF16, and then utilize it with vLLM? If so, has anyone undertaken this process or can provide guidance on how to approach it?
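
From what I understand, vLLM can load model weights directly in half precision via its `dtype` argument, so FP16/BF16 may not require a separate quantization pass at all. Below is a minimal sketch of what I am picturing; the `vllm_dtype` and `load_openchat` helpers are my own (hypothetical) wrappers, and I am assuming `openchat/openchat_3.5` as the Hugging Face model id. Please correct me if this is not the intended usage.

```python
def vllm_dtype(precision: str) -> str:
    """Map a shorthand precision name to the string vLLM's `dtype`
    argument expects. FP16/BF16 are half-precision weight formats
    that vLLM can load directly."""
    table = {"fp16": "float16", "bf16": "bfloat16"}
    try:
        return table[precision.lower()]
    except KeyError:
        raise ValueError(f"unsupported precision: {precision!r}")


def load_openchat(precision: str = "bf16"):
    """Load OpenChat 3.5 with vLLM at the requested precision.

    Requires the `vllm` package and a CUDA GPU; imported lazily so the
    pure helper above can be exercised without either.
    """
    from vllm import LLM
    return LLM(model="openchat/openchat_3.5", dtype=vllm_dtype(precision))
```

If this is roughly right, my remaining question is whether anyone has compared the quality/throughput of the FP16 and BF16 paths in practice.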

Your insights or directions towards any relevant benchmarks, studies, or documentation would be immensely helpful. As someone who is still exploring LLMs and their optimization techniques, this information is crucial for my ongoing projects and understanding of these technologies.

Thank you for your time and the remarkable effort put into this project.

Best regards, Sathya

tsathya98 · Dec 01 '23 08:12