
[Question] When will lmdeploy support CodeLlama quantization?

Open gesanqiu opened this issue 2 years ago • 7 comments

Motivation

In the CodeLlama deployment tutorial, the quantization chapter is still marked as to-be-done. When will this feature be finished?

Related resources

No response

Additional context

No response

gesanqiu avatar Sep 25 '23 11:09 gesanqiu

After the Mid-Autumn Festival, before 10.20.

lvhan028 avatar Sep 25 '23 13:09 lvhan028

Not realizing LMDeploy didn't already support CodeLlama quants, I ended up AWQ-quantizing Phind's CodeLlama fine-tune; maybe it can be useful for testing: poisson-fish/Phind-CodeLlama-34B-v2-AWQ. The quantization itself completed successfully with no problems; however, running inference on the model obviously doesn't work.

poisson-fish avatar Sep 26 '23 19:09 poisson-fish
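
For reference, a minimal sketch of how the AWQ quantization step is typically invoked. The `lmdeploy lite auto_awq` subcommand and its flags come from the LMDeploy docs; the model id, calibration settings, and work dir below are illustrative assumptions, not what poisson-fish actually ran:

```python
# Hedged sketch: invoking LMDeploy's AWQ quantization CLI from Python.
# The `lmdeploy lite auto_awq` subcommand and flags follow the LMDeploy docs;
# the model id and calibration settings are assumptions for illustration.
import subprocess

subprocess.run(
    [
        "lmdeploy", "lite", "auto_awq",
        "Phind/Phind-CodeLlama-34B-v2",   # HF model id (assumed example)
        "--calib-dataset", "ptb",          # calibration dataset
        "--calib-samples", "128",          # number of calibration samples
        "--calib-seqlen", "2048",          # sequence length for calibration
        "--w-bits", "4",                   # 4-bit weight quantization
        "--w-group-size", "128",           # AWQ quantization group size
        "--work-dir", "./codellama-34b-awq",
    ],
    check=True,
)
```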

@lvhan028 Is this still on plan?

gesanqiu avatar Nov 21 '23 03:11 gesanqiu

@pppppM tried it, but performance decreased significantly after quantization.

lvhan028 avatar Nov 21 '23 03:11 lvhan028

> @pppppM tried it, but performance decreased significantly after quantization.

@lvhan028 @pppppM Can I ask where you hit the bottleneck? Since CodeLlama has the same architecture as Llama-2, why did this happen?

gesanqiu avatar Nov 22 '23 03:11 gesanqiu

@gesanqiu LMDeploy is functionally capable of quantizing CodeLlama, but in practical use we found that performance declines significantly after quantization.

We are still investigating the specific reasons. What we've found so far is that CodeLlama's model weights contain more outliers than Llama2's.

pppppM avatar Nov 22 '23 04:11 pppppM
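
For anyone who wants to reproduce this observation, here is a rough sketch of measuring per-channel weight outliers on a Hugging Face checkpoint. The metric (channel max of |w| over channel mean of |w|) and the 7B model ids are my own illustrative choices, not LMDeploy's internal diagnostic:

```python
# Hedged sketch: comparing per-channel weight outliers between two checkpoints.
# The outlier metric and the model ids are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM

def outlier_ratio(model_id: str) -> float:
    """Average over linear layers of per-output-channel max|w| / mean|w|."""
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    ratios = []
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.detach().abs().float()
            # A large max/mean per channel indicates outlier-heavy weights,
            # which generally hurts low-bit group quantization.
            ratios.append((w.max(dim=1).values / w.mean(dim=1)).mean().item())
    return sum(ratios) / len(ratios)

print("CodeLlama:", outlier_ratio("codellama/CodeLlama-7b-hf"))
print("Llama-2:  ", outlier_ratio("meta-llama/Llama-2-7b-hf"))
```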

> @gesanqiu LMDeploy is functionally capable of quantizing CodeLlama, but in practical use we found that performance declines significantly after quantization. We are still investigating the specific reasons. What we've found so far is that CodeLlama's model weights contain more outliers than Llama2's.

Do you mean you hit an accuracy issue? Might SmoothQuant help with this? And have you tested the throughput or latency of the AWQ CodeLlama model on lmdeploy?

gesanqiu avatar Nov 22 '23 07:11 gesanqiu
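
LMDeploy does ship a SmoothQuant-style W8A8 path. A minimal sketch of how it is invoked, per the LMDeploy docs; the model id and work dir are placeholders:

```python
# Hedged sketch: LMDeploy's SmoothQuant-style W8A8 quantization.
# The `lmdeploy lite smooth_quant` subcommand follows the LMDeploy docs;
# the model id and work dir are placeholders.
import subprocess

subprocess.run(
    [
        "lmdeploy", "lite", "smooth_quant",
        "codellama/CodeLlama-34b-hf",     # HF model id (assumed example)
        "--work-dir", "./codellama-34b-w8a8",
    ],
    check=True,
)
```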

You may try v0.4.2.

lvhan028 avatar Jun 12 '24 03:06 lvhan028
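
For reference, loading the AWQ-quantized model on v0.4.x through the Python API would look roughly like this. `pipeline` and `TurbomindEngineConfig(model_format='awq')` follow the LMDeploy docs; the model path and prompt are placeholders:

```python
# Hedged sketch: running an AWQ-quantized CodeLlama with LMDeploy v0.4.x.
# The model path is a placeholder for the quantized output directory.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "./codellama-34b-awq",  # quantized model dir (placeholder)
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["# Write a quicksort in Python\n"]))
```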