
QUESTION: How do I use both GPTQ and vLLM with the Qwen-14B model?

Open · tms2003 opened this issue 1 year ago · 5 comments

vLLM already supports GPTQ quantization of Qwen-14B 4-bit (in a quantization branch), so how do I use them together in Xinference?

tms2003 · Nov 29 '23

We need to test it; some adaptation work may be required.

aresnow1 · Nov 30 '23

It seems vLLM's support for GPTQ is still in progress (https://github.com/vllm-project/vllm/pull/1580). How did you use vLLM with GPTQ?

aresnow1 · Nov 30 '23

@aresnow1 The vLLM branch code can be found at https://github.com/chu-tianxiang/vllm-gptq/. It is a fork of vLLM, and its changes were submitted upstream (the PR above), so it is effectively the official implementation; however, due to many code conflicts it was not merged into the main branch. After testing, its code is usable.

tms2003 · Nov 30 '23
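
For context, loading a GPTQ-quantized Qwen checkpoint through vLLM's Python API typically looks like the sketch below. It assumes the fork keeps upstream vLLM's `LLM` entry point and accepts a `quantization="gptq"` argument; `Qwen/Qwen-14B-Chat-Int4` is the published 4-bit checkpoint.

```python
# Sketch: run a GPTQ-quantized Qwen-14B directly with vLLM.
# Assumes a vLLM build (e.g. the chu-tianxiang/vllm-gptq fork above)
# whose LLM constructor accepts quantization="gptq".
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen-14B-Chat-Int4",  # published 4-bit GPTQ checkpoint
    quantization="gptq",              # use the GPTQ kernels
    trust_remote_code=True,           # Qwen ships custom modeling code
)

outputs = llm.generate(
    ["What is GPTQ quantization?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```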

It needs some changes to support passing the quantization method; I'll create a PR to support this later.

aresnow1 · Nov 30 '23
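
Once such a PR lands, launching a GPTQ model through Xinference should look roughly like the sketch below. The call follows Xinference's existing `launch_model` client API; whether `model_format="gptq"` is accepted for qwen-chat is precisely what the change would add, so treat that argument as an assumption.

```python
# Sketch: launch a GPTQ-quantized Qwen-14B through Xinference's client.
# model_format="gptq" is the assumed new option this PR would enable;
# the other parameters follow Xinference's existing launch_model API.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # default local endpoint

model_uid = client.launch_model(
    model_name="qwen-chat",
    size_in_billions=14,
    model_format="gptq",   # assumed: select the GPTQ checkpoint
    quantization="Int4",   # 4-bit GPTQ weights
)

model = client.get_model(model_uid)
print(model.chat("Briefly explain GPTQ quantization."))
```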

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] · Aug 08 '24

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot] · Aug 13 '24