QUESTION: How to use both GPTQ and vLLM with the Qwen-14B model?
vLLM already supports GPTQ quantization of Qwen-14B 4-bit (in its quantization branch), so how do I use them together in xinference?
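For context, the launch call I am hoping for on the xinference side would presumably look something like the sketch below; the model name, `model_format="gptq"` value, and quantization label are assumptions about how a GPTQ Qwen-14B build might be exposed, not a confirmed xinference feature at the time of this question.

```python
from xinference.client import Client

# Connect to a locally running xinference endpoint (address is an assumption).
client = Client("http://127.0.0.1:9997")

# Hypothetical launch call: model_format="gptq" and quantization="Int4" are
# assumptions about how a GPTQ-quantized Qwen-14B would be registered,
# not something xinference is confirmed to accept here.
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_size_in_billions=14,
    model_format="gptq",
    quantization="Int4",
)

model = client.get_model(model_uid)
print(model.chat("Hello, who are you?"))
```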
We need to test it; some adaptation work may be required.
It seems vLLM's support for GPTQ is still in progress (https://github.com/vllm-project/vllm/pull/1580). How did you use vLLM with GPTQ?
@aresnow1 The vLLM branch with GPTQ support can be found at https://github.com/chu-tianxiang/vllm-gptq/. It is a fork of vLLM, so any questions can be raised against the official upstream repository. Because of many merge conflicts, the code has not been merged into vLLM's main branch, but after testing, it works.
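For reference, a minimal sketch of how a GPTQ-enabled vLLM (the fork above, or upstream once the linked PR lands) is typically driven through the offline `LLM` API; the model id and the `quantization="gptq"` argument are assumptions based on that work and may differ slightly in the fork:

```python
from vllm import LLM, SamplingParams

# "Qwen/Qwen-14B-Chat-Int4" is used as an example GPTQ checkpoint;
# quantization="gptq" is the argument the GPTQ work exposes upstream,
# which may be named differently in the fork linked above.
llm = LLM(
    model="Qwen/Qwen-14B-Chat-Int4",
    quantization="gptq",
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Briefly introduce the Qwen model."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
for output in outputs:
    print(output.outputs[0].text)
```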
It needs some changes to support passing the quantization method; I'll create a PR to support this later.
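Roughly, the adaptation would amount to threading a `quantization` option from xinference's model config into vLLM's engine arguments. The sketch below shows the idea against vLLM's `AsyncEngineArgs`; the surrounding helper is hypothetical rather than xinference's actual code path.

```python
from typing import Optional

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


def build_engine(model_path: str, quantization: Optional[str] = None) -> AsyncLLMEngine:
    """Hypothetical helper: forward the quantization method (e.g. "gptq")
    from a model spec into vLLM's engine arguments."""
    engine_args = AsyncEngineArgs(
        model=model_path,
        quantization=quantization,  # None means an unquantized model
        trust_remote_code=True,
    )
    return AsyncLLMEngine.from_engine_args(engine_args)


# e.g. build_engine("Qwen/Qwen-14B-Chat-Int4", quantization="gptq")
```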
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.