intel-extension-for-transformers
[vLLM] Optimize vLLM models with QBits.
Type of Change
New feature & API change
Description
- [x] Complete the pipeline that replaces part of vLLM's linear modules with QBits linear modules (ChatGLM2).
- [ ] vLLM integration API design: vllm_model = AutoModelForCausalLM.from_pretrained(args.model, use_vllm=True)
- [ ] Make ITREX work with pytorch==2.3.0 + CPU.
- [ ] Extend acceleration to more models.
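The first checklist item swaps selected vLLM linear modules for QBits-backed ones. A minimal sketch of that replacement step, assuming hypothetical `QBitsLinear` and `replace_linears` names (the real ITREX/QBits classes and the set of targeted projections may differ):

```python
# Illustrative sketch of replacing a model's linear modules with
# QBits-backed ones. Class and function names here are hypothetical,
# not the actual intel-extension-for-transformers API.

class Linear:
    """Stand-in for a vLLM linear layer."""
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

class QBitsLinear(Linear):
    """Hypothetical QBits-accelerated drop-in replacement."""
    @classmethod
    def from_linear(cls, linear):
        # A real implementation would also repack/quantize the weights.
        return cls(linear.in_features, linear.out_features)

class Model:
    def __init__(self, modules):
        self.modules = modules  # name -> module

def replace_linears(model, target_names):
    """Swap only the targeted linear modules (e.g. attention/MLP projections)."""
    for name in target_names:
        mod = model.modules.get(name)
        if isinstance(mod, Linear) and not isinstance(mod, QBitsLinear):
            model.modules[name] = QBitsLinear.from_linear(mod)
    return model

model = Model({"qkv_proj": Linear(4096, 12288), "mlp_up": Linear(4096, 11008)})
replace_linears(model, ["qkv_proj"])
print(type(model.modules["qkv_proj"]).__name__)  # QBitsLinear
print(type(model.modules["mlp_up"]).__name__)    # Linear (untouched)
```

With the proposed `use_vllm=True` flag, a replacement pass like this would run inside `from_pretrained`, so callers get a vLLM model whose hot linear layers are already QBits-accelerated.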
Expected Behavior & Potential Risk
The expected behavior triggered by this PR.
How has this PR been tested?
How to reproduce the test (including hardware information).
Dependency Change?
Any library dependency introduced or removed.