text-generation-inference
A locally EETQ-quantized model cannot be loaded
System Info
Hardware: 4090, PyTorch 2.1.0, Python 3.10 (Ubuntu 22.04), CUDA 12.1
Information
- [ ] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id 'llama2-7b/llama2-7b-eetq' --num-shard 1 --quantize eetq
The model was quantized locally with EETQ and saved with the `save_pretrained()` method (sketched below).
The dependencies on my machine should be fine: GPTQ- and AWQ-quantized models load into TGI for inference without issue; only the EETQ-quantized model fails.
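For reference, the local quantization step was roughly the following. This is a minimal sketch, assuming a transformers version that ships `EetqConfig` and an installed `eetq` package; the model paths are illustrative.

```python
# Minimal sketch of the local EETQ quantization + save step (paths are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

base_model = "llama2-7b"                  # local fp16 checkpoint
output_dir = "llama2-7b/llama2-7b-eetq"   # directory later passed to --model-id

# EETQ does int8 weight-only quantization.
quant_config = EetqConfig("int8")

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Save the quantized weights and config for later reuse.
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```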
Expected behavior
Error 1: CUDA does not support 4-bit and 8-bit
Error 2: `transformers/quantizers/auto.py` does not support the eetq quantization type
- TGI cannot load an already EETQ-quantized model; it quantizes the model after loading the fp16 weights. Loading a pre-quantized EETQ model in TGI is difficult because the quantized weights cannot be concatenated or split, owing to the cutlass preprocessing. It is possible to save a model quantized by TGI and reuse it.
- You should install transformers from source (see the loading sketch after this list).
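As a rough illustration of the loading path suggested above (not the author's exact code): with a transformers build that includes EETQ support, e.g. installed from source, and the `eetq` kernels available, the locally saved checkpoint should load back via `from_pretrained`. The path below is the illustrative one from the reproduction step.

```python
# Sketch of reloading the locally EETQ-quantized checkpoint with transformers (not TGI);
# assumes transformers from source with EETQ support and the eetq package installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "llama2-7b/llama2-7b-eetq"  # illustrative path from the reproduction step

# The quantization_config stored in config.json tells transformers to use the EETQ kernels.
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```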
> It is possible to save a model by TGI and reuse it.
@dtlzhuangz, may I ask how to do that?
Sorry, we do not support loading a pre-quantized EETQ model in TGI; we only support loading it in `transformers`. (The two are different: TGI's kernel fusion and tensor parallelism mean the quantized weights it produces cannot be used from transformers.)
So, is there any way to load the quantized model in TGI?
@Narsil Hi, is it possible to save an EETQ-quantized model and reuse it in TGI as well?