
A locally EETQ-quantized model cannot be loaded

Gongzai-SURE opened this issue 10 months ago

System Info

Hardware: NVIDIA RTX 4090 · PyTorch 2.1.0 · Python 3.10 (Ubuntu 22.04) · CUDA 12.1

Information

  • [ ] Docker
  • [X] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

```shell
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id 'llama2-7b/llama2-7b-eetq' --num-shard 1 --quantize eetq
```

The model was quantized locally with EETQ and saved via `save_pretrained()`.
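For context, a minimal sketch of that quantization step, assuming the EETQ integration in transformers (`EetqConfig`) and illustrative paths — the thread does not show the exact script:

```python
from transformers import AutoModelForCausalLM, EetqConfig

# Quantize the fp16 checkpoint to 8-bit with EETQ while loading,
# then persist the quantized weights with save_pretrained().
quant_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b",                      # illustrative fp16 checkpoint path
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("llama2-7b/llama2-7b-eetq")
```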

The dependencies on my machine should be fine: GPTQ- and AWQ-quantized models feed into the same TGI command and run inference correctly; only the EETQ-quantized model fails.
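For comparison, invocations along these lines work (model paths illustrative):

```shell
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id 'llama2-7b/llama2-7b-gptq' --num-shard 1 --quantize gptq
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id 'llama2-7b/llama2-7b-awq' --num-shard 1 --quantize awq
```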

Expected behavior

Error 1: cuda does not support 4-bit and 8-bit (screenshot attached)

Error 2: transformers/quantizers/auto.py does not support the eetq quantization type (screenshots attached)

Gongzai-SURE · Apr 25 '24

  1. TGI cannot load an already-quantized EETQ model; it can only quantize with EETQ on the fly after loading an fp16 model. Loading a pre-quantized EETQ model in TGI is tough because the quantized weights cannot be concatenated or split, owing to the cutlass preprocessing. It is possible to save a model by TGI and reuse it.
  2. You should install transformers from source (see the command below).
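At the time of this reply, EETQ support had only recently landed in transformers, hence the source install. A minimal sketch of such an install:

```shell
pip install git+https://github.com/huggingface/transformers.git
```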

dtlzhuangz · May 10 '24

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Jun 12 '24

> It is possible to save a model by TGI and reuse it.

@dtlzhuangz, may I ask how to do that?

jiajinyu · Jul 02 '24

> It is possible to save a model by TGI and reuse it.
>
> @dtlzhuangz, may I ask how to do that?

Sorry, we do not support loading a pre-quantized EETQ model in TGI; we only support loading it in `transformers`. (The two paths differ: TGI's kernel fusion and tensor parallelism mean the quantized weights it produces cannot be used from transformers.)
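In other words, a checkpoint quantized and saved via `save_pretrained()` should load back through transformers rather than TGI. A minimal sketch, reusing the illustrative path from the reproduction:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantization_config stored in the checkpoint's config.json lets
# transformers restore the EETQ-quantized weights directly.
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b/llama2-7b-eetq",  # illustrative path from the reproduction
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("llama2-7b/llama2-7b-eetq")
```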

dtlzhuangz · Jul 02 '24

So, is there a way to load the quantized model in TGI somehow?

meitalbensinai · Jul 07 '24

@Narsil Hi, is it possible to save an EETQ-quantized model and reuse it in TGI?

dtlzhuangz · Jul 08 '24