text-generation-inference
A locally EETQ-quantized model cannot be loaded
System Info
Hardware: 4090, PyTorch 2.1.0, Python 3.10 (Ubuntu 22.04), CUDA 12.1
Information
- [ ] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id 'llama2-7b/llama2-7b-eetq' --num-shard 1 --quantize eetq
The model was quantized locally with EETQ and saved with the `save_pretrained()` method (sketched below).
The dependencies on my machine should be fine: GPTQ- and AWQ-quantized models load into TGI for inference without issue; only the EETQ-quantized model fails.
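For reference, the local quantization step was roughly the following. This is a minimal sketch, assuming a transformers version that ships `EetqConfig` and an installed `eetq` package; the model paths are illustrative.

```python
# Minimal sketch of the local EETQ quantization + save step (paths are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

base_model = "llama2-7b"                  # local fp16 checkpoint
output_dir = "llama2-7b/llama2-7b-eetq"   # directory later passed to --model-id

# EETQ does int8 weight-only quantization.
quant_config = EetqConfig("int8")

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Save the quantized weights and config for later reuse.
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```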
Expected behavior
Error 1: CUDA does not support 4-bit and 8-bit
Error 2: `transformers/quantizers/auto.py` does not support the eetq quantization type
- TGI cannot load an already EETQ-quantized model; it quantizes the model after loading the fp16 weights. Loading a pre-quantized EETQ model in TGI is difficult because the quantized weights cannot be concatenated or split, owing to the cutlass preprocessing. It is possible to save a model quantized by TGI and reuse it.
- You should install transformers from source (see the loading sketch after this list).
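As a rough illustration of the loading path suggested above (not the author's exact code): with a transformers build that includes EETQ support, e.g. installed from source, and the `eetq` kernels available, the locally saved checkpoint should load back via `from_pretrained`. The path below is the illustrative one from the reproduction step.

```python
# Sketch of reloading the locally EETQ-quantized checkpoint with transformers (not TGI);
# assumes transformers from source with EETQ support and the eetq package installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "llama2-7b/llama2-7b-eetq"  # illustrative path from the reproduction step

# The quantization_config stored in config.json tells transformers to use the EETQ kernels.
model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```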
> It is possible to save a model by TGI and reuse it.
@dtlzhuangz, may I ask how to do that?
Sorry, we do not support loading a pre-quantized EETQ model in TGI; we only support loading it in `transformers`. (The two are different: TGI's kernel fusion and tensor parallelism mean the quantized weights it produces cannot be used from transformers.)
So, is there any way to load the quantized model in TGI?
@Narsil Hi, is it possible to save an EETQ-quantized model and reuse it in TGI as well?