
Feature Request: Quantized Mixtral

Open · atyshka opened this issue

Glad to see Mixtral support in TensorRT-LLM! Unfortunately, it doesn't currently seem to support AWQ quantization via AMMO; I get the following error from examples/quantization/quantize.py:

Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 200, in <module>
    main()
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 192, in main
    model = quantize_and_export(model,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/quantized/ammo.py", line 111, in quantize_and_export
    raise NotImplementedError(
NotImplementedError: Deploying quantized model MixtralForCausalLM is not supported

I'm sure you're aware of this, but I just wanted to create an issue for tracking quantization support.

Side note, but is there a reason that TensorRT-LLM only supports do-it-yourself quantization and not pre-quantized models like the ones TheBloke publishes on Hugging Face? I imagine many of the users who want quantization do so precisely because they lack the memory for the full model, and quantizing a model like Mixtral takes over 100 GB of VRAM.
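For illustration, this is the kind of workflow being asked for: consuming an already-quantized checkpoint directly instead of re-quantizing from the full-precision weights. Below is a minimal sketch using the Hugging Face transformers AWQ integration rather than TensorRT-LLM; the repo id and the autoawq dependency are assumptions, shown only to make the request concrete.

# Sketch only: load one of TheBloke's pre-quantized AWQ checkpoints from the
# Hub with transformers (needs the autoawq package installed). The repo id
# below is an example upload.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  # example AWQ upload
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # 4-bit weights need far less VRAM than the FP16 model
)

The request here is for TensorRT-LLM's build flow to be able to accept checkpoints like this for Mixtral, instead of requiring the full-precision model plus a quantization pass.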

atyshka · Jan 30 '24

+1

Alienfeel · Feb 02 '24

Yes please, support for pre-quantized models from Hugging Face would be great. I'm not even sure I can use a multi-GPU setup for DIY quantization with TensorRT-LLM, since examples/quantization/quantize.py doesn't expose arguments for that. I was also planning to use an AWQ'd Mixtral when I stumbled upon this issue.
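On the multi-GPU point: outside of TensorRT-LLM's quantize.py, the usual way to even load a ~90 GB FP16 Mixtral checkpoint for calibration is to shard it across devices at load time. A hedged sketch with transformers and accelerate follows; the model id and memory caps are illustrative, and this does not add multi-GPU arguments to quantize.py itself.

# Sketch: shard the full-precision checkpoint across the visible GPUs
# (spilling to CPU RAM if needed) so it can be held in memory at all.
# Requires transformers + accelerate; unrelated to TensorRT-LLM's script.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative model id
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate splits layers across available devices
    max_memory={0: "44GiB", 1: "44GiB", "cpu": "120GiB"},  # example per-device caps
)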

larin92 · Mar 26 '24

Side note, but is there a reason that TensorRT-LLM only supports do-it-yourself quantization and not pre-quantized models like the ones TheBloke publishes on Hugging Face? I imagine many of the users who want quantization do so precisely because they lack the memory for the full model, and quantizing a model like Mixtral takes over 100 GB of VRAM.

Good question; this is a problem for me as well, and something I was wondering about too.

In case it helps, I was able to quantize Mixtral 8x7B with GPTQ as I commented in https://github.com/NVIDIA/TensorRT-LLM/issues/1041#issuecomment-2018773287
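For anyone landing here, the general shape of that approach is that quantization happens at load time from a GPTQConfig plus a calibration dataset, and the result can be saved and reloaded like any other checkpoint. The sketch below uses the transformers/optimum/auto-gptq integration and is an assumption about the setup, not a copy of the steps in the linked comment; model id, bit width, and calibration dataset are illustrative choices.

# Sketch only: GPTQ-quantize Mixtral 8x7B via transformers (needs optimum
# and auto-gptq installed).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,        # 4-bit weights
    dataset="c4",  # one of the built-in calibration dataset options
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                # shard the FP16 weights during calibration
    quantization_config=gptq_config,  # quantize while loading
)

model.save_pretrained("mixtral-8x7b-gptq-4bit")  # reusable quantized checkpoint
tokenizer.save_pretrained("mixtral-8x7b-gptq-4bit")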

tombolano · Mar 26 '24