GPTQ support for quantization
Hi MosaicML.
AutoGPTQ is a package that aims to provide GPTQ quantization support for various LLMs. However, to do so, it places a few requirements on the model implementation.
Here are a few issues:
- `MPTForCausalLM` currently does not return attentions from all the hidden layers.
- The class currently does not implement all the facets of the transformers library - e.g., it supports neither the `device_map=True` setting nor the boolean flag `output_attentions` that can be set to obtain attentions (see `modeling_mpt.py`, line 140).
More on this specific issue can be found here: https://github.com/PanQiWei/AutoGPTQ/issues/69#issuecomment-1556499444
Is it possible to support the changes mentioned above?
Here is the specific line of code in this repository preventing packages like AutoGPTQ from quantizing your MPT models: https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/models/mpt/modeling_mpt.py#L285
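For concreteness, here is a minimal sketch (not from the original issue) of the two transformers capabilities such quantization tooling relies on; `mosaicml/mpt-7b` is just an example checkpoint, and the actual AutoGPTQ calls are omitted since its API may change:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sharded loading via accelerate; this is what fails while the model class
# does not declare how it may be split across devices.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# GPTQ-style calibration needs per-layer attention weights, so the forward
# pass must honor output_attentions=True; MPT currently does not.
outputs = model(**inputs, output_attentions=True)
print(outputs.attentions)  # expected: one attention tensor per layer
```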
Thanks @casperbh96, we are tracking this request. I believe @samhavens is currently discussing it in the HuggingFace issue (https://huggingface.co/mosaicml/mpt-7b/discussions/30).
@hanlint yes, my understanding is that this support would need us to output the attention matrices from the attention module, which can't happen with flash attention, meaning we'd have to write it for the torch path only. We don't spend a lot of time on that path, but if it unblocks a lot of use cases, it shouldn't be too bad to add.
Want to confirm that this is the ask, @casperbh96 , and that you weren't under the impression that you'd also get flash attention.
> output the attention matrices

Yes, the algorithm will need the attention matrices for quantization.

> with flash attention

I cannot give you a definitive answer here - the question is: will the quantization process work fine if flash attention is disabled while quantizing and enabled for inference afterward, without any significant impact on performance? I would think yes, but I am not deep enough into AutoGPTQ to know if this is true.
Maybe @abhinavkulkarni or @PanQiWei can give their opinions on the question above?
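For reference, a rough sketch of that workflow, assuming the `attn_config` dict exposed by the MPT config on the HF Hub; whether swapping implementations after quantization is truly harmless is exactly the open question here:

```python
import torch
import transformers

name = "mosaicml/mpt-7b"  # example checkpoint

# Quantization-time load: use the torch attention path so attention weights
# can actually be returned to the quantization algorithm.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config["attn_impl"] = "torch"
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, torch_dtype=torch.float16, trust_remote_code=True
)
# ... run GPTQ calibration / quantization with this model ...

# Inference-time load: switch back to a fused implementation and load the
# quantized weights instead; the question above is whether those weights
# behave the same under the fused kernel.
config.attn_config["attn_impl"] = "triton"
```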
@casperbh96 @abhinavkulkarni We are working on a PR which adds support for `output_attentions` when using `torch` attention; see #210.
For supporting `device_map="auto"`, I believe the only change we need is to add `_no_split_modules = ["MPTBlock"]` as a class property of `MPTPreTrainedModel`?
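For illustration, the proposed change would look roughly like this (a sketch of the idea, not the actual diff; the existing class attributes are omitted):

```python
from transformers import PreTrainedModel

class MPTPreTrainedModel(PreTrainedModel):
    # Existing attributes (config_class, base_model_prefix, ...) unchanged.
    # Tells accelerate's device_map="auto" machinery never to split a single
    # MPTBlock across devices when sharding the model.
    _no_split_modules = ["MPTBlock"]
```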
Working on the device_map issue as part of #225
Thank you, this should improve compatibility with Hugging Face and enable more applications to be built on top of MPT!
Hi @casperbh96, we have now added `device_map` support in this repo, as well as upgraded the source code for the models uploaded on the HF Hub. Could you try seeing if quantization works now (with `attn_impl=torch`)?
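For anyone following along, a hedged sketch of the kind of quantization attempt being discussed, based on AutoGPTQ's documented usage at the time (exact arguments and defaults may differ); the MPT config on the Hub would additionally need the `attn_impl=torch` override described above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

name = "mosaicml/mpt-7b"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)

# A real run would use a few hundred calibration samples, not one sentence.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(
    name,
    quantize_config,
    trust_remote_code=True,  # MPT ships custom modeling code on the Hub
)
model.quantize(examples)                  # runs GPTQ layer by layer
model.save_quantized("mpt-7b-4bit-gptq")  # writes quantized weights to disk
```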
Hi @abhi-mosaic - development is still ongoing. The recent improvements in foundry have made it much more manageable.
A PR is open in AutoGPTQ where we are trying to quantize MPT. There are still some problems, but I suspect it won't be too long before there is decent support for quantization.
One problem that has been highlighted is that the attention mask is not returned here: https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/models/mpt/modeling_mpt.py#L215
Closing this since it seems foundry has added the required support for GPTQ to run, although AutoGPTQ has not implemented quantization for MPT models yet.