GPTQ support for quantization
Hi MosaicML.
AutoGPTQ is a package that aims to provide GPTQ quantization support for various LLMs. However, to do so, it places a few requirements on the model implementation.
Here are a few issues:
- `MPTForCausalLM` currently does not return attentions from all the hidden layers.
- The class currently does not implement all the facets of the transformers library - e.g., it supports neither the `device_map=True` setting nor the boolean flag `output_attentions` that can be set to obtain attentions (see `modeling_mpt.py`, line 140).
More on this specific issue can be found here: https://github.com/PanQiWei/AutoGPTQ/issues/69#issuecomment-1556499444
Is it possible to support the changes mentioned above?
Here is the specific line of code in this repository preventing packages like AutoGPTQ from quantizing your MPT models: https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/models/mpt/modeling_mpt.py#L285
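For concreteness, here is a minimal sketch (not from the original issue) of the two transformers capabilities such quantization tooling relies on; `mosaicml/mpt-7b` is just an example checkpoint, and the actual AutoGPTQ calls are omitted since its API may change:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sharded loading via accelerate; this is what fails while the model class
# does not declare how it may be split across devices.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# GPTQ-style calibration needs per-layer attention weights, so the forward
# pass must honor output_attentions=True; MPT currently does not.
outputs = model(**inputs, output_attentions=True)
print(outputs.attentions)  # expected: one attention tensor per layer
```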
Thanks @casperbh96, we are tracking this request. I believe @samhavens is currently discussing it in the HuggingFace issue (https://huggingface.co/mosaicml/mpt-7b/discussions/30).
@hanlint yes, my understanding is that this support would need us to output the attention matrices from the attention module, which can't happen with flash attention, meaning we'd have to write it for the torch path only. We don't spend a lot of time on that path, but if it unblocks a lot of use cases, it shouldn't be too bad to add.
Want to confirm that this is the ask, @casperbh96 , and that you weren't under the impression that you'd also get flash attention.
> output the attention matrices

Yes, the algorithm will need the attention matrices for quantization.

> with flash attention

I cannot give you a definitive answer here - the question is: will the quantization process work fine if flash attention is disabled while quantizing and enabled for inference afterward, without any significant impact on performance? I would think yes, but I am not deep enough into AutoGPTQ to know if this is true.
Maybe @abhinavkulkarni or @PanQiWei can give their opinions on the question above?
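For reference, a rough sketch of that workflow, assuming the `attn_config` dict exposed by the MPT config on the HF Hub; whether swapping implementations after quantization is truly harmless is exactly the open question here:

```python
import torch
import transformers

name = "mosaicml/mpt-7b"  # example checkpoint

# Quantization-time load: use the torch attention path so attention weights
# can actually be returned to the quantization algorithm.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config["attn_impl"] = "torch"
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, torch_dtype=torch.float16, trust_remote_code=True
)
# ... run GPTQ calibration / quantization with this model ...

# Inference-time load: switch back to a fused implementation and load the
# quantized weights instead; the question above is whether those weights
# behave the same under the fused kernel.
config.attn_config["attn_impl"] = "triton"
```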
@casperbh96 @abhinavkulkarni We are working on a PR which adds support for `output_attentions` when using `torch` attention; see #210.
For supporting `device_map="auto"`, I believe the only change we need is to add `_no_split_modules = ["MPTBlock"]` as a class property of `MPTPreTrainedModel`?
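For illustration, the proposed change would look roughly like this (a sketch of the idea, not the actual diff; the existing class attributes are omitted):

```python
from transformers import PreTrainedModel

class MPTPreTrainedModel(PreTrainedModel):
    # Existing attributes (config_class, base_model_prefix, ...) unchanged.
    # Tells accelerate's device_map="auto" machinery never to split a single
    # MPTBlock across devices when sharding the model.
    _no_split_modules = ["MPTBlock"]
```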
Working on the device_map issue as part of #225
Thank you, this should improve compatibility with Hugging Face and enable more applications to be built on top of MPT!
Hi @casperbh96, we have now added `device_map` support in this repo, as well as upgraded the source code for the models uploaded on the HF Hub. Could you try seeing if quantization works now (with `attn_impl=torch`)?
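For anyone following along, a hedged sketch of the kind of quantization attempt being discussed, based on AutoGPTQ's documented usage at the time (exact arguments and defaults may differ); the MPT config on the Hub would additionally need the `attn_impl=torch` override described above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

name = "mosaicml/mpt-7b"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)

# A real run would use a few hundred calibration samples, not one sentence.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(
    name,
    quantize_config,
    trust_remote_code=True,  # MPT ships custom modeling code on the Hub
)
model.quantize(examples)                  # runs GPTQ layer by layer
model.save_quantized("mpt-7b-4bit-gptq")  # writes quantized weights to disk
```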
Hi @abhi-mosaic - development is still ongoing. The recent improvements in foundry have made it much more manageable.
A PR is open in AutoGPTQ where we are trying to quantize MPT. There are still some problems, but I suspect it won't be too long before there is decent support for quantization.
One problem that has been highlighted is that the attention mask is not returned here: https://github.com/mosaicml/llm-foundry/blob/main/llmfoundry/models/mpt/modeling_mpt.py#L215
Closing this since it seems foundry has added the required support for GPTQ to run, although AutoGPTQ has not implemented quantization for MPT models yet.