llm-foundry
[FEATURE] Return attention_mask for GPTQ to work
🚀 Feature Request
I have been investigating how we can make GPTQ work in order to quantize MPT models. A lot of progress has already been made; however, there is one remaining pain point: the attention_mask is returned as None.
Motivation
When attention_mask is returned as None, methods like GPTQ cannot run because they need the attention_mask while quantizing the model. If foundry returned the attention_mask, AutoGPTQ could quantize MPT models successfully. I have already tested this myself, and the quantized model's responses are coherent and reasonable: https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq
Implementation
The implementation is as simple as returning the attention mask alongside the attention bias, i.e. return (attn_bias, attention_mask) instead of dropping the mask with None (see the sketch below). Tests and other code may need to be modified accordingly.
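For reference, here is a minimal sketch of the change, assuming the tuple is produced by something like MPTModel._attn_bias in the MPT modeling code; the names, shapes, and branches here are illustrative stand-ins, not the exact foundry implementation:

```python
import torch
from typing import Optional, Tuple

def _attn_bias(
    attn_bias: torch.Tensor,                    # (batch, n_heads, seq, seq) additive bias
    attention_mask: Optional[torch.Tensor],     # (batch, seq) padding mask, 1 = keep
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Hypothetical stand-in for the relevant part of MPTModel._attn_bias;
    the real method also builds the ALiBi bias and has more branches."""
    if attention_mask is not None:
        # Fold padding into the additive bias, as the real implementation does.
        min_val = torch.finfo(attn_bias.dtype).min
        key_padding = ~attention_mask.view(attention_mask.shape[0], 1, 1, -1).bool()
        attn_bias = attn_bias.masked_fill(key_padding, min_val)
    # Currently the mask is dropped at this point:
    #     return attn_bias, None
    # Proposed: return it as well, so tools like AutoGPTQ that capture and replay
    # per-block inputs can pass the mask back in during quantization.
    return attn_bias, attention_mask
```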
Hey @abhi-mosaic @vchiley, is it possible for one of you to look into this?
I was able to successfully quantize your MPT models to 4 bits with GPTQ after a simple edit to return the attention mask. It would be great if you could add support in foundry and push the change to the Hugging Face models.
https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq
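For context, a quantization run with AutoGPTQ looks roughly like the sketch below. The model id, calibration text, and config values are illustrative; in practice you would use many calibration samples. Each example carries an attention_mask, which is why the model needs to propagate it through its blocks:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "mosaicml/mpt-7b-8k-chat"  # illustrative; any MPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# Calibration data: AutoGPTQ replays these inputs (input_ids + attention_mask)
# through each transformer block while quantizing its weights.
examples = [
    tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
]
model.quantize(examples)
model.save_quantized("mpt-7b-8k-chat-gptq")
```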
Hi @casper-hansen, sorry for the delay in responding here. Conceptually this seems reasonable, although I don't know GPTQ well enough to say whether it fully solves the issue you are running into. Would you be willing to make a PR for this?
I would work on this.