
[FEATURE] Return attention_mask for GPTQ to work

Open casper-hansen opened this issue 2 years ago • 3 comments

🚀 Feature Request

I have been investigating how we can make GPTQ work in order to quantize MPT models. It seems that a lot of progress has been made already; however, there is one remaining pain point: the attention_mask is returned as None.

Motivation

When attention_mask is returned as None, methods like GPTQ will not work, because they need the attention_mask while quantizing the model. If foundry returned the attention_mask, AutoGPTQ would be able to quantize MPT models successfully. I have already tested this myself, and the quantized model's responses are coherent and reasonable: https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq
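For context, this is roughly what the quantization flow looks like with AutoGPTQ; the model name, calibration text, and config values below are illustrative placeholders, not the exact settings used for the model linked above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "mosaicml/mpt-7b-8k-chat"  # illustrative; any MPT checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Calibration examples: AutoGPTQ runs these through the model layer by layer,
# and each example carries both input_ids and attention_mask.
examples = [
    tokenizer("This is an illustrative calibration sentence."),
    tokenizer("GPTQ needs the attention mask for every layer it calibrates."),
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(
    model_name, quantize_config, trust_remote_code=True
)

# This is the step that breaks when the model's forward path returns the
# attention_mask as None, since GPTQ needs the mask while quantizing each block.
model.quantize(examples)
model.save_quantized("mpt-7b-8k-chat-gptq")
```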

Implementation

Implementation is as simple as this: `return (attn_bias, attention_mask)`
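For concreteness, here is a minimal, self-contained sketch of the idea; the function name and the bias construction are stand-ins, not the real llm-foundry code (the actual attention-bias helper in modeling_mpt.py has more branches):

```python
import torch

# Stand-in for the MPT attention-bias helper (simplified; not the real code).
# The only change proposed in this issue is the return value: hand the
# attention_mask back to the caller instead of dropping it as None.
def _attn_bias_sketch(attention_mask: torch.Tensor, seq_len: int,
                      dtype: torch.dtype = torch.float32):
    # Placeholder for the existing bias construction (unchanged by this proposal).
    attn_bias = torch.zeros(1, 1, seq_len, seq_len, dtype=dtype)

    # Before: `return attn_bias, None` -- downstream consumers (e.g. AutoGPTQ's
    # layer hooks) never see the mask.
    # After:
    return attn_bias, attention_mask


mask = torch.ones(1, 8, dtype=torch.bool)
attn_bias, returned_mask = _attn_bias_sketch(mask, seq_len=8)
assert returned_mask is mask  # the mask now reaches downstream code
```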

Tests/other code may need to be modified.

casper-hansen · Jul 27 '23

Hey @abhi-mosaic @vchiley, is it possible for one of you to look into this?

I was able to successfully quantize your MPT models to 4 bits with GPTQ after a simple edit to return the attention mask. It would be great if you could add this support in foundry and push it to the Hugging Face models.

https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq

casper-hansen · Aug 16 '23

Hi @casper-hansen, sorry for the delayed response here. Conceptually this seems reasonable, although I don't know GPTQ well enough to say whether or not it fully solves the issue you are running into. Would you be willing to make a PR for this?

dakinggg · Sep 15 '23

I would like to work on this.

rajveer43 · Sep 29 '23