
GPTQ models support

Open synacktraa opened this issue 2 years ago • 5 comments

Can it handle GPTQ models the way the transformers library's AutoModelForCausalLM does?

synacktraa avatar Nov 20 '23 08:11 synacktraa
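(No snippet was shared in the thread; below is a minimal sketch of what loading a GPTQ checkpoint could look like, assuming attention_sinks exposes drop-in replacements for transformers' auto classes as its README describes, and that auto-gptq is installed. The checkpoint ID and the two `attention_sink_*` keyword arguments follow the library's documented usage, but treat the exact values as illustrative.)

```python
# Minimal sketch, not from the thread: assumes attention_sinks mirrors
# transformers' auto classes and that auto-gptq is installed.
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # illustrative GPTQ checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attention_sink_size=4,            # initial "sink" tokens to keep (assumed kwarg)
    attention_sink_window_size=1020,  # sliding window of recent tokens (assumed kwarg)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```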

It's working without any problems, but why is the generation speed slow compared to non-quantized models?

synacktraa avatar Nov 20 '23 09:11 synacktraa
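(To put a number on the slowdown, here is a quick timing sketch, reusing the `model` and `tokenizer` from the snippet above and assuming a CUDA device; the prompt and token count are arbitrary.)

```python
# Measure generation throughput in tokens/second for a fixed prompt.
import time
import torch

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```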

Hello!

There shouldn't be any major changes to generation, but attention_sinks doesn't currently support flash attention in any of its models. Perhaps that's the source of the difference in generation speed you're experiencing?

  - Tom Aarsen

tomaarsen avatar Nov 20 '23 09:11 tomaarsen
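(For context, this is how Flash Attention 2 is enabled in plain transformers; it requires the flash-attn package and a supported GPU. Since attention_sinks doesn't expose an equivalent option, its models fall back to the default attention implementation, which would explain a gap against a flash-attention-enabled baseline. The checkpoint below is illustrative, and the flag name varies by transformers version.)

```python
# Sketch of a flash-attention-enabled baseline in plain transformers,
# for comparison against attention_sinks.
import torch
from transformers import AutoModelForCausalLM

baseline = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # older releases used use_flash_attention_2=True
)
```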

Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.

synacktraa avatar Nov 20 '23 09:11 synacktraa

It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Minami-su avatar Jan 11 '24 08:01 Minami-su
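(If the fork is laid out as a standard package, an assumption not confirmed in the thread, it could presumably be installed directly with `pip install git+https://github.com/Minami-su/attention_sinks_autogptq.git`.)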

> It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Thank you 🙏

synacktraa avatar Jan 11 '24 19:01 synacktraa