
GPTQ models support

Open synacktraa opened this issue 2 years ago • 5 comments

Can it handle GPTQ models the way the transformers library's AutoModelForCausalLM does?

synacktraa avatar Nov 20 '23 08:11 synacktraa
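(No snippet was shared in the thread; below is a minimal sketch of what loading a GPTQ checkpoint could look like, assuming attention_sinks exposes drop-in replacements for transformers' auto classes as its README describes, and that auto-gptq is installed. The checkpoint ID and the two `attention_sink_*` keyword arguments follow the library's documented usage, but treat the exact values as illustrative.)

```python
# Minimal sketch, not from the thread: assumes attention_sinks mirrors
# transformers' auto classes and that auto-gptq is installed.
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # illustrative GPTQ checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attention_sink_size=4,            # initial "sink" tokens to keep (assumed kwarg)
    attention_sink_window_size=1020,  # sliding window of recent tokens (assumed kwarg)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```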

It's working without any problems, but why is the generation speed slow compared to non-quantized models?

synacktraa avatar Nov 20 '23 09:11 synacktraa
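(To put a number on the slowdown, here is a quick timing sketch, reusing the `model` and `tokenizer` from the snippet above and assuming a CUDA device; the prompt and token count are arbitrary.)

```python
# Measure generation throughput in tokens/second for a fixed prompt.
import time
import torch

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```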

Hello!

There shouldn't be any major changes to generation, but attention_sinks doesn't currently support flash attention in any of its models. Perhaps that's the source of the difference in generation speed you're experiencing?

  - Tom Aarsen

tomaarsen avatar Nov 20 '23 09:11 tomaarsen
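(For context, this is how Flash Attention 2 is enabled in plain transformers; it requires the flash-attn package and a supported GPU. Since attention_sinks doesn't expose an equivalent option, its models fall back to the default attention implementation, which would explain a gap against a flash-attention-enabled baseline. The checkpoint below is illustrative, and the flag name varies by transformers version.)

```python
# Sketch of a flash-attention-enabled baseline in plain transformers,
# for comparison against attention_sinks.
import torch
from transformers import AutoModelForCausalLM

baseline = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # older releases used use_flash_attention_2=True
)
```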

Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.

synacktraa avatar Nov 20 '23 09:11 synacktraa

It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Minami-su avatar Jan 11 '24 08:01 Minami-su
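(If the fork is laid out as a standard package, an assumption not confirmed in the thread, it could presumably be installed directly with `pip install git+https://github.com/Minami-su/attention_sinks_autogptq.git`.)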

> It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Thank you 🙏

synacktraa avatar Jan 11 '24 19:01 synacktraa