text-generation-inference
[WIP] Adding GPTQ support for llama
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Some interesting results: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/227. It seems GPTQ with group-size and act-order has a negative impact on inference performance.
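For context, group-size and act-order are the two GPTQ knobs being discussed. A minimal sketch of how they are typically exposed, here using the third-party AutoGPTQ library as an assumption (this PR ships its own implementation, whose API is not shown in this thread):

```python
# Sketch only: uses the third-party AutoGPTQ API, not this PR's code.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,  # one scale/zero-point per 128 weights; -1 = per-column
    desc_act=True,   # "act-order": process columns by activation magnitude,
                     # which improves PPL but can slow inference kernels
)

model = AutoGPTQForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # hypothetical model id, for illustration only
    quantize_config,
)
# model.quantize(calibration_examples) would then run the actual GPTQ pass.
```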
performance

What kind? PPL? Yes, but it's usually acceptable. Latency? No, it doesn't in our prod. It actually helps quite a lot, because there's a lot more RAM to work with (so queue times drop sharply while running the model on half the hardware).
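To make the RAM argument concrete, here is a back-of-the-envelope comparison of weight memory for LLaMA-7B; the group size and per-group overhead are rough assumptions, not measurements from this PR:

```python
# Rough weight-memory arithmetic; numbers are illustrative, not benchmarks.
params = 7e9                  # LLaMA-7B parameter count (approx.)

fp16_gib = params * 2 / 2**30 # 2 bytes per weight in float16 -> ~13.0 GiB

group_size = 128              # assumed GPTQ group size
# 4-bit weights plus one fp16 scale and one 4-bit zero-point per group
gptq_bits_per_weight = 4 + (16 + 4) / group_size
gptq_gib = params * gptq_bits_per_weight / 8 / 2**30  # -> ~3.4 GiB

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit GPTQ: {gptq_gib:.1f} GiB")
# The ~10 GiB freed can hold larger batches / KV caches, shortening queues.
```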
Really looking forward to this feature becoming available.
Will be superseded by: https://github.com/huggingface/text-generation-inference/pull/438