text-generation-inference
Inference support for GPTQ (llama + falcon tested) + Quantization script
Let's start discussing implementation.
- Need to expose the quantization scripts (either include them here or add docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa); see the sketch after this list.
- Make sure GPTQ works for multiple models (priority to Falcon).
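For the first item, here is a minimal sketch of what an exposed quantization entrypoint could look like. Everything here is an assumption for illustration: the flags, the `quantize_model` helper, and its signature are hypothetical, not an existing interface in this repo or in GPTQ-for-LLaMa.

```python
# Hypothetical CLI shape for an exposed quantization script; none of
# these flags exist yet, and quantize_model is a placeholder for the
# real GPTQ routine (e.g. the one in GPTQ-for-LLaMa).
import argparse


def quantize_model(model_id: str, bits: int, groupsize: int, output_dir: str) -> None:
    """Placeholder: run GPTQ over each linear layer of `model_id` and
    save the resulting qweight/qzeros/scales/g_idx tensors to
    `output_dir`."""
    raise NotImplementedError("wire this up to the actual GPTQ implementation")


def main() -> None:
    parser = argparse.ArgumentParser(description="Quantize a model with GPTQ")
    parser.add_argument("model_id", help="local path or hub id of the model")
    parser.add_argument("--bits", type=int, default=4, help="quantization bit width")
    parser.add_argument("--groupsize", type=int, default=128, help="GPTQ group size")
    parser.add_argument("--output-dir", required=True, help="where to write quantized weights")
    args = parser.parse_args()
    quantize_model(args.model_id, args.bits, args.groupsize, args.output_dir)


if __name__ == "__main__":
    main()
```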
Currently this means that every place we use get_{tensor|sharded} has to check for quantization.
My idea is to reintegrate as much of this as possible into utils/layer.py
by expanding load_multi to be a bit more generic.
This might require some thought, but ultimately qweight, qzeros, scales, and g_idx
should be loaded in a single place, independent of whether a bias is present.
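As a rough illustration of that consolidation, here is a minimal sketch assuming a `weights` object that exposes `get_tensor` and `has_tensor`; the `GPTQParams` container and both method names are assumptions made for this example, not the current text-generation-inference API.

```python
# Hypothetical sketch: gather every GPTQ tensor for one layer in a
# single helper, regardless of whether a bias is present.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class GPTQParams:
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor
    bias: Optional[torch.Tensor] = None


def load_gptq(weights, prefix: str) -> GPTQParams:
    """Load all GPTQ tensors for `prefix` in one place.

    `weights` is assumed to expose get_tensor(name) and has_tensor(name);
    callers no longer need to probe for qweight/qzeros/scales/g_idx
    individually.
    """
    bias = None
    if weights.has_tensor(f"{prefix}.bias"):
        bias = weights.get_tensor(f"{prefix}.bias")
    return GPTQParams(
        qweight=weights.get_tensor(f"{prefix}.qweight"),
        qzeros=weights.get_tensor(f"{prefix}.qzeros"),
        scales=weights.get_tensor(f"{prefix}.scales"),
        g_idx=weights.get_tensor(f"{prefix}.g_idx"),
        bias=bias,
    )
```

With something like this, each layer would consume one object instead of scattering quantization checks across every get_{tensor|sharded} call site.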
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.