Open-Assistant
Add minimal support for GPTQ 4bit quantization
It is based on a separate inference worker Dockerfile, a Bash entry point, and an inference server Python script.
This is a draft PR that only partially integrates with the current OpenAssistant services, as a minimal working PoC for 4bit quantized models. I will gladly accept advice on integrating it better into the existing code base (there is plenty to ask about, and my current approach is the bare minimum needed to get it running), but I first wanted to ask for everyone's feedback: Is this something you think is even needed? Is it worthwhile as a temporary solution until the transformers library and bitsandbytes provide native support for 4bit quantization? Are there alternative, possibly better, ways to run 4bit models with OpenAssistant for those of us who need 30B+ parameters to try out plugins and tools but don't have 60GB or more of RAM?
@andreaskoepf @yk and everyone potentially interested, let me know what you think.
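For context, a minimal sketch of what loading a GPTQ 4bit checkpoint inside an inference worker could look like. This is not the code from this PR: it uses the AutoGPTQ library as an illustration (the PR may use a different GPTQ implementation), and the model path and prompt are placeholders.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical checkpoint directory -- not a path used by this PR.
MODEL_DIR = "models/llama-30b-oasst-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

# Load the pre-quantized 4bit weights; a 30B model then needs roughly
# 20 GB of GPU memory instead of ~60 GB in fp16.
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_DIR,
    device="cuda:0",
    use_safetensors=True,  # assumes the checkpoint is stored as safetensors
)

# Example prompt in the OpenAssistant chat format.
prompt = "<|prompter|>Hello, who are you?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```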
:x: pre-commit failed.
Please run `pre-commit run --all-files` locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
I think this is a totally reasonable approach in terms of getting it running. We need to consider whether this should be part of the main repo or not. It could be really useful if we get the inference setup to a point where it can be set up a bit more easily by casual users.
Then I could revert the docker-compose changes (to restore the defaults) and add a README explaining how to change the entrypoint in order to run in 4bit? At present, all one has to do to run in 4bit is check out the branch, but of course that overrides the main repo's defaults; we could instead just add the scripts and then describe how to achieve the same effect as checking out the branch.
Has bitsandbytes 4bit support now been integrated into OpenAssistant so we can close this, or does anyone think it may still be of use?
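For reference, recent transformers releases (4.30+) can load models in 4bit directly through bitsandbytes, which is the native path this PR was meant to bridge until it existed. A minimal sketch of that path, assuming a sufficiently new transformers/bitsandbytes install; the model id is a placeholder and this is not wired into the OpenAssistant inference worker:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "some-org/some-30b-model"  # placeholder model id

# NF4 4bit quantization; compute is still done in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPUs
)
```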