
Add minimal support for GPTQ 4-bit quantization

pevogam opened this issue 1 year ago • 5 comments

It is based on a separate inference worker Dockerfile, bash entry point, and inference server Python script.

This is a draft PR with only partial integration into the current OpenAssistant services, intended as a minimal working PoC for 4-bit quantized models. I will gladly accept advice on integrating it better into the existing code base (there is plenty to ask about, and my current approach is the bare minimum needed to get it running), but I wanted to ask for everyone's feedback first: Is this something you think is even needed? Is it acceptable as a temporary solution until the transformers and bitsandbytes libraries provide native support for 4-bit quantization? Are there alternative, possibly better ways to run 4-bit models with OpenAssistant for those of us who need 30B+ parameters to try out plugins and tools but don't have 60 GB or more of RAM?
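For context on the RAM figures above: a rough weight-only footprint estimate is parameters × bits-per-weight ÷ 8, ignoring activations, KV cache, and quantization metadata. A quick sketch (the function name is mine, not from the PR):

```python
def model_weight_footprint_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate in (decimal) gigabytes.

    Ignores activations, KV cache, and quantization metadata overhead.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 30B-parameter model: ~60 GB at fp16, ~15 GB at 4-bit
print(model_weight_footprint_gb(30, 16))  # 60.0
print(model_weight_footprint_gb(30, 4))   # 15.0
```

This is why 4-bit quantization brings a 30B model within reach of a single 24 GB consumer GPU.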

@andreaskoepf @yk and everyone potentially interested, let me know what you think.

pevogam avatar May 25 '23 06:05 pevogam

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar May 25 '23 06:05 github-actions[bot]

I think this is a totally reasonable approach in terms of getting it running. We need to consider whether this should be part of the main repo or not. It could be really useful if we get the inference setup to a point where it can be set up a bit more easily by casual users.

olliestanley avatar Jun 02 '23 06:06 olliestanley

> I think this is a totally reasonable approach in terms of getting it running. We need to consider whether this should be part of the main repo or not. It could be really useful if we get the inference setup to a point where it can be set up a bit more easily by casual users.

Then I could revert the docker-compose changes (restoring the defaults) and add a README section on how to change the entrypoint in order to run in 4-bit? At present, all one has to do to run in 4-bit is check out this branch, but of course that overrides the main repo's defaults; instead, we could simply add the scripts and document how to achieve the same effect as the branch checkout.

pevogam avatar Jun 02 '23 11:06 pevogam

Has bitsandbytes 4-bit support been integrated into OpenAssistant by now, so that we can close this? Or does anyone think it may still be of use?
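For reference, by this point transformers had gained native bitsandbytes 4-bit loading (NF4/FP4, introduced alongside QLoRA around v4.30), which may make a separate GPTQ path redundant for some users. A minimal sketch of that API; the model id below is a hypothetical example, and actually loading it requires a CUDA GPU plus the `bitsandbytes` package:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config from transformers' bitsandbytes integration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical example model id; substitute whichever model the worker serves
model = AutoModelForCausalLM.from_pretrained(
    "OpenAssistant/some-30b-model",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
```

Unlike GPTQ, this quantizes on the fly at load time, so no pre-quantized checkpoint is needed.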

pevogam avatar Oct 03 '23 17:10 pevogam