Mark Schmidt
Leaving 1 or 2 layers fully uncompressed yields very good results in the literature, from what I recall. This is a good idea to test; a rough sketch of the idea is below.
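For illustration, a minimal PyTorch sketch of skipping the first and last blocks during weight quantization. The `layers.<i>.` naming convention and the simple int8 round-trip are assumptions for the sketch, not any particular method from the literature:

```python
import torch.nn as nn

def fake_quantize_int8(linear: nn.Linear) -> None:
    """Round-trip a Linear layer's weights through symmetric int8."""
    w = linear.weight.data
    scale = w.abs().max() / 127.0
    linear.weight.data = (w / scale).round().clamp(-127, 127) * scale

def quantize_all_but_edge_layers(model: nn.Module, num_layers: int) -> None:
    # Leave the first and last transformer blocks in full precision;
    # assumes blocks are named "layers.<i>." as in the LLaMA reference code.
    keep_full_precision = {0, num_layers - 1}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(f"layers.{i}." in name for i in keep_full_precision):
            continue
        fake_quantize_int8(module)
```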
Dropping this here for those who don't know: You can serve any model as an OpenAI-compatible API endpoint with Basaran: [https://github.com/hyperonym/basaran](https://github.com/hyperonym/basaran) "Basaran is an open-source alternative to the...
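For example, the stock `openai` Python client can be pointed at a Basaran instance; the host, port, and model name below are placeholders, not Basaran defaults:

```python
import openai

# Point the client at the local Basaran server instead of api.openai.com.
# The URL and model name are placeholders for whatever you're running.
openai.api_base = "http://127.0.0.1:8000/v1"
openai.api_key = "unused"  # the client requires a key even if the server ignores it

response = openai.Completion.create(
    model="your-local-model",
    prompt="Hello, world!",
    max_tokens=32,
)
print(response["choices"][0]["text"])
```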
Vicuna-13B and other 13B fine-tunes in 4-bit are only 8GB and even run purely on CPU at useful speeds. "Open Assistant LLaMA-13B" is also highly capable, similar to Vicuna....
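One way to try this on CPU is llama.cpp via its Python bindings (my assumption here; the comment above doesn't name a runtime). A minimal sketch, with the model path as a placeholder:

```python
from llama_cpp import Llama

# CPU-only inference on a 4-bit quantized model file (path is a placeholder).
llm = Llama(model_path="./vicuna-13b-q4_0.bin", n_threads=8)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])
```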
> > @MarkSchmidty yeah this has to happen eventually. for the embeddings:
> >
> > * [tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder)
> >
> > another option is also no embeddings at all,...
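For the embeddings route, the linked tutorial boils down to something like this (module handle is the public Universal Sentence Encoder v4):

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub and embed two sentences.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentences = ["How do I reset my password?", "Password reset instructions"]
vectors = embed(sentences).numpy()

# Cosine similarity between the two embeddings.
a, b = vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```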
I think you'll find you need Pascal (P40) or newer to run models in 4-bit.
To my knowledge, yes.
The open PR #2594 would resolve this issue for LLaMA-based models and go a long way towards supporting all models. It adds a configurable API server URL and...
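The idea, roughly (this is an illustrative sketch, not the PR's actual code; the environment variable name is made up):

```python
import os
import openai

# Read the server URL from an environment variable (name is illustrative),
# falling back to the official endpoint. Any OpenAI-compatible server, e.g.
# one hosting a LLaMA-based model, then receives the requests.
openai.api_base = os.environ.get("OPENAI_API_BASE_URL", "https://api.openai.com/v1")
openai.api_key = os.environ.get("OPENAI_API_KEY", "unused")
```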
> "offline API" is a recurring topic here, and some other folks mentioned the lack of "learning", where Auto-GPT keeps looking for information (thinking) that it should already have. These...
> Can we expect that this forthcoming dataset declaration will include those inputs that imbue this model with politically correct output (even with a neutral SYSTEM prompt)? Only the...
LLaMA-7B can be run on CPU instead of GPU using this fork of the LLaMA repo: [https://github.com/markasoftware/llama-cpu](https://github.com/markasoftware/llama-cpu). To quote the author: "On a Ryzen 7900X, the 7B model is able...