Mark Schmidt

95 comments by Mark Schmidt

Leaving 1 or 2 layers fully uncompressed has shown very good results in the literature, from what I recall. This would be a good idea to test.
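
Since this comes up a lot, here is a minimal sketch of the idea, assuming a simple PyTorch block stack and a round-to-nearest fake quantizer; none of this is any particular library's implementation:

```python
# Minimal sketch: 4-bit round-to-nearest quantization of every block's
# weights EXCEPT the first and last blocks, which stay in full precision.
import torch
import torch.nn as nn

def quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Symmetric round-to-nearest fake quantization to the int4 range."""
    scale = w.abs().max() / 7 + 1e-12  # map max |w| to 7; epsilon avoids 0/0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # dequantize back to float to simulate the effect

# Hypothetical stand-in for a stack of Transformer blocks.
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])

keep_full_precision = {0, len(blocks) - 1}  # leave first and last blocks alone
with torch.no_grad():
    for i, block in enumerate(blocks):
        if i not in keep_full_precision:
            block.weight.copy_(quantize_4bit(block.weight))
```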

Dropping this here for those who don't know: you can serve any model as an OpenAI-compatible API endpoint with Basaran: [https://github.com/hyperonym/basaran](https://github.com/hyperonym/basaran) "Basaran is an open-source alternative to the...
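
As a quick illustration of the "OpenAI-compatible" part: the stock OpenAI Python client (the pre-1.0 `openai.Completion` API) can be pointed at a running Basaran instance. The port, path, and model id below are assumptions; match them to how you launched the server:

```python
# Hedged sketch: query a locally running Basaran server through the
# standard OpenAI Python client (v0.x API). Adjust host/port/model id
# to your own deployment.
import openai

openai.api_key = "unused"                   # placeholder; assumed not checked
openai.api_base = "http://127.0.0.1:80/v1"  # assumed local Basaran endpoint

resp = openai.Completion.create(
    model="your-org/your-model",  # hypothetical model id served by Basaran
    prompt="Once upon a time,",
    max_tokens=64,
)
print(resp["choices"][0]["text"])
```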

Vicuna-13B and other 13B fine-tunes in 4-bit are only 8GB and run at useful speeds even purely on CPU. "Open Assistant LLaMA-13B" is also highly capable, similar to Vicuna....

> > @MarkSchmidty yeah this has to happen eventually. for the embeddings:
> >
> > * [tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder)
> >
> > another option is also no embeddings at all,...
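
For reference, the Universal Sentence Encoder route from that tutorial boils down to something like the sketch below; the sentences are placeholders:

```python
# Sketch: embed sentences with the Universal Sentence Encoder from TF Hub
# and score how close they are with cosine similarity.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["How do I reset my password?",
             "I forgot my login credentials."]
vecs = embed(sentences).numpy()

a, b = vecs
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```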

I think you'll find you need Pascal (e.g., P40) or newer to run models in 4-bit.

The open PR #2594 would resolve this issue for LLaMA-based models and go a long way towards supporting all models. It adds a configurable API server URL and...

> "offline API" is a recurring topic here, and some other folks mentioned the lack of "learning", where Auto-GPT keeps looking for information (thinking) that it should already have. These...

> Can we expect that this forthcoming dataset declaration will include those inputs that imbue this model with politically correct output (even with a neutral SYSTEM prompt)? Only the...

LLaMA-7B can be run on CPU instead of GPU using this fork of the LLaMA repo: https://github.com/markasoftware/llama-cpu. To quote the author: "On a Ryzen 7900X, the 7B model is able...
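
The fork itself is launched from the command line (see its README). As a rough alternative illustration of LLaMA-class CPU inference, and explicitly not the linked fork's code path, here is a Hugging Face `transformers` sketch; the checkpoint path is hypothetical:

```python
# Hedged sketch of CPU inference with Hugging Face transformers (NOT the
# linked fork). Point model_path at a converted checkpoint you already have.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/converted-llama-7b"  # hypothetical local checkpoint
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)  # loads on CPU by default

inputs = tok("The main bottleneck for CPU inference is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```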