
Vicuna (Fine-tuned LLaMA)

Open slavakurilyak opened this issue 1 year ago • 28 comments

It would be great to see LangChain wrap around Vicuna, a chat assistant fine-tuned from LLaMA on user-shared conversations.

Vicuna-13B is an open-source chatbot trained on user-shared conversations collected from ShareGPT. Evaluated with GPT-4 as a judge, it achieves more than 90% of the quality of OpenAI's ChatGPT and Google Bard, while outperforming other models like LLaMA (https://github.com/hwchase17/langchain/issues/1473) and Stanford Alpaca (https://github.com/hwchase17/langchain/issues/1777) in more than 90% of cases.

Useful links

  1. Blog post: vicuna.lmsys.org
  2. Training and serving code: github.com/lm-sys/FastChat
  3. Demo: chat.lmsys.org

slavakurilyak avatar Mar 31 '23 13:03 slavakurilyak

This should be possible using llama.cpp now that #2314 landed. I'm able to run Vicuna locally with llama.cpp directly, but it's hanging when using it with LangChain.

oldsj avatar Apr 05 '23 12:04 oldsj

This model is absolutely extraordinary. I can't believe how close to GPT-4 it is. The 4-bit HF version works very well; it needs to be onboarded.

fblgit avatar Apr 05 '23 21:04 fblgit

It would be awesome to feed some local documents to Vicuna through LangChain. I'm keeping an eye on this.

PetreVane avatar Apr 05 '23 22:04 PetreVane

This model is definitely something else! Would appreciate proper support for it.

lolxdmainkaisemaanlu avatar Apr 06 '23 15:04 lolxdmainkaisemaanlu

I've just started on my langchain odyssey and it would be awesome to learn it with Vicuna. I have the 13b model running on my pc with https://github.com/oobabooga/text-generation-webui which also provides an api with examples.

greggpatton avatar Apr 07 '23 15:04 greggpatton

I managed to run Vicuna 13b using llm-api and used it in Langchain:

I've written an app to run llama-based models using docker here: https://github.com/1b5d/llm-api, thanks to llama-cpp-python and llama.cpp.

You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api in order to use this exposed API with Langchain; it also supports streaming. My goal is to easily run different models locally (and also remotely), switch between them easily, and then use these APIs to develop with Langchain.

To run Vicuna:

  • First configure the API and run it with docker compose up, as described here: https://github.com/1b5d/llm-api
  • Then you can simply make requests to it:
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}'
  • or you can play around with it using langchain via the lib:

pip install langchain-llm-api

from langchain_llm_api import LLMAPI

llm = LLMAPI()
llm("What is the capital of France?")
# output:
# \nThe capital of France is Paris.

1b5d avatar Apr 08 '23 22:04 1b5d

fyi: you can load the vicuna model through huggingface transformers by installing it from their git repo. then just load the tokenizer and model via LlamaTokenizer.from_pretrained(...) and LlamaForCausalLM.from_pretrained(...) and pass them to langchain's HuggingFacePipeline. 4-bit models can also be loaded via GPTQ-for-LLaMa.

so... the model is already supported, you just need to use transformers from git until they release a stable version. vicuna, however, does not play well with langchain... most of the default prompt templates are not suitable, as vicuna fails to follow instructions, and I haven't managed to get reliable results.
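For anyone wanting to try this route, here is a minimal sketch of the approach described above. The model path is a placeholder for wherever your converted Vicuna weights live, and device_map="auto" assumes accelerate is installed; treat it as a starting point rather than a tested recipe.

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline

model_path = "path/to/vicuna-13b"  # placeholder: local dir with converted weights

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")

# Wrap a standard HF text-generation pipeline so LangChain can drive it
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Vicuna-style prompt format
print(llm("### Human: What is the capital of France?\n### Assistant:"))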

knoopx avatar Apr 09 '23 02:04 knoopx

Would be awesome to have Vicuna on-board in langchain integrations list.

Bec-k avatar Apr 10 '23 12:04 Bec-k

@BillSchumacher Interested in tackling this?

slavakurilyak avatar Apr 10 '23 15:04 slavakurilyak

i'm happy to help with this one as well. vicuna's been by far the most promising local model i've tested to date.

fblissjr avatar Apr 10 '23 16:04 fblissjr

This should be relatively easy; later tonight I'm going to make it easier to start a conversation (or multiple) with another project: https://github.com/BillSchumacher/Auto-Vicuna

BillSchumacher avatar Apr 10 '23 16:04 BillSchumacher

I implemented plugins late last night, but I'm going to make it a bit more abstracted so it can integrate with anything easily.

BillSchumacher avatar Apr 10 '23 16:04 BillSchumacher

@BillSchumacher been watching your work in autogpt, you must know this stuff inside and out by now. excited to try it out later tonight - thanks!

fblissjr avatar Apr 10 '23 16:04 fblissjr

@slavakurilyak You can currently run Vicuna models using LlamaCpp if you're okay with CPU inference (I've tested both 7b and 13b models and they work great). Currently there is no LlamaChat class in LangChain (though llama-cpp-python has a create_chat_completion method). Additionally, prompt caching is an open issue (high priority but blocked), so inference is slower than it should be for chat conversations.
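For reference, a minimal sketch of the LlamaCpp route (the model path is a placeholder for your own quantized, ggml-converted Vicuna weights):

from langchain.llms import LlamaCpp

# Placeholder path: point this at your quantized ggml Vicuna model file
llm = LlamaCpp(
    model_path="./models/ggml-vicuna-13b-q4_0.bin",
    n_ctx=2048,      # context window
    temperature=0.7,
)

# Vicuna-style prompt format
print(llm("### Human: What is the capital of France?\n### Assistant:"))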

abetlen avatar Apr 10 '23 17:04 abetlen

You can do GPU inference using my repo, or rather, you have to.

BillSchumacher avatar Apr 10 '23 18:04 BillSchumacher

Did you try llm-api for CPU inference? You can simply run a docker container and expose the model through a simple API. You can then use langchain-llm-api to add llm-api support to langchain. Let me know your thoughts on running Vicuna.

p.s. thanks to @abetlen for llama-cpp-python

1b5d avatar Apr 10 '23 19:04 1b5d

No, I tried CPU inference with LLaMA-Adapter and it was awful.

BillSchumacher avatar Apr 10 '23 19:04 BillSchumacher

Yeah, CPU inference is nowhere near the experience you can get with GPU, even at 4-bit.

fblissjr avatar Apr 10 '23 19:04 fblissjr

I added a function to get a one-shot response to my repo. I have to wait for FastChat to release so that my upstream transformers compatibility fix is in their package.

From source, though, you could import it like:

from auto_vicuna.chat import chat_one_shot

and

from auto_vicuna.conversation import make_conversation

which is the helper function to create a conversation.

Once I can release my package again I'll implement this unless you want to have a source requirement.

There's other stuff like loading the model too, but all that's pretty easy.

BillSchumacher avatar Apr 11 '23 06:04 BillSchumacher

> fyi: you can load the vicuna model through huggingface transformers by installing it from their git repo. then just load the tokenizer and model via LlamaTokenizer.from_pretrained(...) and LlamaForCausalLM.from_pretrained(...) and pass them to langchain's HuggingFacePipeline. 4-bit models can also be loaded via GPTQ-for-LLaMa.

The pipeline doesn't seem to work for LoRA (PEFT). Does it work with GPTQ-for-LLaMa?

gururise avatar Apr 11 '23 07:04 gururise

It would also be nice if it could integrate user suggestions better, so our feedback actually gets taken into account by AutoGPT; that would be a big quality/effectiveness improvement with these models. Enabling a "free mode" or "low cost mode", or "everything the same but also using idle PC resources", would be great too! 🤩

GoMightyAlgorythmGo avatar Apr 12 '23 08:04 GoMightyAlgorythmGo

> I managed to run Vicuna 13b using llm-api and used it in Langchain: [full setup instructions quoted from 1b5d's comment above]

I was wondering if your API supports an embedding interface? I am interested in building a local documentation chat application and would like to obtain the vectors through Vicuna. Would it be possible to provide some guidance or documentation on how to achieve this? Thank you very much for your time and assistance.

bboymimi avatar Apr 13 '23 03:04 bboymimi

> I was wondering if your API supports an embedding interface? I am interested in building a local documentation chat application and would like to obtain the vectors through Vicuna. Would it be possible to provide some guidance or documentation on how to achieve this?

Yes, there is an endpoint for getting embeddings; check out the readme of the API repo. It's also integrated in the langchain lib mentioned above and can be used directly.

1b5d avatar Apr 13 '23 06:04 1b5d

I tried running the agent example for langchain using Vicuna 7b and 13b, but the results were not good. Vicuna doesn't seem to follow the instructions well enough.

BIGPPWONG avatar Apr 14 '23 04:04 BIGPPWONG

@BIGPPWONG how much prompting did you do?

I've just hacked a local setup for Vicuna-7B running on GPU (based on the Hugging Face implementation, not llama.cpp) to work with the Langchain ReAct agent; implementation here: https://github.com/paolorechia/learn-langchain/tree/main

I spent a couple of hours trying to make it fetch a random Chuck Norris joke for me... wasn't easy, I had to build some huge prompts... but eventually it worked. Here's the run: https://gist.github.com/paolorechia/0b8b5e08b38040e7ec10eef237caf3a5

I've never used the OpenAI models to build an agent to compare, but I'm guessing it's a lot easier to use, right?
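For anyone curious what the wiring looks like, here is a rough sketch using any LangChain LLM wrapper that serves Vicuna (the LlamaCpp model path is just a placeholder; the repo above serves the model via Hugging Face instead):

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import LlamaCpp

# Placeholder: any LangChain LLM wrapping Vicuna works here
llm = LlamaCpp(model_path="./models/ggml-vicuna-7b-q4_0.bin", n_ctx=2048)

tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What is 13 raised to the power of 0.5?")

As the comments above suggest, expect to spend time on prompting; Vicuna drifts from the ReAct format much more readily than the hosted models.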

paolorechia avatar Apr 18 '23 21:04 paolorechia

@paolorechia I only used the built-in prompt. your work looks promising.

BIGPPWONG avatar Apr 19 '23 02:04 BIGPPWONG

https://github.com/AlenVelocity/langchain-llama

shubham8550 avatar Apr 29 '23 02:04 shubham8550

@BIGPPWONG @paolorechia not sure if this has anything to do with Vicuna acting clunky, but could be something worth looking into:

Since this is instruction tuned, for best results, use the following format for inference (note that the instruction format is different from Alpaca):

### Human: your-prompt
### Assistant:

from the vicuna-13b-4bit model card.
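For example, a minimal sketch of baking that format into a LangChain prompt (assuming llm is any LangChain LLM already wrapping Vicuna):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Vicuna's instruction format from the model card
template = """### Human: {instruction}
### Assistant:"""

prompt = PromptTemplate(template=template, input_variables=["instruction"])
chain = LLMChain(prompt=prompt, llm=llm)  # llm: any LangChain LLM wrapping Vicuna

print(chain.run("What is the capital of France?"))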

jacobhrussell avatar May 01 '23 14:05 jacobhrussell

Follow-up on the comments above: I've recently updated llm-api to be able to run llama.cpp, GPTQ-for-LLaMa, or a generic huggingface pipeline. You can easily switch between CPU and GPU, for running Llama 2 for example.

1b5d avatar Jul 24 '23 18:07 1b5d

As there are now multiple implementations for Vicuna, I'm closing this issue.

slavakurilyak avatar Oct 13 '23 12:10 slavakurilyak