
Vicuna (Fine-tuned LLaMA)

Open slavakurilyak opened this issue 1 year ago • 28 comments

It would be great to see LangChain wrap around Vicuna, a chat assistant fine-tuned from LLaMA on user-shared conversations.

Vicuna-13B is an open-source chatbot trained on user-shared conversations collected from ShareGPT. Evaluated with GPT-4 as a judge, it achieves more than 90% of the quality of OpenAI's ChatGPT and Google Bard, while outperforming other models like LLaMA (https://github.com/hwchase17/langchain/issues/1473) and Stanford Alpaca (https://github.com/hwchase17/langchain/issues/1777) in more than 90% of cases.

Useful links

  1. Blog post: vicuna.lmsys.org
  2. Training and serving code: github.com/lm-sys/FastChat
  3. Demo: chat.lmsys.org

slavakurilyak avatar Mar 31 '23 13:03 slavakurilyak

This should be possible using llama.cpp now that #2314 landed. I'm able to run Vicuna locally with llama.cpp directly, but it's hanging when using it with LangChain.

oldsj avatar Apr 05 '23 12:04 oldsj

This model is absolutely extraordinary. I can't believe how close to GPT-4 it is. The 4-bit HF version works very well; it needs to be onboarded.

fblgit avatar Apr 05 '23 21:04 fblgit

It would be awesome to feed some local documents to Vicuna through LangChain. I'm keeping an eye on this.

PetreVane avatar Apr 05 '23 22:04 PetreVane

This model is definitely something else! Would appreciate proper support for it.

lolxdmainkaisemaanlu avatar Apr 06 '23 15:04 lolxdmainkaisemaanlu

I've just started on my langchain odyssey and it would be awesome to learn it with Vicuna. I have the 13b model running on my pc with https://github.com/oobabooga/text-generation-webui which also provides an api with examples.

greggpatton avatar Apr 07 '23 15:04 greggpatton

I managed to run Vicuna 13b using llm-api and used it in Langchain:

I've written an app to run llama-based models using docker here: https://github.com/1b5d/llm-api, thanks to llama-cpp-python and llama.cpp.

You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api in order to use this exposed API with Langchain; it also supports streaming. My goal is to easily run different models locally (and also remotely), switch between them easily, and then use these APIs to develop with Langchain.

To run Vicuna:

  • First configure the API and run it with docker compose up, as described here: https://github.com/1b5d/llm-api
  • Then you can simply make requests to it:
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}'
  • or you can play around with it using langchain via the lib:

pip install langchain-llm-api

from langchain_llm_api import LLMAPI

llm = LLMAPI()
llm("What is the capital of France?")
# output:
# \nThe capital of France is Paris.

1b5d avatar Apr 08 '23 22:04 1b5d

fyi: you can load the vicuna model through huggingface transformers by installing it from their git repo. then just load the tokenizer and model via LlamaTokenizer.from_pretrained(...) and LlamaForCausalLM.from_pretrained(...) and pass them to langchain's HuggingFacePipeline. 4-bit models can also be loaded via GPTQ-for-LLaMa.

so... the model is already supported, you just need to use transformers from git until they release a stable version. vicuna, however, does not play well with langchain... most of the default prompt templates are not suitable, as vicuna fails to follow instructions, and I haven't managed to get reliable results.
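For anyone wanting to try this route, here is a minimal sketch of the approach described above. The model path is a placeholder for wherever your converted Vicuna weights live, and device_map="auto" assumes accelerate is installed; treat it as a starting point rather than a tested recipe.

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline

model_path = "path/to/vicuna-13b"  # placeholder: local dir with converted weights

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")

# Wrap a standard HF text-generation pipeline so LangChain can drive it
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Vicuna-style prompt format
print(llm("### Human: What is the capital of France?\n### Assistant:"))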

knoopx avatar Apr 09 '23 02:04 knoopx

Would be awesome to have Vicuna on-board in langchain integrations list.

Bec-k avatar Apr 10 '23 12:04 Bec-k

@BillSchumacher Interested in tackling this?

slavakurilyak avatar Apr 10 '23 15:04 slavakurilyak

i'm happy to help with this one as well. vicuna's been by far the most promising local model i've tested to date.

fblissjr avatar Apr 10 '23 16:04 fblissjr

This should be relatively easy; later tonight I'm going to make it easier to start a conversation (or multiple) with another project: https://github.com/BillSchumacher/Auto-Vicuna

BillSchumacher avatar Apr 10 '23 16:04 BillSchumacher

I implemented plugins late last night, but I'm going to make it a bit more abstracted so it can integrate with anything easily.

BillSchumacher avatar Apr 10 '23 16:04 BillSchumacher

@BillSchumacher been watching your work in autogpt, you must know this stuff inside and out by now. excited to try it out later tonight - thanks!

fblissjr avatar Apr 10 '23 16:04 fblissjr

@slavakurilyak You can currently run Vicuna models using LlamaCpp if you're okay with CPU inference (I've tested both 7b and 13b models and they work great). Currently there is no LlamaChat class in LangChain (though llama-cpp-python has a create_chat_completion method). Additionally, prompt caching is an open issue (high priority but blocked), so inference is slower than it should be for chat conversations.
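For reference, a minimal sketch of the LlamaCpp route (the model path is a placeholder for your own quantized, ggml-converted Vicuna weights):

from langchain.llms import LlamaCpp

# Placeholder path: point this at your quantized ggml Vicuna model file
llm = LlamaCpp(
    model_path="./models/ggml-vicuna-13b-q4_0.bin",
    n_ctx=2048,      # context window
    temperature=0.7,
)

# Vicuna-style prompt format
print(llm("### Human: What is the capital of France?\n### Assistant:"))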

abetlen avatar Apr 10 '23 17:04 abetlen

You can do GPU inference using my repo, or rather, you have to.

BillSchumacher avatar Apr 10 '23 18:04 BillSchumacher

Did you try llm-api for CPU inference? You can simply run a docker container and expose the model through a simple API. You can then use langchain-llm-api to add llm-api support to langchain. Let me know your thoughts on running Vicuna.

p.s. thanks to @abetlen for llama-cpp-python

1b5d avatar Apr 10 '23 19:04 1b5d

No, I tried CPU inference with LLaMA-Adapter and it was awful.

BillSchumacher avatar Apr 10 '23 19:04 BillSchumacher

Yeah, CPU inference is nowhere near the experience you can get with GPU, even at 4-bit.

fblissjr avatar Apr 10 '23 19:04 fblissjr

I added a function to get a one-shot response to my repo. I have to wait for FastChat to release so that my upstream transformers compatibility fix is in their package.

From source, though, you could import it like:

from auto_vicuna.chat import chat_one_shot

and

from auto_vicuna.conversation import make_conversation

which is the helper function to create a conversation.

Once I can release my package again I'll implement this unless you want to have a source requirement.

There's other stuff like loading the model too, but all that's pretty easy.

BillSchumacher avatar Apr 11 '23 06:04 BillSchumacher

> fyi: you can load the vicuna model through huggingface transformers by installing it from their git repo. then just load the tokenizer and model via LlamaTokenizer.from_pretrained(...) and LlamaForCausalLM.from_pretrained(...) and pass them to langchain's HuggingFacePipeline. 4-bit models can also be loaded via GPTQ-for-LLaMa.

The pipeline doesn't seem to work for LoRA (PEFT). Does it work with GPTQ-for-LLaMa?

gururise avatar Apr 11 '23 07:04 gururise

It would also be nice if it could integrate user suggestions better, so our feedback actually gets taken into account by AutoGPT; that would be a big quality/effectiveness improvement with these models. Enabling a "free mode" or "low cost mode", or "everything the same but also using idle PC resources", would be great too! 🤩

GoMightyAlgorythmGo avatar Apr 12 '23 08:04 GoMightyAlgorythmGo

> I managed to run Vicuna 13b using llm-api and used it in Langchain: [full setup instructions quoted from 1b5d's comment above]

I was wondering if your API supports an embedding interface? I am interested in building a local documentation chat application and would like to obtain the vectors through Vicuna. Would it be possible to provide some guidance or documentation on how to achieve this? Thank you very much for your time and assistance.

bboymimi avatar Apr 13 '23 03:04 bboymimi

> I was wondering if your API supports an embedding interface? I am interested in building a local documentation chat application and would like to obtain the vectors through Vicuna. Would it be possible to provide some guidance or documentation on how to achieve this?

Yes, there is an endpoint for getting embeddings; check out the readme of the API repo. It's also integrated in the langchain lib mentioned above and can be used directly.

1b5d avatar Apr 13 '23 06:04 1b5d

I tried running the agent example for langchain using Vicuna 7b and 13b, but the results were not good. Vicuna doesn't seem to follow the instructions well enough.

BIGPPWONG avatar Apr 14 '23 04:04 BIGPPWONG

@BIGPPWONG how much prompting did you do?

I've just hacked a local setup for Vicuna-7B running on GPU (based on the Hugging Face implementation, not llama.cpp) to work with the Langchain ReAct agent; implementation here: https://github.com/paolorechia/learn-langchain/tree/main

I spent a couple of hours trying to make it fetch a random Chuck Norris joke for me... wasn't easy, I had to build some huge prompts... but eventually it worked. Here's the run: https://gist.github.com/paolorechia/0b8b5e08b38040e7ec10eef237caf3a5

I've never used the OpenAI models to build an agent to compare, but I'm guessing it's a lot easier to use, right?
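For anyone curious what the wiring looks like, here is a rough sketch using any LangChain LLM wrapper that serves Vicuna (the LlamaCpp model path is just a placeholder; the repo above serves the model via Hugging Face instead):

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import LlamaCpp

# Placeholder: any LangChain LLM wrapping Vicuna works here
llm = LlamaCpp(model_path="./models/ggml-vicuna-7b-q4_0.bin", n_ctx=2048)

tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What is 13 raised to the power of 0.5?")

As the comments above suggest, expect to spend time on prompting; Vicuna drifts from the ReAct format much more readily than the hosted models.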

paolorechia avatar Apr 18 '23 21:04 paolorechia

@paolorechia I only used the built-in prompt. your work looks promising.

BIGPPWONG avatar Apr 19 '23 02:04 BIGPPWONG

https://github.com/AlenVelocity/langchain-llama

shubham8550 avatar Apr 29 '23 02:04 shubham8550

@BIGPPWONG @paolorechia not sure if this has anything to do with Vicuna acting clunky, but could be something worth looking into:

Since this is instruction tuned, for best results, use the following format for inference (note that the instruction format is different from Alpaca):

### Human: your-prompt
### Assistant:

from the vicuna-13b-4bit model card.
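For example, a minimal sketch of baking that format into a LangChain prompt (assuming llm is any LangChain LLM already wrapping Vicuna):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Vicuna's instruction format from the model card
template = """### Human: {instruction}
### Assistant:"""

prompt = PromptTemplate(template=template, input_variables=["instruction"])
chain = LLMChain(prompt=prompt, llm=llm)  # llm: any LangChain LLM wrapping Vicuna

print(chain.run("What is the capital of France?"))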

jacobhrussell avatar May 01 '23 14:05 jacobhrussell

Follow-up on the comments above: I've recently updated llm-api to be able to run llama.cpp, GPTQ-for-LLaMa, or a generic huggingface pipeline. You can easily switch between CPU and GPU, for running Llama 2 for example.

1b5d avatar Jul 24 '23 18:07 1b5d

As there are now multiple implementations for Vicuna, I'm closing this issue.

slavakurilyak avatar Oct 13 '23 12:10 slavakurilyak