
LLaMA

Open slavakurilyak opened this issue 2 years ago • 16 comments

It would be great to see LangChain integrate with LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.

LLaMA was developed to improve upon existing models such as ChatGPT and GPT-3. It is reported to offer better accuracy for its size, faster training, and more robust handling of rare words, while being more efficient in memory usage and computational resources. On several natural language understanding tasks, including sentiment analysis, question answering, and text summarization, LLaMA is claimed to outperform ChatGPT and GPT-3 despite being far smaller, and because it is trained on a larger corpus of publicly available data it captures the nuances of natural language well. Overall, LLaMA promises comparable or better quality than these models at a fraction of the computational cost.

Here's the official repo by @facebookresearch, and here are the research abstract and PDF.

Note: this project is not to be confused with LlamaIndex (previously GPT Index) by @jerryjliu.

slavakurilyak avatar Mar 06 '23 17:03 slavakurilyak

@conceptofmind I believe you said you were working on this?

hwchase17 avatar Mar 06 '23 18:03 hwchase17

@conceptofmind I believe you said you were working on this?

Yes, actively working on this with a group of peers. We have successfully deployed inference with the 65B model. Working on a LangChain wrapper now.

conceptofmind avatar Mar 06 '23 18:03 conceptofmind

We would have to think about how to handle the sizes of the different models, though. I could see this becoming an issue for the end user.

conceptofmind avatar Mar 06 '23 18:03 conceptofmind

There is some ongoing work in this repo to use GPTQ to compress the models to 3 or 4 bits, and there is also a discussion going on over at the oobabooga repo.

Not sure whether this will pan out, but it might be something to keep an eye on. If it works, it could make it possible to run the larger models on a single consumer-grade GPU.

The original paper is available here on arXiv.

Electomanic avatar Mar 06 '23 20:03 Electomanic

4-bit may be plausible; 8-bit should be fine. The weights are already in fp16, from my understanding. I would have to evaluate this further.

conceptofmind avatar Mar 06 '23 22:03 conceptofmind

Yes, the weights are fp16. You can convert them and run 4-bit inference using https://github.com/ggerganov/llama.cpp. I think 30B at full precision might be at least on par with 65B at 4-bit in terms of results. Llama.cpp runs on CPU, including Apple Silicon, which might be a good choice for developers with recent MacBooks: they could develop and run experiments locally with LangChain without needing a GPU.
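
For anyone who wants to poke at this from Python, something along these lines should work with the llama-cpp-python bindings installed (a rough sketch only; the model path is a placeholder for a 4-bit file produced by llama.cpp's convert/quantize scripts):

# Sketch: assumes `pip install llama-cpp-python` and a 4-bit model converted
# with llama.cpp; the path below is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
result = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(result["choices"][0]["text"])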

jooray avatar Mar 12 '23 12:03 jooray

There is some ongoing work in this repo to use GPTQ to compress the models to 3 or 4 bits, and there is also a discussion going on over at the oobabooga repo.

Not sure whether this will pan out, but it might be something to keep an eye on. If it works, it could make it possible to run the larger models on a single consumer-grade GPU.

The original paper is available here on arXiv.

Confirmed working with 13B on a single consumer-grade 4090 here. Waiting on the 30B 4-bit weights; I failed when trying to run them at fp16. :)

fblissjr avatar Mar 12 '23 17:03 fblissjr

I am aware of all these alternatives. We are waiting to hear back from Huggingface before the decision is made. Once we have a concrete answer from them, we will proceed from there.

I have some concerns about llama.cpp, since the author seems to have noted he has no interest in maintaining it. There are also other things to factor in when adding dependencies that cannot be easily installed; it needs to be a relatively effortless setup for the best user experience.

conceptofmind avatar Mar 12 '23 17:03 conceptofmind

Using the GPTQ 4-bit quantized 30B model, the outputs are (as far as I can tell) very good. I hope to see GPTQ 4-bit support in LangChain. GPTQ quantization appears to be better than the 4-bit RTN quantization currently used in llama.cpp.

4-bit 30B model confirmed working on an OLD Tesla P40 GPU (24GB).

gururise avatar Mar 13 '23 16:03 gururise

Any info on running the 7B model with LangChain?

DamascusGit avatar Mar 14 '23 03:03 DamascusGit

Yes, the weights are fp16. You can convert them and run 4-bit inference using https://github.com/ggerganov/llama.cpp. I think 30B at full precision might be at least on par with 65B at 4-bit in terms of results. Llama.cpp runs on CPU, including Apple Silicon, which might be a good choice for developers with recent MacBooks: they could develop and run experiments locally with LangChain without needing a GPU.

It'd be really neat if that becomes an option :smile: Sure, it's slow, but hey, you can run it on a literal laptop.

niansa avatar Mar 15 '23 18:03 niansa

Llama has been added to Huggingface Transformers: https://github.com/huggingface/transformers/pull/21955

The only reason to add a specific wrapper would be to include the performance improvements from llama.cpp or GPTQ.
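
For reference, loading it through the new Transformers classes looks roughly like this (a sketch; the checkpoint path is a placeholder for wherever you keep converted weights):

# Sketch: assumes a recent transformers release with the Llama classes and
# locally converted weights; "path/to/llama-7b-hf" is a placeholder.
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))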

conceptofmind avatar Mar 16 '23 15:03 conceptofmind

I think you are talking about a Python wrapper, so I'm going to write a TS wrapper for llama.cpp and alpaca.cpp for localhost private usage, if no one is working on this yet.

I will try to extend the BaseLLM class to do so.

linonetwo avatar Mar 19 '23 09:03 linonetwo

Here you are:

https://github.com/linonetwo/langchain-alpaca

https://www.npmjs.com/package/langchain-alpaca

It works on all platforms and runs fully locally.

For now, I will try to make a langchain-llama package.

linonetwo avatar Mar 20 '23 18:03 linonetwo

I'm eagerly waiting to try it for a project :D !!!

wiz64 avatar Mar 22 '23 18:03 wiz64

If anyone's interested, I've made a pass at wrapping the llama.cpp shared library using ctypes and deriving a custom LLM class for it. https://gist.github.com/asgeir/3dd75109133b218bf62bab5ddfcbb387
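
For anyone curious, the general shape of such a wrapper is just a small LLM subclass; below is a rough sketch of that interface (not the gist's actual code, and `call_llama_cpp` is a hypothetical stand-in for the ctypes call into the shared library):

# Rough sketch of a custom LangChain LLM; `call_llama_cpp` is a hypothetical
# stand-in for a ctypes binding into the llama.cpp shared library.
from typing import List, Optional
from langchain.llms.base import LLM

class LlamaCppCtypesLLM(LLM):
    model_path: str

    @property
    def _llm_type(self) -> str:
        return "llama-cpp-ctypes"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Delegate generation to the native library.
        return call_llama_cpp(self.model_path, prompt, stop=stop)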

asgeir avatar Mar 24 '23 09:03 asgeir

FYI: I just submitted this pull request to integrate llama.cpp into langchain: https://github.com/hwchase17/langchain/pull/2242
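
Once merged, usage should look roughly like the other local LLM wrappers (a sketch only; the model path is a placeholder for a 4-bit file converted with llama.cpp):

# Sketch, assuming the LlamaCpp wrapper from the PR above; the model path
# below is a placeholder for a locally converted 4-bit model file.
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path="./models/7B/ggml-model-q4_0.bin")
print(llm("What is the capital of France?"))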

rjadr avatar Mar 31 '23 19:03 rjadr

FYI: I just submitted this pull request to integrate llama.cpp into langchain: #2242

Thank you very much!!

Do you think it would be possible to run LLaMA on GPU as well somehow?

juanps90 avatar Apr 05 '23 18:04 juanps90

FYI: I just submitted this pull request to integrate llama.cpp into langchain: #2242

Thank you very much!!

Do you think it would be possible to run LLaMA on GPU as well somehow?

You are able to load Llama through Huggingface Transformers and use it in a GPU-accelerated environment. https://huggingface.co/docs/transformers/main/en/model_doc/llama
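
A rough sketch of what that could look like end to end, wrapped for LangChain (the checkpoint path is a placeholder; device_map="auto" assumes a CUDA GPU and the accelerate package):

# Sketch: GPU-accelerated Llama via Transformers, exposed to LangChain through
# HuggingFacePipeline. The checkpoint path is a placeholder.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers across available GPUs
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("What is the capital of France?"))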

conceptofmind avatar Apr 05 '23 18:04 conceptofmind

I also added Kobold/text-generation-webui support so you can run Llama or whatever you want locally. I only tested it a bit, but it worked well back when I made it. I didn't intend on making a PR or maintaining it though, so anyone can feel free to take it and hack on it: https://github.com/hwchase17/langchain/compare/master...kooshi:langchain:kobold-api

kooshi avatar Apr 05 '23 22:04 kooshi

I've written an app to run Llama-based models using Docker here: https://github.com/1b5d/llm-api (thanks to llama-cpp-python and llama.cpp). You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api to call this exposed API from LangChain; it also supports streaming. My goal is to easily run different models locally (and also remotely), switch between them easily, and then use these APIs to develop with LangChain.

To run it:

  • First configure the API and run docker compose up as described here: https://github.com/1b5d/llm-api
  • Then you can simply make requests to it:
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}'
  • Or you can play around with it from LangChain via the lib:
pip install langchain-llm-api

from langchain_llm_api import LLMAPI

llm = LLMAPI()
llm("What is the capital of France?")
# ...
# "\nThe capital of France is Paris."

1b5d avatar Apr 08 '23 22:04 1b5d

I also added Kobold/text-generation-webui support so you can run Llama or whatever you want locally. I only tested it a bit, but it worked well back when I made it. I didn't intend on making a PR or maintaining it though, so anyone can feel free to take it and hack on it: master...kooshi:langchain:kobold-api

Did you happen to test this with https://github.com/oobabooga/text-generation-webui? I haven't dug into Kobold enough to know whether the APIs are similar enough.

fblissjr avatar Apr 10 '23 16:04 fblissjr

Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, this issue is a request for LangChain to integrate with LLaMA, a more powerful and efficient language model developed by Facebook Research. There has been ongoing work to use GPTQ to compress the models to 3 or 4 bits, and there has been a discussion about running LLaMA on GPUs. Additionally, a Python wrapper for llama.cpp has been created, and there are plans to create a TS wrapper as well. It's worth mentioning that Llama has been added to Huggingface, and there are other alternatives like Kobold/text-generation-webui and langchain-llm-api.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository, and please don't hesitate to reach out if you have any further questions or concerns!

Best regards, Dosu

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]