LLaMA
It would be great to see LangChain integrate with LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
LLaMA is a language model developed to improve upon existing models such as ChatGPT and GPT-3. It has several advantages over these models, such as improved accuracy, faster training times, and more robust handling of out-of-vocabulary words. LLaMA is also more efficient in terms of memory usage and computational resources. In terms of accuracy, LLaMA outperforms ChatGPT and GPT-3 on several natural language understanding tasks, including sentiment analysis, question answering, and text summarization. Additionally, LLaMA can be trained on larger datasets, enabling it to better capture the nuances of natural language. Overall, LLaMA is a more powerful and efficient language model than ChatGPT and GPT-3.
Here's the official repo by @facebookresearch. Here's the research abstract and PDF, respectively.
Note, this project is not to be confused with LlamaIndex (previously GPT Index) by @jerryjliu.
@conceptofmind i believe you said you were working on this?
Yes, actively working on this with a group of peers. We have successfully deployed inference with the 65B model. Working on a LangChain wrapper now.
Would have to think about how to handle the sizes of the different models, though. I could see this becoming an issue for the end user.
There is some ongoing work to use GPTQ to compress the models to 3 or 4 bits in this repo. There is also a discussion going on over at the oobabooga repo.
Not sure if this is going to work, but it might be something to keep an eye on. If it works out, it could be possible to run the larger models on a single consumer-grade GPU.
The original paper is available here on arxiv.
4-bit may be plausible; 8-bit should be fine. The weights are already in fp16, from my understanding. I would have to evaluate this further.
Yes, the weights are fp16. You can convert and run 4-bit using https://github.com/ggerganov/llama.cpp. I think 30B at full precision might be at least on par with 65B at 4-bit in terms of results. Llama.cpp runs on CPU, including Apple Silicon, which might be a good choice for developers with recent MacBooks: they could develop and run experiments locally with LangChain without needing a GPU.
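For reference, here is a minimal sketch of running a 4-bit quantized model on CPU through the llama-cpp-python bindings (mentioned further down this thread); the model path is a placeholder and assumes you have already converted and quantized the weights with llama.cpp's tooling:

```python
# Minimal sketch: run a 4-bit quantized LLaMA model on CPU with llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a quantized model produced by
# llama.cpp's convert/quantize scripts; the path below is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path
out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:", "\n"])
print(out["choices"][0]["text"])
```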
Confirmed working on a single consumer-grade 4090 here with 13B. Waiting on the 30B 4-bit weights; failed when trying to run them at fp16. :)
I am aware of all these alternatives. We are waiting to hear back from Huggingface before the decision is made. Once we have a concrete answer from them we will proceed from there.
I have some concerns about llama.cpp, since the author seems to have noted he has no interest in maintaining it. And there are other things to factor in when adding dependencies that cannot be easily installed. It needs to be a relatively effortless setup for the best user experience.
Using the GPTQ 4-bit quantized 30B model, outputs are (as far as I can tell) very good. I hope to see GPTQ 4-bit support in LangChain. GPTQ quantization appears to be better than the 4-bit RTN quantization currently used in llama.cpp.
4-bit 30B model confirmed working on an OLD Tesla P40 GPU (24GB).
Any info on running the 7B model with LangChain?
It'd be really neat if that's going to be an option :smile: Sure it's slow but hey you can run it on a literal laptop.
Llama has been added to Huggingface: https://github.com/huggingface/transformers/pull/21955
The only reason to add a specific wrapper would be to include the performance improvements from llama.cpp or GPTQ.
I think you are talking about a Python wrapper, so I'm going to write a TS wrapper for llama.cpp and alpaca.cpp for localhost private usage, if no one is working on this yet.
I will try to extend the BaseLLM class to do so.
Here you are:
https://github.com/linonetwo/langchain-alpaca
https://www.npmjs.com/package/langchain-alpaca
It works on all platforms and runs fully locally.
For now, I will try to make a langchain-llama package.
I'm eagerly waiting to try it for a project :D !!!
If anyone's interested, I've made a pass at wrapping the llama.cpp shared library using ctypes and deriving a custom LLM class for it. https://gist.github.com/asgeir/3dd75109133b218bf62bab5ddfcbb387
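For anyone curious what such a wrapper roughly looks like, here is a minimal sketch of a custom LangChain LLM class; `generate_with_llama_cpp` is a hypothetical stand-in for whatever ctypes call or local binding you use:

```python
# Minimal sketch of a custom LangChain LLM that delegates to a local llama.cpp call.
# `generate_with_llama_cpp` is a hypothetical placeholder for a ctypes binding
# (or any other local inference function).
from typing import List, Optional

from langchain.llms.base import LLM


def generate_with_llama_cpp(prompt: str) -> str:
    raise NotImplementedError("plug in your ctypes / llama.cpp call here")


class LocalLlamaLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "local-llama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        text = generate_with_llama_cpp(prompt)
        # Crude client-side handling of stop sequences.
        if stop:
            for token in stop:
                text = text.split(token)[0]
        return text
```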
FYI: I just submitted this pull request to integrate llama.cpp into langchain: https://github.com/hwchase17/langchain/pull/2242
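Once that PR is merged, usage would presumably look something like the following; this is only a sketch, and the `LlamaCpp` class name, import path, and model path all assume the PR lands roughly as submitted:

```python
# Sketch of calling the proposed llama.cpp wrapper from LangChain.
# Assumes the PR exposes a `LlamaCpp` LLM class and that a quantized model
# already exists on disk; the path is a placeholder.
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path="./models/7B/ggml-model-q4_0.bin")
print(llm("Question: What is the capital of France? Answer:"))
```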
Thank you very much!!
Do you think it would be possible to run LLaMA on GPU as well somehow?
You are able to load LLaMA through Hugging Face and use it in a GPU-accelerated environment. https://huggingface.co/docs/transformers/main/en/model_doc/llama
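For example, something along these lines should work for GPU inference through the transformers integration while still plugging into LangChain; the checkpoint path is a placeholder for wherever your converted Hugging Face-format weights live:

```python
# Sketch: GPU inference with LLaMA via transformers, wrapped for LangChain.
# The checkpoint path is a placeholder for locally converted HF-format weights;
# device_map="auto" requires the accelerate package.
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

tokenizer = LlamaTokenizer.from_pretrained("/path/to/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("/path/to/llama-7b-hf", device_map="auto")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm("What is the capital of France?"))
```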
I also added Kobold/text-generation-webui support so you can run Llama or whatever you want locally. I only tested it a bit, but it worked well back when I made it. I didn't intend on making a PR or maintaining it though, so anyone can feel free to take it and hack on it: https://github.com/hwchase17/langchain/compare/master...kooshi:langchain:kobold-api
I've written an app to run LLaMA-based models using Docker here: https://github.com/1b5d/llm-api (thanks to llama-cpp-python and llama.cpp). You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api in order to use this exposed API with LangChain; it also supports streaming. My goal is to easily run different models locally (and also remotely), switch between them easily, and then use these APIs to develop with LangChain.
To run it:
- First configure and run the API with docker compose up, as described here: https://github.com/1b5d/llm-api
- Then you can simply make requests to it:
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "What is the capital of France?",
"params": {
...
}
}'
- Or you can play around with it using LangChain via the lib:
pip install langchain-llm-api
from langchain_llm_api import LLMAPI
llm = LLMAPI()
llm("What is the capital of France?")
...
\nThe capital of France is Paris.
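Assuming the LLMAPI class above behaves like any other LangChain LLM, it should also drop straight into a chain; a quick sketch:

```python
# Sketch: using the langchain-llm-api LLM inside a standard LangChain chain.
# Assumes the llm-api server is already running locally, as set up above.
from langchain import LLMChain, PromptTemplate
from langchain_llm_api import LLMAPI

prompt = PromptTemplate(
    input_variables=["country"],
    template="What is the capital of {country}?",
)
chain = LLMChain(llm=LLMAPI(), prompt=prompt)
print(chain.run(country="France"))
```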
Did you happen to test your Kobold API support with https://github.com/oobabooga/text-generation-webui? I haven't dug into Kobold enough to know if the APIs are similar enough.
Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, this issue is a request for LangChain to integrate with LLaMA, a more powerful and efficient language model developed by Facebook Research. There has been ongoing work to use GPTQ to compress the models to 3 or 4 bits, and there has been a discussion about running LLaMA on GPUs. Additionally, a Python wrapper for llama.cpp has been created, and there are plans to create a TS wrapper as well. It's worth mentioning that LLaMA has been added to Hugging Face, and there are other alternatives like Kobold/text-generation-webui and langchain-llm-api.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository, and please don't hesitate to reach out if you have any further questions or concerns!
Best regards, Dosu