langchain
Alpaca (Fine-tuned LLaMA)
It would be great to see LangChain integrate with Stanford's Alpaca 7B model, a fine-tuned LLaMA (see https://github.com/hwchase17/langchain/issues/1473).
Stanford created an AI able to generate outputs that were largely on par with OpenAI's text-davinci-003 and regularly better than GPT-3, all for a fraction of the computing power and price.
Stanford's Alpaca is a language model that was fine-tuned from Meta's LLaMA with 52,000 instructions generated by GPT-3.5[1]. The researchers used AI-generated instructions to train Alpaca 7B, which exhibits many GPT-3.5-like behaviors[1]. In a blind test using input from the Self-Instruct Evaluation Set, both models performed comparably[1][2]. However, Alpaca has problems common to other language models such as hallucinations, toxicity, and stereotyping[1]. The team is releasing an interactive demo, the training dataset, and the training code for research purposes[1][4].
Alpaca is still under development and has limitations that need to be addressed. The researchers have not yet fine-tuned the model to be safe and harmless[2]. They encourage users to be cautious when interacting with Alpaca and report any concerning behavior to help improve it[2].
LLaMA is a new open-source language model from Meta Research that performs as well as closed-source models. Stanford's Alpaca is a fine-tuned version of LLaMA that can respond to instructions like ChatGPT[3]. It functions more like a fancy version of autocomplete than a conversational bot[3].
The researchers are releasing their findings about an instruction-following language model dubbed Alpaca. They trained the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI's text-davinci-003 but is also surprisingly small and easy/cheap to reproduce[4].
Also worth considering Instruct-tuned LLaMA models that can run on consumer hardware, such as https://github.com/tloen/alpaca-lora.
A basic LangChainJS package for Alpaca is here https://github.com/linonetwo/langchain-alpaca
Great!! but would be nicer if there's a python wrapper
I hacked together a working Python LLM class based on Serge, which is MIT licensed: https://gist.github.com/lukestanley/6517823485f88a40a09979c1a19561ce I am new to LangChain and very new to making LLM classes, and streaming is complicated. But this works for each streaming situation I tried. Someone more familiar could probably do a much better job sooner than me but as there is a lot of interest in breaking free of the cloud and finally being able to work locally I figured I'd put this out now before polishing it nicely. @iRanadheer @slavakurilyak
I've written an app to run LLaMA-based models using Docker here: https://github.com/1b5d/llm-api (thanks to llama-cpp-python and llama.cpp). You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api to use this exposed API with LangChain; it also supports streaming. My goal is to easily run different models locally (and also remotely) and switch between them easily, then use these APIs to develop with LangChain.
To run Alpaca:
- First configure the API as described here: https://github.com/1b5d/llm-api and run it with
docker compose up
- Then you can simply make requests to it
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "What is the capital of France?",
"params": {
...
}
}'
- or you can play around with it using langchain via the lib
pip install langchain-llm-api
from langchain_llm_api import LLMAPI
llm = LLMAPI()
llm("What is the capital of France?")
...
\nThe capital of France is Paris.
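For anyone who prefers to call the API from plain Python rather than curl, the request above can be sketched with the standard library alone. This is a minimal sketch, not part of llm-api: the function names are made up for illustration, the contents of "params" are left empty because the thread elides them, and the response is returned as raw text because its format is not shown here.

```python
import json
import urllib.request


def build_generate_request(prompt, params=None, base_url="http://localhost:8000"):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {"prompt": prompt, "params": params or {}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, params=None, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the raw response body as text."""
    request = build_generate_request(prompt, params, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, `generate("What is the capital of France?")` should return whatever body the server sends back.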
This is interesting, thanks for sharing it @1b5d. I blinked and now there is a Python wrapper for llama.cpp! To best help most people here, it would be ideal to improve LangChain to support Alpaca / LLaMA without adding a dependency on Docker or on running a web server.
llama-cpp-python seems like a good wrapper around llama.cpp. It could be wrapped with a class that implements the LLM class with streaming methods, like my gist does, but without the class itself running the command directly. That way no Docker or web server is needed, and it could have fewer dependencies to run.
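To make the shape of that idea concrete, here is a rough sketch of an in-process wrapper with a streaming method and a blocking call built on top of it. It deliberately does not use the real LangChain base class or llama-cpp-python: `LocalLLM` and `fake_backend` are hypothetical names, and the backend is just any callable that yields text chunks for a prompt.

```python
from typing import Callable, Iterator


class LocalLLM:
    """Minimal in-process wrapper: no Docker, no web server, no subprocess."""

    def __init__(self, token_stream: Callable[[str], Iterator[str]]):
        # token_stream is a stand-in for a real binding such as llama-cpp-python.
        self._token_stream = token_stream

    def stream(self, prompt: str) -> Iterator[str]:
        """Yield text chunks as the backend produces them."""
        yield from self._token_stream(prompt)

    def __call__(self, prompt: str) -> str:
        """Blocking call: join the streamed chunks into one completion."""
        return "".join(self.stream(prompt))


def fake_backend(prompt: str) -> Iterator[str]:
    """Hypothetical backend used only to show the control flow."""
    for chunk in ["The capital ", "of France ", "is Paris."]:
        yield chunk
```

Implementing the blocking call in terms of the streaming one keeps a single code path, so both modes behave the same.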
I've seen the good work integrating my gist done on the alpaca branch by @robertschulze and @hwchase17: https://github.com/hwchase17/langchain/compare/harrison/alpaca
The paths look tricky. With remote models we don't have to think about file paths very often, but we often have to think about API keys. The OpenAI module checks for a well-known environment variable first. Are there places where this project specifies file paths in a nice way with user overrides? The code needs a path to the LLM model and a path to the llama.cpp executable. Ideally we wouldn't have to think about paths for anything, but we need to know where the model is until we're allowed to download it ourselves (which is tricky license-wise). We could probably eliminate needing to know where llama.cpp is if we built it as part of the project, but that might be more tricky (or maybe not, if llama-cpp-python does that). I suppose the default thing to do would be to have it added to the PATH shell variable.
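One possible convention, sketched below, mirrors how API keys are usually handled: an explicit argument wins, then an environment variable, and the executable falls back to a PATH lookup. The variable name `LLAMA_MODEL_PATH` and both function names are assumptions made for this sketch, not anything LangChain or llama.cpp defines.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def resolve_model_path(explicit: Optional[str] = None,
                       env_var: str = "LLAMA_MODEL_PATH") -> Path:
    """Return the model file: explicit argument first, then the environment."""
    candidate = explicit or os.environ.get(env_var)
    if not candidate:
        raise ValueError(f"Pass a model path or set the {env_var} environment variable")
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path


def resolve_executable(name: str, explicit: Optional[str] = None) -> str:
    """Return the executable: explicit path first, then a lookup on PATH."""
    found = explicit or shutil.which(name)
    if not found:
        raise FileNotFoundError(f"{name!r} was not found on PATH")
    return found
```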
Actually, looking at llama-cpp-python, it introduces an env var called "MODEL", which seems a bit too generic but may be okay as a first pass. It would be good if it was prebuilt for my platform (x86) but I think this is better than my gist. It built correctly on my system first time (Ubuntu with build-essential etc already installed). One thing that is much better than my gist is the fact that it does not need to reload for each call!
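The "no reload for each call" property can be had in any wrapper by keeping loaded models in memory, keyed by path. The sketch below is only an illustration of that pattern: `ModelCache` is a hypothetical name and `loader` stands for whatever function actually loads the weights.

```python
from typing import Any, Callable, Dict


class ModelCache:
    """Load each model at most once and reuse it for later calls."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader          # hypothetical: the real weight-loading function
        self._models: Dict[str, Any] = {}

    def get(self, model_path: str) -> Any:
        # Only hit the (slow) loader the first time a path is requested.
        if model_path not in self._models:
            self._models[model_path] = self._loader(model_path)
        return self._models[model_path]
```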
I did some work on it here! Not sure it's PR worthy yet.
There are use cases for both local and remote model inference, I believe. I want to run my models on a remote server, while others might have enough hardware power on their local machines.
I realised there is a Llama class already integrated. Better put something in my eyes so I don't need to blink! I have a draft PR that adds streaming here: https://github.com/hwchase17/langchain/pull/2775
there's alpaca on huggingface, i think you can do something like this:
llm = HuggingFaceHub(repo_id="chavinlo/alpaca-native", model_kwargs={"temperature":0, "max_length":64})
But this would not run the model locally.
You can use the HuggingFacePipeline class, something like this:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
model_path = 'your_local_model_path'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
alpaca_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
local_llm = HuggingFacePipeline(pipeline=alpaca_pipe)
:clap: Thank you @iRanadheer - Information on how to integrate local models with langchain is sparse. This little tidbit is helpful. If you know of any repos that provide toy examples of such integration, please post a link!
The LangChain repo has documentation on how to use LlamaCpp within LangChain with a LLM class like the OpenAI one. The LlamaCpp LLM class is probably faster and easier to setup for most people, and is documented here: https://python.langchain.com/en/latest/modules/models/llms/integrations/llamacpp.html and here: https://python.langchain.com/en/latest/ecosystem/llamacpp.html Like the underlying code, it can generate with Alpaca models too. I am working on adding streaming support to it here. This is happening fast and it might be hard to find, I think this issue might deserve closing soon since it works. I'll probably make a gif of it working with a funny example. Any ideas on other ways it could be more visible? @tensiondriven
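For readers who land here first, usage looks roughly like the sketch below. It assumes the import path used in the linked documentation at the time (`langchain.llms.LlamaCpp`), which may have moved in later releases, and it needs both langchain and llama-cpp-python installed. The helper name and the model path in the usage note are placeholders.

```python
def build_llamacpp_llm(model_path: str):
    """Create a LangChain LLM backed by a local llama.cpp model file."""
    # Imported lazily so that this module loads even without the packages installed.
    from langchain.llms import LlamaCpp  # assumption: import path from the linked docs

    return LlamaCpp(model_path=model_path)
```

For example, `llm = build_llamacpp_llm("./models/your-alpaca-model.bin")` followed by `llm("What is the capital of France?")`, where the file name is whatever converted model you have locally.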
@lukestanley Thanks for asking!
I saw the LlamaCpp example, but my understanding is that the -Cpp interface/classes are very different from huggingface/pt/safetensors.
I agree that Cpp is more accessible for more people, but it doesn't address what I think is a pretty common use case: Using HF-format models locally for GPU inference with CUDA/Triton.
When I read the docs, I saw the bit about "integrating with runway, and also local" (was it called runway? I'm not sure, I'm on my phone at the moment)
What would have made it more clear is if there was a specific page for "Local LLM" that gives an example which takes a path to a safetensors file and includes setup of the required dependencies (PyTorch, AutoModelForCausalLM, etc.)
Bonus points if it includes an example for loading with GPTQ 4-bit, as a lot of people need to run 4-bit locally due to limited VRAM.
Extra bonus points if it includes an example of loading a safetensors model with a Lora.
I realize these implementation details are not specifically a concern of langchain, however given how fiddly the python deps are, I still think it would be useful to show a full example of standing up a GPU-based local model using a safetensors model as its own page in the docs under LLM's.
As a side-note, there are a few bugs in the chat-demo project, which there are fixes for in the GitHub Issues, but those small bugs do add significant friction for people who want to get langchain up and running (with either OpenAI or a different/local LLM).
For one additional bit of context, I think a lot of people are coming from using text-generation-webui, which, while being pretty poorly organized code, at least gives people exposure to the python dependency stack and some of the parameters used in loading language models. For me, getting these dependencies and parameters right was difficult, and the langchain docs brush over this. I found the above comment very useful since it showed how to load an LLM in python and then pass it to a Langchain LLM adapter (I know it's trivial to do, I still found it to be the missing link for me)
For a little more context on where I'm coming from: 10-year Ruby/JavaScript/Elixir dev, professional and hobbyist, with some Python experience, but Python is not my first language.
Tldr; Need a Local HF safetensors/pt example with all the necessary imports alongside all the hosted LLM examples
Does this help?
(Also, very excited about streaming support for local LLM's! Will be happy to test this if that's helpful)
For those of you who are interested in using transformers with LangChain's HuggingFacePipeline, I was able to run Llama locally using the HuggingFacePipeline provided as part of langchain.llms. It works pretty well using the 8-bit models, and I have an example use here: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/src/my_langchain_models.py#L135-L214
If you would like to also use lora (with alpaca fine tuning), it is possible but you will get a warning message saying PeftModelForCausalLM is not supported for the transformer pipeline, but as far as I can tell, everything works fine. This is how I implemented Peft model with lora: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/src/my_langchain_models.py#L65-L99
Here is a more readable example loading both base Llama and the alpaca lora: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/example.py
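For a quick idea of what the LoRA path involves without opening the repo, a stripped-down sketch follows. It is not the code from llama-at-home: the function name and both path arguments are placeholders, 8-bit loading is left out, and it assumes transformers and peft are installed. As noted above, the pipeline may warn that PeftModelForCausalLM is not a supported model class.

```python
def load_alpaca_lora_pipeline(base_model_path: str, lora_path: str):
    """Load a base LLaMA model, apply LoRA weights, return a text-generation pipeline."""
    # Lazy imports: transformers and peft are heavy, optional dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
    # Wrap the base model with the LoRA adapter weights stored at lora_path.
    model = PeftModel.from_pretrained(base_model, lora_path)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

The returned pipeline can be passed to `HuggingFacePipeline(pipeline=...)` in the same way as the earlier local-model example.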
I also have some wrappers I wrote to help with integrating various components of langchain like docs with vectorstore and agent executors. Feel free to browse around my repo and grab anything you'd like.
Bonus points if it includes an example for loading with GPTQ 4-bit, as a lot of people need to run 4-bit locally due to limited VRAM.
For that you need to do a bit more:
1 - first git clone GPTQ + cuda_setup.py: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda
2 - make a crude hack inspired by text-generation-webui
import re
import sys
from pathlib import Path

import accelerate
import torch
import torch.nn as nn
import transformers
from transformers import AutoConfig, AutoModelForCausalLM

# Obviously dependent on where your file is
path = Path(__file__).parent.parent
dirGPTQ = path.joinpath("text-generation-webui/repositories/GPTQ-for-LLaMa/")
sys.path.insert(0, str(dirGPTQ))
from quant import make_quant
# from modelutils import find_layers

DEV = torch.device('cuda:0')


def find_layers(module, layers=[nn.Conv2d, nn.Linear], name=''):
    # Recursively collect every Conv2d/Linear layer, keyed by its dotted name
    if type(module) in layers:
        return {name: module}
    res = {}
    for name1, child in module.named_children():
        res.update(find_layers(
            child, layers=layers, name=name + '.' + name1 if name != '' else name1
        ))
    return res


def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exclude_layers=['lm_head'], kernel_switch_threshold=128):
    config = AutoConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    # Skip weight initialisation: the weights are overwritten by the checkpoint anyway
    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = AutoModelForCausalLM.from_config(config)
    torch.set_default_dtype(torch.float)
    model = model.eval()
    layers = find_layers(model)
    for name in exclude_layers:
        if name in layers:
            del layers[name]
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)

    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint))
    else:
        model.load_state_dict(torch.load(checkpoint))
    model.seqlen = 2048
    print('Done.')

    return model


def load_quantized(model_name, args):
    if "model_type" not in args:
        # Try to determine model type from model name
        name = model_name.lower()
        if any((k in name for k in ['llama', 'alpaca'])):
            model_type = 'llama'
        elif any((k in name for k in ['opt-', 'galactica'])):
            model_type = 'opt'
        elif any((k in name for k in ['gpt-j', 'pygmalion-6b'])):
            model_type = 'gptj'
        else:
            print("Can't determine model type from model name. Please specify it manually using --model_type "
                  "argument")
            exit()
    else:
        model_type = args["model_type"].lower()

    if model_type == 'llama' and "pre_layer" in args:
        # Assumed to be importable from the GPTQ-for-LLaMa checkout added to
        # sys.path above, as in text-generation-webui
        import llama_inference_offload
        load_quant = llama_inference_offload.load_quant
    elif model_type in ('llama', 'opt', 'gptj'):
        load_quant = _load_quant
    else:
        print("Unknown pre-quantized model type specified. Only 'llama', 'opt' and 'gptj' are supported")
        exit()

    # Now we are going to try to locate the quantized model file.
    # Note: the key is "model-dir", and its value must end with a path separator
    path_to_model = Path(f'{args["model-dir"]}{model_name}')
    found_pts = list(path_to_model.glob("*.pt"))
    found_safetensors = list(path_to_model.glob("*.safetensors"))
    pt_path = None

    if len(found_pts) == 1:
        pt_path = found_pts[0]
    elif len(found_safetensors) == 1:
        pt_path = found_safetensors[0]
    else:
        if path_to_model.name.lower().startswith('llama-7b'):
            pt_model = f'llama-7b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-13b'):
            pt_model = f'llama-13b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-30b'):
            pt_model = f'llama-30b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-65b'):
            pt_model = f'llama-65b-{args["wbits"]}bit'
        else:
            pt_model = f'{model_name}-{args["wbits"]}bit'

        # Try to find the .safetensors or .pt both in models/ and in the subfolder
        for path in [Path(p + ext) for ext in ['.safetensors', '.pt'] for p in [f"models/{pt_model}", f"{path_to_model}/{pt_model}"]]:
            if path.exists():
                print(f"Found {path}")
                pt_path = path
                break

    if not pt_path:
        print("Could not find the quantized model in .pt or .safetensors format, exiting...")
        exit()

    # qwopqwop200's offload
    if "pre_layer" in args:
        model = load_quant(str(path_to_model), str(pt_path), args["wbits"], args["groupsize"], args["pre_layer"])
    else:
        threshold = False if model_type == 'gptj' else 128
        model = load_quant(str(path_to_model), str(pt_path), args["wbits"], args["groupsize"], kernel_switch_threshold=threshold)

        # accelerate offload (doesn't work properly)
        if "gpu_memory" in args:
            memory_map = list(map(lambda x: x.strip(), args["gpu_memory"]))
            max_cpu_memory = args["cpu_memory"].strip() if "cpu_memory" in args else '99GiB'
            max_memory = {}
            for i in range(len(memory_map)):
                max_memory[i] = f'{memory_map[i]}GiB' if not re.match('.*ib$', memory_map[i].lower()) else memory_map[i]
            max_memory['cpu'] = max_cpu_memory

            device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
            print("Using the following device map for the 4-bit model:", device_map)
            # https://huggingface.co/docs/accelerate/package_reference/big_modeling#accelerate.dispatch_model
            model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)

        # No offload
        elif "cpu" not in args:
            model = model.to(torch.device('cuda:0'))

    return model
3 - instead of
model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
import the local model with
arg["wbits"] = 4
arg["groupsize"] = 128
model = load_quantized(model_id, arg)
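The file-location step inside load_quantized is easy to lose in the noise, so here it is isolated as a small standalone sketch. It reproduces only the first part of that logic (exactly one .pt file, otherwise exactly one .safetensors file) and leaves out the name-based fallback. The function name is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def find_single_checkpoint(model_dir: Path) -> Optional[Path]:
    """Return the only .pt file, else the only .safetensors file, else None."""
    found_pts = sorted(model_dir.glob("*.pt"))
    found_safetensors = sorted(model_dir.glob("*.safetensors"))
    if len(found_pts) == 1:
        return found_pts[0]
    if len(found_safetensors) == 1:
        return found_safetensors[0]
    # Zero or several candidates: the caller has to fall back to a naming convention.
    return None
```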
@msenneret
Thanks for your answer. I am able to load the model now. The problem is when I create a pipeline and try to make inference using langchain it is giving this error.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
Not sure how to configure transformers.pipeline's device_map for a 4-bit model. Find the code below.
Note that I cannot access model.hf_device_map because the LlamaForCausalLM object has no attribute hf_device_map.
args = {
    "wbits": 4,
    "groupsize": 128,
    "model_type": "llama",
    "model_dir": GPTQ_MODEL_DIR,
}
# model = load_quantized(MODEL_NAME, args=AttributeDict(args))
model, tokenizer = load_model(MODEL_NAME, args=AttributeDict(args))
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100,
    device_map='auto'
)
local_llm = HuggingFacePipeline(pipeline=pipeline)
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(
    prompt=prompt,
    llm=local_llm
)
llm_chain.run('What is the capital of India?')
Solved the issue by adding these lines of code before creating the HF pipeline:
import accelerate
max_memory = {
    0: 6000000000,  # (bytes) according to your GPU
    'cpu': 13000000000  # (bytes) according to your RAM
}
device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)
And then passing the device_map instead of 'auto' to transformers.pipeline:
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100,
    device_map=device_map
)
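If raw byte counts are awkward to maintain, the max_memory dictionary can also be built from size strings, following the same suffix rule as the load_quantized hack earlier in the thread (append "GiB" unless the value already ends in "iB"). This is a small sketch under that assumption; `build_max_memory` is a hypothetical helper, not an accelerate function.

```python
import re
from typing import Dict, List, Union


def build_max_memory(gpu_memory: List[str], cpu_memory: str = "99GiB") -> Dict[Union[int, str], str]:
    """Map GPU index -> size string, plus a 'cpu' entry, for use as max_memory."""
    max_memory: Dict[Union[int, str], str] = {}
    for index, value in enumerate(gpu_memory):
        value = value.strip()
        # Keep values such as "8GiB" or "500MiB"; treat bare numbers as GiB.
        max_memory[index] = value if re.match(r".*ib$", value.lower()) else f"{value}GiB"
    max_memory["cpu"] = cpu_memory.strip()
    return max_memory
```

For a single 6 GB GPU and 13 GB of RAM, `build_max_memory(["6"], "13GiB")` gives a dictionary that can be passed as `max_memory` to `accelerate.infer_auto_device_map`.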
@abhinand5 Can you share your load_model code?
Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, this issue is a request to integrate LangChain with Stanford's Alpaca 7B model. There have been several comments discussing different ways to achieve this integration, including using Instruct-tuned LLaMA models, a basic LangChainJS package for Alpaca, and a Python wrapper for LLaMA. There has also been a discussion about running Llama locally using the HuggingFacePipeline provided by LangChain. Some users have even shared their examples and wrappers for integrating various components of LangChain.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain community! Let us know if you have any further questions or concerns.
for me it is fine to close it. I have continued my work with Llama-2-7B-Chat instead of Alpaca.
Thank you, @robertschulze, for your response! We appreciate your understanding and contribution to the LangChain community. We'll go ahead and close this issue now. If you have any further questions or need assistance in the future, feel free to reach out.