
Alpaca (Fine-tuned LLaMA)

Open slavakurilyak opened this issue 1 year ago • 20 comments

It would be great to see LangChain integrate with Stanford's Alpaca 7B model, a fine-tuned LLaMA (see https://github.com/hwchase17/langchain/issues/1473).

Stanford created an AI able to generate outputs largely on par with OpenAI’s text-davinci-003 and regularly better than GPT-3, all for a fraction of the computing power and price.

Stanford's Alpaca is a language model that was fine-tuned from Meta's LLaMA with 52,000 instructions generated by GPT-3.5[1]. The researchers used AI-generated instructions to train Alpaca 7B, which exhibits many GPT-3.5-like behaviors[1]. In a blind test using input from the Self-Instruct Evaluation Set, both models performed comparably[1][2]. However, Alpaca has problems common to other language models such as hallucinations, toxicity, and stereotyping[1]. The team is releasing an interactive demo, the training dataset, and the training code for research purposes[1][4].

Alpaca is still under development and has limitations that need to be addressed. The researchers have not yet fine-tuned the model to be safe and harmless[2]. They encourage users to be cautious when interacting with Alpaca and report any concerning behavior to help improve it[2].

LLaMA is a new open-source language model from Meta Research that performs as well as closed-source models. Stanford's Alpaca is a fine-tuned version of LLaMA that can respond to instructions like ChatGPT[3]. It functions more like a fancy version of autocomplete than a conversational bot[3].

The researchers are releasing their findings about an instruction-following language model dubbed Alpaca. They trained the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003 but is also surprisingly small and easy/cheap to reproduce[4].

slavakurilyak avatar Mar 19 '23 15:03 slavakurilyak

Also worth considering Instruct-tuned LLaMA models that can run on consumer hardware, such as https://github.com/tloen/alpaca-lora.

lorepieri8 avatar Mar 20 '23 14:03 lorepieri8

A basic LangChainJS package for Alpaca is here https://github.com/linonetwo/langchain-alpaca

linonetwo avatar Mar 20 '23 18:03 linonetwo

A basic LangChainJS package for Alpaca is here https://github.com/linonetwo/langchain-alpaca

Great!! But it would be nicer if there were a Python wrapper.

iRanadheer avatar Mar 21 '23 01:03 iRanadheer

I hacked together a working Python LLM class based on Serge, which is MIT licensed: https://gist.github.com/lukestanley/6517823485f88a40a09979c1a19561ce I am new to LangChain and very new to making LLM classes, and streaming is complicated, but this works for each streaming situation I tried. Someone more familiar could probably do a much better job sooner than me, but since there is a lot of interest in breaking free of the cloud and finally being able to work locally, I figured I'd put this out now before polishing it nicely. @iRanadheer @slavakurilyak
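
For anyone wondering what the general shape of such a class is, here is a stripped-down sketch (not the gist itself; the class name, paths and flags are placeholders to adapt to your setup, and it skips streaming entirely):

from typing import List, Optional
import subprocess

from langchain.llms.base import LLM


class LocalAlpaca(LLM):
    """Minimal LLM wrapper that shells out to a local llama.cpp binary."""

    binary_path: str = "./llama.cpp/main"  # placeholder: compiled llama.cpp executable
    model_path: str = "./models/ggml-alpaca-7b-q4.bin"  # placeholder: quantized model file
    n_predict: int = 256  # max tokens to generate

    @property
    def _llm_type(self) -> str:
        return "local_alpaca"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # One subprocess per call; the real gist streams the output instead of waiting.
        result = subprocess.run(
            [self.binary_path, "-m", self.model_path, "-n", str(self.n_predict), "-p", prompt],
            capture_output=True,
            text=True,
        )
        return result.stdout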

lukestanley avatar Mar 23 '23 20:03 lukestanley

I've written an app to run llama-based models using Docker here: https://github.com/1b5d/llm-api (thanks to llama-cpp-python and llama-cpp). You can specify the model in the config file, and the app will download it automatically and expose it via an API. Additionally, you can use https://github.com/1b5d/langchain-llm-api in order to use this exposed API with Langchain; it also supports streaming. My goal is to easily run different models locally (and also remotely), switch between them easily, and then use these APIs to develop with Langchain.

To run Alpaca:

  • First configure the API and run it with docker compose up, as described here: https://github.com/1b5d/llm-api
  • Then you can simply make requests to it
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}'
  • Or you can play around with it using LangChain via the lib:
pip install langchain-llm-api
from langchain_llm_api import LLMAPI

llm = LLMAPI()
llm("What is the capital of France?")
...
\nThe capital of France is Paris.
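
Since the wrapper implements LangChain's standard LLM interface, it should also drop straight into a chain, e.g. this sketch (the prompt wording is just an example):

from langchain import LLMChain, PromptTemplate
from langchain_llm_api import LLMAPI

llm = LLMAPI()  # talks to the llm-api server (localhost:8000 in the curl example above)

prompt = PromptTemplate(
    template="Answer concisely: {question}",
    input_variables=["question"],
)
chain = LLMChain(prompt=prompt, llm=llm)
print(chain.run("What is the capital of France?"))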

1b5d avatar Apr 08 '23 22:04 1b5d

This is interesting, thanks for sharing it @1b5d I blinked and now there is a Python wrapper for llama.cpp! To best help most people here, it would be ideal to improve LangChain to support Alpaca / Llama without adding a dependency on Docker to LangChain or running a web server.

llama-cpp-python seems like a good wrapper around llama.cpp. It could be wrapped with a class that implements the LLM class with streaming methods, like my gist does, but without the class itself running the command directly, so that there is no Docker or web server needed and fewer dependencies to run.
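
As a rough illustration of that idea, a wrapper around llama-cpp-python might look something like the sketch below (the class name and model path are placeholders, and the streaming-to-stdout part is just the crudest thing that works):

from typing import Dict, List, Optional

from langchain.llms.base import LLM
from llama_cpp import Llama

# Cache loaded models at module level so each call doesn't reload the weights.
_LLAMA_CACHE: Dict[str, Llama] = {}


def _get_llama(model_path: str) -> Llama:
    if model_path not in _LLAMA_CACHE:
        _LLAMA_CACHE[model_path] = Llama(model_path=model_path)
    return _LLAMA_CACHE[model_path]


class LlamaCppSketch(LLM):
    """Sketch: call llama-cpp-python directly, no Docker or web server."""

    model_path: str = "./models/ggml-alpaca-7b-q4.bin"  # placeholder
    max_tokens: int = 256

    @property
    def _llm_type(self) -> str:
        return "llama_cpp_sketch"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        client = _get_llama(self.model_path)
        text = ""
        # stream=True yields OpenAI-style completion chunks one token at a time.
        for chunk in client(prompt, max_tokens=self.max_tokens, stop=stop or [], stream=True):
            token = chunk["choices"][0]["text"]
            print(token, end="", flush=True)  # crude streaming to stdout
            text += token
        return text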

I've seen the good work integrating my gist done on the alpaca branch by @robertschulze and @hwchase17: https://github.com/hwchase17/langchain/compare/harrison/alpaca

The paths look tricky; with remote models we don't have to think about file paths very often, but we often have to think about API keys. The OpenAI module checks for a well-known environment variable first. Are there places where this project specifies file paths in a nice way, with user overrides? The code needs a path to the LLM model and a path to the llama.cpp executable. Ideally we wouldn't have to think about paths at all, but we need to know where the model is until we're allowed to download it ourselves (which is tricky license-wise). We could probably eliminate needing to know where llama.cpp is if we built it as part of the project, but that might be more tricky (or maybe not, if llama-cpp-python does that). I suppose the default thing to do would be to have it added to the PATH shell variable.

Actually, looking at llama-cpp-python, it introduces an env var called "MODEL", which seems a bit too generic but may be okay as a first pass. It would be good if it were prebuilt for my platform (x86), but I think this is better than my gist. It built correctly on my system first time (Ubuntu with build-essential etc. already installed). One thing that is much better than my gist is that it does not need to reload the model for each call!
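
For illustration, the usual pattern is an environment variable with a fallback path, something like this (the variable name here is hypothetical, not something LangChain or llama-cpp-python defines):

import os

# Prefer an explicit env var; otherwise fall back to a conventional local location.
model_path = os.environ.get(
    "LLAMA_MODEL_PATH", os.path.expanduser("~/models/ggml-alpaca-7b-q4.bin")
)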

lukestanley avatar Apr 11 '23 08:04 lukestanley

I did some work on it here! Not sure it's PR worthy yet.

lukestanley avatar Apr 11 '23 18:04 lukestanley

There are use cases for both local and remote model inference, I believe. I want to run my models on a remote server, while others might have enough hardware power on their local machines.

1b5d avatar Apr 11 '23 21:04 1b5d

I realised there is a LlamaCpp class already integrated. Better put something in my eyes so I don't need to blink! I have a draft PR that adds streaming here: https://github.com/hwchase17/langchain/pull/2775

lukestanley avatar Apr 12 '23 16:04 lukestanley

There's Alpaca on Hugging Face; I think you can do something like this: llm = HuggingFaceHub(repo_id="chavinlo/alpaca-native", model_kwargs={"temperature":0, "max_length":64})

annabechang avatar Apr 14 '23 17:04 annabechang

There's Alpaca on Hugging Face; I think you can do something like this: llm = HuggingFaceHub(repo_id="chavinlo/alpaca-native", model_kwargs={"temperature":0, "max_length":64})

But this would not run the model locally.

robertschulze avatar Apr 14 '23 19:04 robertschulze

You can use the HuggingFacePipeline class, something like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

model_path = 'your_local_model_path'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
alpaca_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
local_llm = HuggingFacePipeline(pipeline=alpaca_pipe)
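
and then call it like any other LangChain LLM, for example (Alpaca-style models generally expect the instruction/response prompt format):

alpaca_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
)
print(local_llm(alpaca_prompt))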

iRanadheer avatar Apr 14 '23 19:04 iRanadheer

:clap: Thank you @iRanadheer - Information on how to integrate local models with langchain is sparse. This little tidbit is helpful. If you know of any repos that provide toy examples of such integration, please post a link!

tensiondriven avatar Apr 17 '23 04:04 tensiondriven

The LangChain repo has documentation on how to use LlamaCpp within LangChain with an LLM class like the OpenAI one. The LlamaCpp LLM class is probably faster and easier to set up for most people, and is documented here: https://python.langchain.com/en/latest/modules/models/llms/integrations/llamacpp.html and here: https://python.langchain.com/en/latest/ecosystem/llamacpp.html Like the underlying code, it can generate with Alpaca models too. I am working on adding streaming support to it here. This is all happening fast and it might be hard to find; I think this issue might deserve closing soon since it works. I'll probably make a gif of it working with a funny example. Any ideas on other ways it could be more visible? @tensiondriven
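
For reference, usage follows the same pattern as the other LLM integrations; roughly (the model path is a placeholder for your local GGML file):

from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path="./models/ggml-alpaca-7b-q4_0.bin")  # placeholder path
print(llm("Q: What is the capital of France? A:"))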

lukestanley avatar Apr 17 '23 13:04 lukestanley

@lukestanley Thanks for asking!

I saw the LlamaCpp example, but my understanding is that the -Cpp interface/classes are very different from huggingface/pt/safetensors.

I agree that Cpp is more accessible for more people, but it doesn't address what I think is a pretty common use case: Using HF-format models locally for GPU inference with CUDA/Triton.

When I read the docs, I saw the bit about "integrating with runway, and also local" (was it called runway? I'm not sure, I'm on my phone at the moment)

What would have made it more clear is if there were a specific page for "Local LLM" that gives an example which takes a path to a safetensors file and includes setup of the required dependencies (pytorch, AutoModelForCausalLM, etc.)

Bonus points if it includes an example for loading with GPTQ 4-bit, as a lot of people need to run 4-bit locally due to limited VRAM.

Extra bonus points if it includes an example of loading a safetensors model with a Lora.

I realize these implementation details are not specifically a concern of langchain, however given how fiddly the python deps are, I still think it would be useful to show a full example of standing up a GPU-based local model using a safetensors model as its own page in the docs under LLM's.

As a side-note, there are a few bugs in the chat-demo project, which there are fixes for in the GitHub Issues, but those small bugs do add significant friction for people who want to get langchain up and running (with either OpenAI or a different/local LLM).

For one additional bit of context, I think a lot of people are coming from using text-generation-webui, which, while being pretty poorly organized code, at least gives people exposure to the python dependency stack and some of the parameters used in loading language models. For me, getting these dependencies and parameters right was difficult, and the langchain docs brush over this. I found the above comment very useful since it showed how to load an LLM in python and then pass it to a Langchain LLM adapter (I know it's trivial to do, but I still found it to be the missing link for me).

For a little more context on where I'm coming from: 10-year ruby/JavaScript/elixir dev, professional and hobbyist, with some python experience, but python is not my first language.

TL;DR: Need a local HF safetensors/pt example with all the necessary imports, alongside all the hosted LLM examples.

Does this help?

(Also, very excited about streaming support for local LLM's! Will be happy to test this if that's helpful)

tensiondriven avatar Apr 17 '23 14:04 tensiondriven

For those of you who are interested in using transformer with langchain's HuggingFacePipeline, I was able to run Llama locally using the HuggingFacePipeline provided as part of langchain.llms. It works pretty well using the 8-bit models, and I have an example use here: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/src/my_langchain_models.py#L135-L214

If you would like to also use LoRA (with Alpaca fine-tuning), it is possible, but you will get a warning message saying PeftModelForCausalLM is not supported for the transformers pipeline; as far as I can tell, everything works fine. This is how I implemented the Peft model with LoRA: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/src/my_langchain_models.py#L65-L99
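
In outline, that looks roughly like this (a sketch with placeholder paths and settings; see the linked files for the working version):

from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline
from peft import PeftModel
from langchain.llms import HuggingFacePipeline

base_model_path = "path/to/llama-7b-hf"    # placeholder
lora_path = "path/to/alpaca-lora-adapter"  # placeholder

tokenizer = LlamaTokenizer.from_pretrained(base_model_path)
# load_in_8bit needs bitsandbytes; drop it to load in full precision instead.
base_model = LlamaForCausalLM.from_pretrained(base_model_path, load_in_8bit=True, device_map="auto")
# Attach the Alpaca LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base_model, lora_path)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128)
local_llm = HuggingFacePipeline(pipeline=pipe)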

Here is a more readable example loading both base Llama and the alpaca lora: https://github.com/bowenwen/llama-at-home/blob/4d550a738136b7e9a23dc1f2b9bbf95367f504e2/example.py

I also have some wrappers I wrote to help with integrating various components of langchain like docs with vectorstore and agent executors. Feel free to browse around my repo and grab anything you'd like.

bowenwen avatar Apr 17 '23 17:04 bowenwen

Bonus points if it includes an example for loading with GPTQ 4-bit, as a lot of people need to run 4-bit locally due to limited VRAM.

For that you need to do a bit more:

1 - First git clone GPTQ (+ cuda_setup.py): https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda
2 - Make a crude hack inspired by text-generation-webui:

import re
import sys
import pathlib
from pathlib import Path

import accelerate
import torch
import transformers
from transformers import AutoConfig, AutoModelForCausalLM

#obviously depends on where your file is
path = pathlib.Path(__file__).parent.parent
dirGPTQ = path.joinpath("text-generation-webui/repositories/GPTQ-for-LLaMa/")
sys.path.insert(0, str(dirGPTQ))

from quant import make_quant
# llama_inference_offload also lives in the GPTQ-for-LLaMa repo; needed for the pre_layer offload path below
import llama_inference_offload

#from modelutils import find_layers
import torch.nn as nn


DEV = torch.device('cuda:0')


def find_layers(module, layers=[nn.Conv2d, nn.Linear], name=''):
    if type(module) in layers:
        return {name: module}
    res = {}
    for name1, child in module.named_children():
        res.update(find_layers(
            child, layers=layers, name=name + '.' + name1 if name != '' else name1
        ))
    return res

def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, exclude_layers=['lm_head'], kernel_switch_threshold=128):
    config = AutoConfig.from_pretrained(model)
    def noop(*args, **kwargs):
        pass
    torch.nn.init.kaiming_uniform_ = noop 
    torch.nn.init.uniform_ = noop 
    torch.nn.init.normal_ = noop 

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    torch.set_default_dtype(torch.half)
    model = AutoModelForCausalLM.from_config(config)
    torch.set_default_dtype(torch.float)
    model = model.eval()
    layers = find_layers(model)
    for name in exclude_layers:
        if name in layers:
            del layers[name]
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)

    del layers
    
    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint))
    else:
        model.load_state_dict(torch.load(checkpoint))
    model.seqlen = 2048
    print('Done.')

    return model

def load_quantized(model_name, args):
    if not "model_type" in args:
        # Try to determine model type from model name
        name = model_name.lower()
        if any((k in name for k in ['llama', 'alpaca'])):
            model_type = 'llama'
        elif any((k in name for k in ['opt-', 'galactica'])):
            model_type = 'opt'
        elif any((k in name for k in ['gpt-j', 'pygmalion-6b'])):
            model_type = 'gptj'
        else:
            print("Can't determine model type from model name. Please specify it manually using --model_type "
                  "argument")
            exit()
    else:
        model_type = args["model_type"].lower()

    if model_type == 'llama' and "pre_layer" in args:
        load_quant = llama_inference_offload.load_quant
    elif model_type in ('llama', 'opt', 'gptj'):
        load_quant = _load_quant
    else:
        print("Unknown pre-quantized model type specified. Only 'llama', 'opt' and 'gptj' are supported")
        exit()

    #Now we are going to try to locate the quantized model file.
    path_to_model = Path(f'{args["model-dir"]}{model_name}')
    found_pts = list(path_to_model.glob("*.pt"))
    found_safetensors = list(path_to_model.glob("*.safetensors"))
    pt_path = None

    if len(found_pts) == 1:
        pt_path = found_pts[0]
    elif len(found_safetensors) == 1:
        pt_path = found_safetensors[0]
    else:
        if path_to_model.name.lower().startswith('llama-7b'):
            pt_model = f'llama-7b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-13b'):
            pt_model = f'llama-13b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-30b'):
            pt_model = f'llama-30b-{args["wbits"]}bit'
        elif path_to_model.name.lower().startswith('llama-65b'):
            pt_model = f'llama-65b-{args["wbits"]}bit'
        else:
            pt_model = f'{model_name}-{args["wbits"]}bit'

        #Try to find the .safetensors or .pt both in models/ and in the subfolder
        for path in [Path(p+ext) for ext in ['.safetensors', '.pt'] for p in [f"models/{pt_model}", f"{path_to_model}/{pt_model}"]]:
            if path.exists():
                print(f"Found {path}")
                pt_path = path
                break

    if not pt_path:
        print("Could not find the quantized model in .pt or .safetensors format, exiting...")
        exit()

    #qwopqwop200's offload
    if "pre_layer" in args:
        model = load_quant(str(path_to_model), str(pt_path), args["wbits"], args["groupsize"], args["pre_layer"])
    else:
        threshold = False if model_type == 'gptj' else 128
        model = load_quant(str(path_to_model), str(pt_path), args["wbits"], args["groupsize"], kernel_switch_threshold=threshold)

        #accelerate offload (doesn't work properly)
        if "gpu_memory" in args:
            memory_map = list(map(lambda x : x.strip(), args["gpu_memory"]))
            max_cpu_memory = args["cpu_memory"].strip() if "cpu_memory" in args else '99GiB'
            max_memory = {}
            for i in range(len(memory_map)):
                max_memory[i] = f'{memory_map[i]}GiB' if not re.match('.*ib$', memory_map[i].lower()) else memory_map[i]
            max_memory['cpu'] = max_cpu_memory

            device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
            print("Using the following device map for the 4-bit model:", device_map)
            #https://huggingface.co/docs/accelerate/package_reference/big_modeling#accelerate.dispatch_model
            model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)

        #No offload
        elif not "cpu" in args:
            model = model.to(torch.device('cuda:0'))

    return model

3 - Instead of model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True), load the local model with:

arg = {"wbits": 4, "groupsize": 128}
model = load_quantized(model_id, arg)

msenneret avatar Apr 18 '23 14:04 msenneret

@msenneret

Thanks for your answer. I am able to load the model now. The problem is when I create a pipeline and try to make inference using langchain it is giving this error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Not sure how to configure transformers.pipeline's device_map for a 4bit model. Find the code below

Note: I cannot access model.hf_device_map because the LlamaForCausalLM object has no attribute hf_device_map.

# (imports added for completeness; load_model and AttributeDict are custom helpers not shown here)
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

args = {
    "wbits": 4,
    "groupsize": 128,
    "model_type": "llama",
    "model_dir": GPTQ_MODEL_DIR,
}

# model = load_quantized(MODEL_NAME, args=AttributeDict(args))

model, tokenizer = load_model(MODEL_NAME, args=AttributeDict(args))

pipeline = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100,
    device_map='auto'
)

local_llm = HuggingFacePipeline(pipeline=pipeline)

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(
    prompt=prompt, 
    llm=local_llm
)

llm_chain.run('What is the capital of India?')

abhinand5 avatar May 09 '23 12:05 abhinand5

Solved the issue by adding these lines of code before creating the HF pipeline

import accelerate

max_memory = {
    0: 6000000000, # (bytes) according to your GPU
    'cpu': 13000000000 # (bytes) according to your RAM
}

device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)

And then passing the device_map instead of 'auto' to transformers.pipeline

pipeline = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100,
    device_map=device_map
)

abhinand5 avatar May 09 '23 15:05 abhinand5

@abhinand5 Can you share your load_model code?

ernestp avatar May 19 '23 13:05 ernestp

Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, this issue is a request to integrate LangChain with Stanford's Alpaca 7B model. There have been several comments discussing different ways to achieve this integration, including using Instruct-tuned LLaMA models, a basic LangChainJS package for Alpaca, and a Python wrapper for LLaMA. There has also been a discussion about running Llama locally using the HuggingFacePipeline provided by LangChain. Some users have even shared their examples and wrappers for integrating various components of LangChain.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain community! Let us know if you have any further questions or concerns.

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]

for me it is fine to close it. I have continued my work with Llama-2-7B-Chat instead of Alpaca.

robertschulze avatar Sep 22 '23 16:09 robertschulze

Thank you, @robertschulze, for your response! We appreciate your understanding and contribution to the LangChain community. We'll go ahead and close this issue now. If you have any further questions or need assistance in the future, feel free to reach out.

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]