llama_index
HuggingFace example
Hello, is there any example of querying an index with a custom LLM or an open-source LLM from Hugging Face? I tried this solution as the LLM https://github.com/jerryjliu/gpt_index/issues/423#issuecomment-1427248788 but it does not find an answer on the paul_graham_essay; the query runs infinitely.
I've tried quite a few huggingface LLMs and haven't found that problem. Can you share a minimal code example to replicate your issue?
In the data folder I only have paul_graham_essay.txt.
from gpt_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LLMPredictor
import torch
from langchain.llms.base import LLM
from transformers import pipeline

# wrap a local FLAN-T5 text2text pipeline behind langchain's LLM interface
class FlanLLM(LLM):
    model_name = "google/flan-t5-xl"
    pipeline = pipeline("text2text-generation", model=model_name, device="cpu",
                        model_kwargs={"torch_dtype": torch.bfloat16})

    def _call(self, prompt, stop=None):
        return self.pipeline(prompt, max_length=9999)[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"

llm_predictor = LLMPredictor(llm=FlanLLM())

# HuggingFace sentence-transformer embeddings, wrapped for gpt_index
hfemb = HuggingFaceEmbeddings()
embed_model = LangchainEmbedding(hfemb)

# build the index over the files in ./data and persist it
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor)
index.save_to_disk('index.json')

new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)
After this, the next cell runs continuously:
response = new_index.query("What did the author do growing up?",
                           mode="embedding",
                           embed_model=embed_model, llm_predictor=llm_predictor)
Looks like it worked for me? I copied everything exactly 🤔 here's the response -> "Before college the two main things I worked on, outside of school, were writing and programming"
Keep in mind you are running on CPU, so things will be slower to begin with. Furthermore, you should use a prompt helper to limit the tokens going to the LLM, since FLAN really only uses the first 512 tokens. Depending on your environment, this might be causing issues (although for me it just prints a warning and runs properly).
Here's an example of the prompt helper:
from llama_index import PromptHelper

# set maximum input size (FLAN only really attends to the first 512 tokens)
max_input_size = 512
# set number of output tokens
num_output = 150
# set maximum chunk overlap
max_chunk_overlap = 20

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

index = GPTListIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
@TheShasa could you confirm whether @logan-markewich's suggestion works? Let me know!
Hey, sorry for the late response. Yes, I realized the problem was the CPU; on GPU it runs fine, thank you!
sweet! thanks for the help @logan-markewich
@TheShasa Thank you for the code example. I tried this code, but it seems that the line new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)
generates some weird outputs inside a Jupyter notebook. Can you tell me which version of llama_index or gpt_index you were working with?
@captainst Hello, the version of llama_index is 0.4.29.
I have a question here. What are we using the embed_model for?
To be more specific, in this line:
embed_model = LangchainEmbedding(hfemb)
@lucasalvarezlacasa the embedding model is needed for vector indexes.
Documents are chunked and embedded, and then your query text is also embedded and used to fetch relevant context from the index. Then the LLM generates an answer using the query and retrieved text
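As a rough sketch of that flow, reusing the llm_predictor and HuggingFace embeddings from the earlier snippets (GPTSimpleVectorIndex and the query keyword arguments follow the gpt_index 0.4.x-style API used above; exact names may differ in newer versions):

from gpt_index import SimpleDirectoryReader, LangchainEmbedding, GPTSimpleVectorIndex
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# wrap the sentence-transformer so the index can embed chunks and queries
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# each document is chunked and every chunk is embedded at build time
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor)

# the query text is embedded too; the closest chunks are retrieved and
# handed to the LLM as context for the final answer
response = index.query("What did the author do growing up?",
                       embed_model=embed_model, llm_predictor=llm_predictor)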
So, if I understood correctly, the embeddings are used to decide which nodes (or chunks) from all of my documents should be added as context to the query, so that the LLM can use them to generate the proper answer, right?
And I guess that if you don't provide this argument, then it'll try to use the OpenAI embeddings by default. Is that correct?
Thank you!!
@lucasalvarezlacasa you got it! That's how the vector index works 💪
There are other types of indexes that don't use embeddings though, they all have slightly different use cases (list, keyword, tree, etc)
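For instance, a keyword-table index routes a query to the chunks that share keywords with it and never computes embeddings. A rough sketch, again reusing llm_predictor from earlier (class name per the gpt_index 0.4.x exports; check your installed version):

from gpt_index import SimpleDirectoryReader, GPTSimpleKeywordTableIndex

documents = SimpleDirectoryReader('data').load_data()

# chunks are indexed by extracted keywords instead of vectors, so no
# embed_model is needed; the LLM still generates the final answer
index = GPTSimpleKeywordTableIndex(documents, llm_predictor=llm_predictor)
response = index.query("What did the author do growing up?", llm_predictor=llm_predictor)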
Hi, I'm getting a 504 error while running hfemb = HuggingFaceEmbeddings() :/
HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2
It looks like the model was removed from Hugging Face... and langchain needs to be updated to change their default.
Try a different model maybe:
hfemb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
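For completeness, wiring a replacement model into the earlier pipeline would look roughly like this (a sketch reusing the names from the snippets above; the model_name keyword is langchain's HuggingFaceEmbeddings argument, worth double-checking against your installed version):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from gpt_index import LangchainEmbedding, GPTListIndex

hfemb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
embed_model = LangchainEmbedding(hfemb)

# pass the wrapped model wherever embed_model was used before
new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)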
Also there's a bug it looks like in huggingface actually
https://twitter.com/huggingface/status/1655760648926642178
Well, it turned out that it wasn't working on Google Colab, but it works just fine on my PC.
Yea, it should be mostly fixed now. Huggingface was having outages