llama_index
HuggingFace example
Hello, is there any example of querying an index with a custom LLM or an open-source LLM from Hugging Face? I tried this solution as the LLM https://github.com/jerryjliu/gpt_index/issues/423#issuecomment-1427248788 but it does not find an answer on the paul_graham_essay; the query runs infinitely.
I've tried quite a few huggingface LLMs and haven't found that problem. Can you share a minimal code example to replicate your issue?
In the data folder I only have paul_graham_essay.txt.
from gpt_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LLMPredictor
import torch
from langchain.llms.base import LLM
from transformers import pipeline

# wrap a local FLAN-T5 text2text pipeline behind langchain's LLM interface
class FlanLLM(LLM):
    model_name = "google/flan-t5-xl"
    pipeline = pipeline("text2text-generation", model=model_name, device="cpu",
                        model_kwargs={"torch_dtype": torch.bfloat16})

    def _call(self, prompt, stop=None):
        return self.pipeline(prompt, max_length=9999)[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"

llm_predictor = LLMPredictor(llm=FlanLLM())

# HuggingFace sentence-transformer embeddings, wrapped for gpt_index
hfemb = HuggingFaceEmbeddings()
embed_model = LangchainEmbedding(hfemb)

# build the index over the files in ./data and persist it
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor)
index.save_to_disk('index.json')

new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)
After this, the next cell runs continuously:
response = new_index.query("What did the author do growing up?",
                           mode="embedding",
                           embed_model=embed_model, llm_predictor=llm_predictor)
Looks like it worked for me? I copied everything exactly 🤔 here's the response -> "Before college the two main things I worked on, outside of school, were writing and programming"
Keep in mind you are running on CPU, so things will be slower to begin with. Furthermore, you should use a prompt helper to limit the tokens going to the LLM, since FLAN really only uses the first 512 tokens. Depending on your environment, this might be causing issues (although for me it just prints a warning and runs properly).
Here's an example of the prompt helper:
from llama_index import PromptHelper

# set maximum input size (FLAN only really attends to the first 512 tokens)
max_input_size = 512
# set number of output tokens
num_output = 150
# set maximum chunk overlap
max_chunk_overlap = 20

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

index = GPTListIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
@TheShasa could you confirm whether @logan-markewich's suggestion works? Let me know!
Hey, sorry for the late response. Yes, I realized the problem was the CPU; on GPU it runs fine, thank you!
sweet! thanks for the help @logan-markewich
@TheShasa Thank you for the code example. I tried this code, but it seems that the line new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)
generates some weird outputs inside a Jupyter notebook. Can you tell me which version of llama_index or gpt_index you were working with?
@captainst Hello, the version of llama_index is 0.4.29.
I have a question here. What are we using the embed_model for?
To be more specific, in this line:
embed_model = LangchainEmbedding(hfemb)
@lucasalvarezlacasa the embedding model is needed for vector indexes.
Documents are chunked and embedded, and then your query text is also embedded and used to fetch relevant context from the index. Then the LLM generates an answer using the query and retrieved text
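As a rough sketch of that flow, reusing the llm_predictor and HuggingFace embeddings from the earlier snippets (GPTSimpleVectorIndex and the query keyword arguments follow the gpt_index 0.4.x-style API used above; exact names may differ in newer versions):

from gpt_index import SimpleDirectoryReader, LangchainEmbedding, GPTSimpleVectorIndex
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# wrap the sentence-transformer so the index can embed chunks and queries
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# each document is chunked and every chunk is embedded at build time
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents, embed_model=embed_model, llm_predictor=llm_predictor)

# the query text is embedded too; the closest chunks are retrieved and
# handed to the LLM as context for the final answer
response = index.query("What did the author do growing up?",
                       embed_model=embed_model, llm_predictor=llm_predictor)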
So, if I understood correctly, the embeddings are used to decide which nodes (or chunks) from all of my documents should be added as context to the query, so that the LLM can use them to generate the proper answer, right?
And I guess that if you don't provide this argument, then it'll try to use the OpenAI embeddings by default. Is that correct?
Thank you!!
@lucasalvarezlacasa you got it! That's how the vector index works 💪
There are other types of indexes that don't use embeddings though, they all have slightly different use cases (list, keyword, tree, etc)
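For instance, a keyword-table index routes a query to the chunks that share keywords with it and never computes embeddings. A rough sketch, again reusing llm_predictor from earlier (class name per the gpt_index 0.4.x exports; check your installed version):

from gpt_index import SimpleDirectoryReader, GPTSimpleKeywordTableIndex

documents = SimpleDirectoryReader('data').load_data()

# chunks are indexed by extracted keywords instead of vectors, so no
# embed_model is needed; the LLM still generates the final answer
index = GPTSimpleKeywordTableIndex(documents, llm_predictor=llm_predictor)
response = index.query("What did the author do growing up?", llm_predictor=llm_predictor)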
Hi, I'm getting a 504 error while running hfemb = HuggingFaceEmbeddings() :/
HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2
It looks like the model was removed from Hugging Face... and langchain needs to be updated to change their default.
Try a different model maybe:
hfemb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
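For completeness, wiring a replacement model into the earlier pipeline would look roughly like this (a sketch reusing the names from the snippets above; the model_name keyword is langchain's HuggingFaceEmbeddings argument, worth double-checking against your installed version):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from gpt_index import LangchainEmbedding, GPTListIndex

hfemb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
embed_model = LangchainEmbedding(hfemb)

# pass the wrapped model wherever embed_model was used before
new_index = GPTListIndex.load_from_disk('index.json', embed_model=embed_model)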
Also there's a bug it looks like in huggingface actually
https://twitter.com/huggingface/status/1655760648926642178
Well, it turned out that it wasn't working on Google Colab, but it works just fine on my PC.
Yea, it should be mostly fixed now. Huggingface was having outages