langchain
Timeout when running Hugging Face LLMs in a ConversationalRetrievalChain
I am getting the following error when trying to query a ConversationalRetrievalChain using HuggingFace:
ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-3b time out
I can query the model through a simple LLMChain, but querying over documents seems to be the issue. Any ideas?
This is the code:
from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain
llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b", model_kwargs={"temperature": 0, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(llm,
    vectordb.as_retriever(search_kwargs={"k": 3}), return_source_documents=True)
chat_history = []
query = "How do I load users from a thumb drive."
result = qa2({"question": query, "chat_history": chat_history})
vectordb comes from Chroma.from_documents (using OpenAI embeddings on a custom PDF).
Hi! I have this problem too, and I was able to solve it by increasing the max_length on the model: llm = HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 0, "max_length": 512})
With chunk_size=5000 in VectorstoreIndexCreator() the model can run.
This length limit differs per LLM, so it's important to mind these sizes. I recommend using other LLMs, or adjusting the chunk size, to run 'stabilityai/stablelm-tuned-alpha-3b'.
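The sizing advice above can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: the helper names and the ~4-characters-per-token heuristic are illustrative assumptions, not a real tokenizer.

```python
# Rough heuristic: ~4 characters per token for English text.
def estimated_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_input_limit(chunks, question, max_input_tokens):
    """Return True if a 'stuff'-style prompt built from these chunks
    plus the question is likely to fit the model's input limit."""
    total = estimated_tokens(question) + sum(estimated_tokens(c) for c in chunks)
    return total <= max_input_tokens

# Three 5000-character chunks (k=3) blow well past a 512-token limit...
big_chunks = ["x" * 5000] * 3
print(fits_input_limit(big_chunks, "How do I load users?", 512))    # False
# ...while three 500-character chunks fit comfortably.
small_chunks = ["x" * 500] * 3
print(fits_input_limit(small_chunks, "How do I load users?", 512))  # True
```

A check like this makes it easier to see why a chunk size that works for one model times out or errors on another with a smaller context window.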
Hi,
I actually have the same issue.
I changed the chunk size and also the LLM to
HuggingFaceHub(repo_id="togethercomputer/GPT-NeoXT-Chat-Base-20B", model_kwargs={"temperature": 0, "max_length": 500})
but I get the same Timeout or Service Unavailable error.
Thanks for any help.
I don't think this is related to ConversationalRetrievalChain, as I'm seeing the same issue when directly prompting a HuggingFaceHub object. For instance, when I do:
repo_id = "google/flan-t5-xl"
llm_hf = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0, "max_length": 128})
text = "Hello"  # any prompt reproduces it
print(llm_hf(text))
It hangs for a while then returns the timeout error.
I'll ping HuggingFace's API directly to check whether the issue comes from HuggingFace or HuggingFaceHub.
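Pinging the Inference API directly, without going through langchain, can be sketched like this. The payload shape ({"inputs": ...}) follows the public Inference API convention; the helper names and the "Hello" prompt are mine.

```python
import json
import urllib.request

def inference_api_url(repo_id):
    """Public Inference API endpoint for a given model repo."""
    return f"https://api-inference.huggingface.co/models/{repo_id}"

def ping_inference_api(repo_id, token=None, timeout=30.0):
    """POST a tiny prompt straight to the Inference API; returns (status, body).

    A plain HTTP 200 here but a timeout through langchain would point at the
    HuggingFaceHub wrapper; a hang here would point at the API itself.
    """
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    payload = json.dumps({"inputs": "Hello"}).encode()
    req = urllib.request.Request(inference_api_url(repo_id), data=payload, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode()
```

Running this against the failing repo_id (with and without a token) separates a HuggingFace-side problem from a HuggingFaceHub client problem.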
This is happening to me too... I was able to get stabilityai/stablelm-base-alpha-3b to return something, but everything else is stuck.
I just keep getting ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-7b time out
Here is my basic code to test that simply times out:
import os
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain
from utilities import load_api_key
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
repo_id = "stabilityai/stablelm-tuned-alpha-7b"
# repo_id = "stabilityai/stablelm-tuned-alpha-3b"
# repo_id = "stabilityai/stablelm-base-alpha-3b"
# repo_id = "google/flan-t5-xl"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0, "max_length": 64},
huggingfacehub_api_token=load_api_key("HUGGINGFACEHUB_API_TOKEN"))
template = """Question: {question}
Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Who won the FIFA World Cup in the year 1994? "
print(llm_chain.run(question))
This is happening for me as well. This Colab notebook for question answering worked for me a few days ago, but now when I enter a new query in cell 14 and run it, it will result in a timeout.
Is there any solution yet?
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=index.vectorstore.as_retriever(search_kwargs={"k": 1}), input_key="question")
ValueError: Error raised by inference API: Model google/flan-t5-xl time out
Hi friends, has anyone had any luck fixing the "time out" issue with google/flan-t5-xl? I need help if you've sorted it out. Thanks
+1
Use this one: google/flan-t5-xxl, with a temperature greater than 0 (such as 0.5, 0.7, or 1).
I am having the same issue. Thanks @phdykd for the workaround, but do we know the root cause of this issue? Thanks
I think I am having the same problem, but I'm not getting a timeout message. It's just hanging for a long time. I'm using Replit.
flan-small is working, but as expected the output is not great. Flan-XXL is working, but the max_length of 512 makes it difficult to pass a few-shot template, and without that the answers aren't that good.
I observed the call getting stuck in ssl.py. Will update if anything is found. Any help on this is also appreciated.
Flan-XXL is not working for me; the model cannot be found. Still hitting the time-out issue... any solution?
Same here; below is the code I copied from the documentation:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain
repo_id = "google/flan-t5-xl"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0, "max_length": 64})
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
# question = "2+2 = ?"
question = "Who won the FIFA World Cup in the year 1994? "
print(llm_chain.run(question))
It works smoothly. But if I change the question to "Who won the FIFA World Cup in the year 1998? ", the timeout error happens.
Also receiving timeouts for any Flan model larger than flan-t5-large. Base and small are loading correctly.
Yes. I got the same error:
ValueError: Error raised by inference API: Model google/flan-t5-xl time out
Tried other LLMs as well, same error.
I also had similar errors with various Flan LLMs. Adding max_length sorted out the timeout errors.
llm_hf = HuggingFaceHub(
repo_id="google/flan-t5-xxl",
model_kwargs={"temperature":0.9,"max_length":64 }
)
For some reason, max_length does not make a difference with the t5-xl time-out. Anyway, t5-xxl works for now.
I have the same issue for a variety of models with various temperatures & max_length. I wonder if it has to do with HuggingFace limiting API calls from a single IP?
@philmui I don't think this error is intentional rate limiting or anything like that. Obviously, serving LLMs at scale can bring resource problems. Better to open the same issue in the HF repo as well so that they are aware.
I got the same problem with "google/flan-t5-xl" :/
Getting the same issue for StableLM, FLAN, or basically any model.
I used the HuggingFacePipeline to run inference locally, and that works as intended, but I just cannot get it to run from the HF hub.
Any idea how to solve this?
If I get a dedicated endpoint, can it be used with LangChain? If so, how? There does not seem to be a URL parameter I can set, only repo_id or the model name.
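For the dedicated-endpoint question above: langchain versions from this period ship a HuggingFaceEndpoint LLM that takes an endpoint_url instead of a repo_id. A sketch, with a placeholder URL, an illustrative helper name, and a lazy import so the snippet can be read (and its test run) without langchain installed:

```python
def make_endpoint_llm(endpoint_url, api_token):
    """Build a LangChain LLM backed by a dedicated Inference Endpoint.

    endpoint_url is the URL shown in the Inference Endpoints dashboard,
    e.g. "https://<your-endpoint>.endpoints.huggingface.cloud".
    """
    # Imported lazily; requires `pip install langchain`.
    from langchain.llms import HuggingFaceEndpoint

    return HuggingFaceEndpoint(
        endpoint_url=endpoint_url,
        task="text-generation",
        huggingfacehub_api_token=api_token,
        model_kwargs={"temperature": 0.1, "max_length": 64},
    )
```

Because a dedicated endpoint is provisioned for you alone, it sidesteps the on-demand loading that seems to cause these timeouts on the free Inference API.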
google/flan-t5-xxl is working for me, although I also had to set temperature to a minimum of 0.1 to get it to work.
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.1, "max_length": 64})
You need to upgrade your hugging face account to Pro version to host the large model for inference.
"google/flan-t5-base" works for the free account.
@yitao416 wins. All problems I had with these LLMs vanished with this. But to be clear, the answers it gives you are super short. But it WORKS. Bad on HF's part for not making the Pro requirement clearer.
To avoid truncated answers, you have to change the call method in the relevant class in langchain.
ValueError: Error raised by inference API: Input validation error: inputs tokens + max_new_tokens must be <= 1512. Given: 3476 inputs tokens and 20 max_new_tokens
This happens when I upload a PDF file of 1 MB. How do I resolve this issue?
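The error above means the retrieved context plus max_new_tokens exceeds the model's 1512-token window, so the usual fixes are retrieving fewer or smaller chunks (e.g. search_kwargs={"k": 1}) or trimming the context before it is stuffed into the prompt. A minimal sketch of the trimming idea, using a crude 4-characters-per-token estimate rather than a real tokenizer (both the helper name and the heuristic are assumptions):

```python
def trim_context(chunks, max_input_tokens, chars_per_token=4):
    """Keep whole chunks, in order, until the estimated token budget is spent."""
    budget_chars = max_input_tokens * chars_per_token
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

# A 1512-token window minus 20 max_new_tokens leaves ~1492 tokens for input.
chunks = ["a" * 3000, "b" * 3000, "c" * 3000]
kept = trim_context(chunks, max_input_tokens=1492)
print(len(kept))  # 1 -- only the first chunk fits the budget
```

In practice you would use the model's actual tokenizer for the count, but the principle is the same: the stuffed prompt, not the PDF size, is what has to fit the window.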
The Salesforce XGen model hangs for me, as do all Stability AI LLMs. Perhaps the limits should be printed?
Hi everyone, maintainer of huggingface_hub's library here :wave:
Langchain's current implementation relies on InferenceAPI. This is a free service from Huggingface to help folks quickly test and prototype things using ML models hosted on the Hub. It is not specific to LLMs and can run a large variety of models from transformers, diffusers, sentence-transformers, and more. While this service is free for everyone, different rate limits apply to unregistered, registered, and PRO users. However, the execution time is supposed to be the same no matter the plan (i.e. either you get an HTTP 429 rate-limit error, or the request is processed correctly).
That being said, InferenceAPI is NOT meant to be a production-ready product. As said before, its primary goal is to let users quickly test models. Only models of a "small-enough" size can be hosted on demand in the InferenceAPI infrastructure. In the context of LLMs, this constraint makes most models unavailable. We still support a manually curated list of first-class open-source LLMs (bigcode/santacoder, OpenAssistant/oasst-sft-1-pythia-12b, tiiuae/falcon-7b-instruct, and timdettmers/guanaco-33b-merged, to name a few). The list of running LLMs can be found here. In the backend, LLMs run on our text-generation-inference framework.
If one needs a production-ready environment, they should use our Inference Endpoints solution. This is a paid service to run models (including LLMs) on dedicated, auto-scaling infrastructure managed by Huggingface. Any model from the Hub can be loaded.
Now that the difference between InferenceAPI and Inference Endpoints is clearer, let's dig into this issue more specifically. Langchain depends on the InferenceAPI client from huggingface_hub. This client will soon be deprecated in favor of InferenceClient. This newer client can handle InferenceAPI, Inference Endpoints, and even AWS SageMaker solutions. Its implementation is meant to be easier for end users, especially with automatic parsing of input/output for each task.
What I would suggest is rewriting the HuggingFaceHub(LLM) class in langchain to use the new text-generation task. A migration guide can be found here. The advantages for the end user would be better handling of timeouts, better handling of inputs, and the possibility to switch from InferenceAPI to Inference Endpoints (or any URL-based solution) depending on the use case (prototyping vs. production-ready).
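The suggested migration can be sketched with huggingface_hub's InferenceClient directly. The model choice (tiiuae/falcon-7b-instruct, one of the curated LLMs mentioned above), the helper name, and the defaults are illustrative; the import is lazy so the sketch can be read without huggingface_hub installed.

```python
def generate(prompt, model="tiiue/falcon-7b-instruct".replace("tiiue", "tiiuae"),
             token=None, timeout=120.0):
    """Generate text via the newer InferenceClient, with an explicit timeout.

    `model` can be a Hub repo_id (Inference API) or a full Inference
    Endpoint URL -- the same client handles both.
    """
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import InferenceClient

    client = InferenceClient(model=model, token=token, timeout=timeout)
    # text_generation targets the text-generation-inference backend and
    # raises a clear error instead of silently hanging.
    return client.text_generation(prompt, max_new_tokens=64)
```

With a timeout set, a slow model surfaces as an explicit exception rather than the opaque "time out" ValueError from the old InferenceAPI path.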
If anyone wants to pick up this topic, please let me know. I'd be happy to answer questions and review a PR if requested! :hugs:
@Wauplin Hello, thanks for a great answer! I want to play around with llama-2, not create a production-grade app (yet). Is that possible without setting up cloud compute and/or Inference Endpoints?