
Timeout when running Hugging Face LLMs with ConversationalRetrievalChain

Open parzival1l opened this issue 2 years ago • 29 comments

I am getting the following error when trying to query a ConversationalRetrievalChain backed by HuggingFaceHub: ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-3b time out

I can query the model through a simple LLMChain without issue, but querying over documents seems to be the problem. Any ideas?

This is the code:

from langchain import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

llm = HuggingFaceHub(repo_id="stabilityai/stablelm-tuned-alpha-3b",
                     model_kwargs={"temperature": 0, "max_length": 64})
qa2 = ConversationalRetrievalChain.from_llm(
    llm,
    vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True)
chat_history = []
query = "How do I load users from a thumb drive."
result = qa2({"question": query, "chat_history": chat_history})

vectordb is coming from Chroma.from_documents (using OpenAI embeddings on a custom PDF).

parzival1l avatar Apr 21 '23 06:04 parzival1l

Hi! I had this problem too, and I solved it by increasing max_length on the model: llm = HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 0, "max_length": 512})

With chunksize=5000 in the VectorstoreIndexCreator() the model can run.

This length limit differs per LLM, so it is important to mind these sizes. I recommend using other LLMs or adjusting the chunksize to run stabilityai/stablelm-tuned-alpha-3b.
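The chunk-size trade-off described above can be sketched in plain Python. This is a simplified, hypothetical stand-in for langchain's text splitters (the real splitters also respect separators such as newlines); it only illustrates how chunk_size and chunk_overlap interact:

```python
# Larger chunk_size -> fewer, bigger chunks (bigger prompts per LLM call);
# smaller chunk_size -> more chunks that each fit a tighter length limit.
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    with consecutive chunks sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With chunk_size=5000, a 20,000-character PDF yields only four chunks, which is the effect the comment above relies on.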

ygorbalves avatar Apr 22 '23 23:04 ygorbalves

Hi,

I actually have the same issue.

I changed the chunksize and also the LLM to

HuggingFaceHub(repo_id="togethercomputer/GPT-NeoXT-Chat-Base-20B", model_kwargs={"temperature": 0, "max_length": 500})

but I still get the same Timeout or Service Unavailable error.

Thanks for any help.

YHOYO avatar Apr 24 '23 21:04 YHOYO

I don't think this is related to ConversationalRetrievalChain, as I'm seeing the same issue when directly prompting a HuggingFaceHub object. For instance, when I do:

repo_id = "google/flan-t5-xl"
llm_hf = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature":0, "max_length":128})
print(llm_hf(text))

It hangs for a while then returns the timeout error. I'll ping HuggingFace's API directly to check whether the issue comes from HuggingFace or HuggingFaceHub.
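Pinging the API directly, as suggested above, can be done with the standard library alone (no langchain involved). The URL follows the public api-inference scheme and the payload shape is the standard {"inputs": ...} format; the token below is a placeholder:

```python
# Minimal direct call to the Hugging Face Inference API, to check whether
# the timeout originates from HF itself or from the langchain wrapper.
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"

def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build a POST request for the Inference API text-generation route."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("Who won the FIFA World Cup in 1994?", "hf_...your token...")
    # Generous client-side timeout so a server-side hang is clearly visible.
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.loads(resp.read()))
```

If this call also hangs, the problem is on the Inference API side rather than in langchain.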

bcrestel avatar Apr 26 '23 02:04 bcrestel

This is happening to me as well. I was able to get stabilityai/stablelm-base-alpha-3b to return something, but everything else is stuck; I just keep getting ValueError: Error raised by inference API: Model stabilityai/stablelm-tuned-alpha-7b time out

Here is my basic code to test that simply times out:

import os

from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

from utilities import load_api_key

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

repo_id = "stabilityai/stablelm-tuned-alpha-7b"
# repo_id = "stabilityai/stablelm-tuned-alpha-3b"
# repo_id = "stabilityai/stablelm-base-alpha-3b"
# repo_id = "google/flan-t5-xl"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0, "max_length": 64},
                     huggingfacehub_api_token=load_api_key("HUGGINGFACEHUB_API_TOKEN"))

template = """Question: {question}

Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who won the FIFA World Cup in the year 1994? "

print(llm_chain.run(question))

andymiller-og avatar Apr 26 '23 11:04 andymiller-og

This is happening for me as well. This Colab notebook for question answering worked for me a few days ago, but now when I enter a new query in cell 14 and run it, it will result in a timeout.

tjb4578 avatar Apr 28 '23 04:04 tjb4578

Is there any solution yet?

from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(search_kwargs={"k": 1}),
                                    input_key="question")

ValueError: Error raised by inference API: Model google/flan-t5-xl time out

phdykd avatar May 02 '23 17:05 phdykd

Hi friends, has anyone had any luck fixing the "time out" issue with google/flan-t5-xl? I need help if you have sorted it out. Thanks.

phdykd avatar May 03 '23 15:05 phdykd

+1

pradosh-abd avatar May 04 '23 12:05 pradosh-abd

Use this one: google/flan-t5-xxl, with a temperature greater than 0 (such as 0.7, 0.5, or 1).

phdykd avatar May 04 '23 13:05 phdykd

I am having the same issue. Thanks @phdykd for the workaround, but do we know the root cause of this issue? Thanks.

chaitanya2593 avatar May 07 '23 12:05 chaitanya2593

I think I am having the same problem, but I'm not getting a timeout message. It's just hanging for a long time. I'm using Replit.

nkrefman avatar May 09 '23 05:05 nkrefman

flan-t5-small is working, but as expected the output is not great. flan-t5-xxl is working, but its max_length of 512 makes it hard to pass a few-shot template, and without one the answers aren't that good.

I observed the call getting stuck in ssl.py. I will update if anything is found; any help on this is appreciated.

pradosh-abd avatar May 09 '23 16:05 pradosh-abd

flan-t5-xxl is not working for me; it cannot be found. Still hitting the timeout issue... any solution?

phdykd avatar May 09 '23 17:05 phdykd

Same here. Below is the code I copied from the documentation:

from langchain import HuggingFaceHub, PromptTemplate, LLMChain
from langchain.chains.question_answering import load_qa_chain

repo_id = "google/flan-t5-xl"

llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0, "max_length": 64})

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

# question = "2+2 = ? "
question = "Who won the FIFA World Cup in the year 1994? "
print(llm_chain.run(question))

It works smoothly. But if I change the question to "Who won the FIFA World Cup in the year 1998? ", the timeout error happens.

alexzhaosheng avatar May 10 '23 13:05 alexzhaosheng

Also receiving timeouts for any Flan model larger than large. Base and small are loading correctly.

canivel avatar May 10 '23 14:05 canivel

This is happening for me as well. This Colab notebook for question answering worked for me a few days ago, but now when I enter a new query in cell 14 and run it, it will result in a timeout.

Yes. I got the same error:

ValueError: Error raised by inference API: Model google/flan-t5-xl time out

Tried other LLMs as well, same error.

spinning27 avatar May 10 '23 19:05 spinning27

I also had similar errors with various Flan LLMs. Adding max_length sorted out the timeout errors.

llm_hf = HuggingFaceHub(
        repo_id="google/flan-t5-xxl",
        model_kwargs={"temperature":0.9,"max_length":64 }
)

iriscortex avatar May 11 '23 22:05 iriscortex

For some reason, max_length does not make a difference with the t5-xl timeout. Anyway, t5-xxl works for now.

elvinagam avatar May 13 '23 05:05 elvinagam

I have the same issue for a variety of models with various temperatures & max_length. I wonder if it has to do with HuggingFace limiting API calls from a single IP?

philmui avatar May 14 '23 05:05 philmui

@philmui I don't think this error is intentional rate limiting or anything like that. Obviously, handling LLMs at scale can bring resource problems. Better to open the same issue in the HF repo as well so that they are aware.

elvinagam avatar May 14 '23 09:05 elvinagam

I got the same problem with "google/flan-t5-xl" :/

AlexRoman938 avatar Jun 01 '23 17:06 AlexRoman938

Getting the same issue for StableLM, FLAN, or basically any model.

I utilized HuggingFacePipeline to run inference locally, and that works as intended, but I just cannot get it to run from the HF Hub.
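For reference, the local workaround mentioned above can be sketched like this. The model id is an illustrative assumption (any small model keeps the download manageable), and flan-t5 models use the text2text-generation task; this requires transformers installed locally:

```python
# Sketch: run the model locally via langchain's HuggingFacePipeline instead
# of calling the remote Inference API. Downloads the model weights on first use.
def make_local_llm(model_id: str = "google/flan-t5-small"):
    # Imported lazily so the sketch can be read without transformers installed.
    from langchain.llms import HuggingFacePipeline
    return HuggingFacePipeline.from_model_id(
        model_id=model_id,
        task="text2text-generation",  # flan-t5 is a seq2seq (text2text) model
        model_kwargs={"temperature": 0, "max_length": 64},
    )

if __name__ == "__main__":
    llm = make_local_llm()
    print(llm("Who won the FIFA World Cup in the year 1994?"))
```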

Any idea how to solve this?

If I get a dedicated endpoint, can that be used with langchain? If so, how? There does not seem to be a URL parameter I can pass, only repo_id or model name.
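On the endpoint question: at the time of this thread langchain shipped a HuggingFaceEndpoint class that takes an endpoint_url instead of a repo_id. A minimal sketch, with placeholder URL and token (not verified against a live endpoint):

```python
# Sketch: pointing langchain at a dedicated Inference Endpoint rather than
# the free Inference API. URL and token below are placeholders.
def make_endpoint_llm(endpoint_url: str, token: str):
    # Imported lazily so the sketch can be read without langchain installed.
    from langchain.llms import HuggingFaceEndpoint
    return HuggingFaceEndpoint(
        endpoint_url=endpoint_url,  # e.g. "https://xxxx.endpoints.huggingface.cloud"
        task="text-generation",
        huggingfacehub_api_token=token,
        model_kwargs={"temperature": 0.1, "max_length": 256},
    )

if __name__ == "__main__":
    llm = make_endpoint_llm(
        "https://<your-endpoint>.endpoints.huggingface.cloud", "hf_...")
    print(llm("Who won the FIFA World Cup in the year 1994?"))
```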

Abe410 avatar Jun 01 '23 23:06 Abe410

google/flan-t5-xxl is working for me, although I also had to set temperature to a minimum of 0.1 to get it to work: llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.1, "max_length": 64})

yomhb avatar Jun 02 '23 23:06 yomhb

You need to upgrade your Hugging Face account to the Pro version to host the large models for inference.

"google/flan-t5-base" works for the free account.

yitao416 avatar Jun 29 '23 14:06 yitao416

@yitao416 wins. All the problems I had with these LLMs vanished with this. To be clear, the answers it gives you are super short, but it WORKS. Bad on HF for not making the Pro requirement clearer.

UserHIJ avatar Jun 30 '23 17:06 UserHIJ

To avoid truncated answers, you have to change the call method in the relevant class in langchain.


Abe410 avatar Jun 30 '23 17:06 Abe410

ValueError: Error raised by inference API: Input validation error: inputs tokens + max_new_tokens must be <= 1512. Given: 3476 inputs tokens and 20 max_new_tokens

This happens when I upload a PDF file of about 1 MB. How can I resolve this issue?
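The error above means the retrieved context plus max_new_tokens exceeds the model's 1512-token window; shrinking the retriever (e.g. search_kwargs={"k": 1}) or splitting with a smaller chunk_size before indexing usually fixes it. A rough stdlib-only guard, using a crude 4-characters-per-token heuristic rather than the model's real tokenizer:

```python
# Rough pre-flight check mirroring the API's validation rule:
# input_tokens + max_new_tokens must be <= limit.
def fits_context(prompt: str, max_new_tokens: int, limit: int = 1512,
                 chars_per_token: int = 4) -> bool:
    """Estimate whether a prompt fits the model's token window.
    chars_per_token=4 is a crude English-text heuristic, not a tokenizer."""
    est_tokens = len(prompt) // chars_per_token + 1
    return est_tokens + max_new_tokens <= limit
```

If fits_context returns False, retrieve fewer chunks or re-index with a smaller chunk_size; the 3476-token input in the error above corresponds to far too much retrieved context for this model.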

kuxall avatar Jul 04 '23 15:07 kuxall

The Salesforce XGen model hangs for me, as do all Stability AI LLMs. Perhaps the limits should be printed?

HashemAlsaket avatar Jul 06 '23 03:07 HashemAlsaket

Hi everyone, maintainer of the huggingface_hub library here :wave:

Langchain's current implementation relies on InferenceAPI. This is a free service from Hugging Face to help folks quickly test and prototype things using ML models hosted on the Hub. It is not specific to LLMs and can run a large variety of models from transformers, diffusers, sentence-transformers, .... While this service is free for everyone, different rate limits apply to unregistered, registered, and PRO users. However, the execution time is supposed to be the same no matter the plan (i.e. either you get an HTTP 429 rate-limit error, or the request is processed correctly).

That being said, InferenceAPI is NOT meant to be a production-ready product. As said before, its primary goal is to let users quickly test models. Only models of a "small-enough" size can be hosted on demand in the InferenceAPI infrastructure. In the context of LLMs, this constraint makes most models unavailable. We still support a manually curated list of first-class open-source LLMs (bigcode/santacoder, OpenAssistant/oasst-sft-1-pythia-12b, tiiuae/falcon-7b-instruct, and timdettmers/guanaco-33b-merged, to name a few). The list of running LLMs can be found here. In the backend, LLMs run on our text-generation-inference framework.

If one needs a production-ready environment, they should use our Inference Endpoints solution. This is a paid service to run models (including LLMs) on dedicated auto-scaling infrastructure managed by Hugging Face. Any model from the Hub can be loaded.


Now that the difference between InferenceAPI and Inference Endpoints is clearer, let's dig into this issue more specifically. Langchain depends on the InferenceAPI client from huggingface_hub. This client will soon be deprecated in favor of InferenceClient. This latter client is more recent and can handle InferenceAPI, Inference Endpoints, or even AWS SageMaker solutions. It is meant to be easier for end users, especially with automatic parsing of inputs/outputs for each task.

What I would suggest is to rewrite the HuggingFaceHub(LLM) class in langchain to use the new text-generation task. A migration guide can be found here. The advantages for the end user would be better handling of timeouts, better handling of inputs, and the possibility to switch from InferenceAPI to Inference Endpoints (or any URL-based solution) depending on the use case (prototyping vs. production-ready).
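A minimal sketch of that migration, assuming huggingface_hub with InferenceClient is installed; the model id and token are placeholders, and the same function works with either a Hub model id or a dedicated endpoint URL:

```python
# Sketch: text generation through huggingface_hub's InferenceClient, which
# accepts either a Hub model id (free InferenceAPI) or an endpoint URL
# (dedicated Inference Endpoints).
def generate(prompt, model_or_url, token=None, max_new_tokens=64):
    # Imported lazily so the sketch can be read without huggingface_hub installed.
    from huggingface_hub import InferenceClient
    client = InferenceClient(model=model_or_url, token=token, timeout=120)
    return client.text_generation(prompt, max_new_tokens=max_new_tokens)

if __name__ == "__main__":
    print(generate("Who won the FIFA World Cup in 1994?",
                   "tiiuae/falcon-7b-instruct", token="hf_..."))
```

The explicit timeout parameter addresses the hangs reported throughout this thread: the client raises promptly instead of blocking indefinitely in ssl.py.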

If anyone wants to pick up this topic, please let me know. I'd be happy to answer questions and review a PR if requested! :hugs:

Wauplin avatar Jul 11 '23 13:07 Wauplin

@Wauplin Hello, thanks for a great answer! I want to play around with llama-2, not create a production-grade app (yet). Is that possible without setting up cloud compute and/or Inference Endpoints?

MatousEibich avatar Jul 21 '23 06:07 MatousEibich