Failed to parse output.
Describe the bug
Local LLMs either raise a Timeout error or fail to parse the output.
Ragas version: 0.1.15
Python version: 3.11.3
Code to Reproduce
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
import pandas as pd
df = pd.read_csv("output.csv", sep=";")
data_samples = {
'question': df['question'].tolist(),
'answer': df['answer'].tolist(),
'contexts': df['contexts'].apply(lambda x: [x] if isinstance(x, str) else x).tolist(),
'ground_truth': df['ground_truth'].tolist()
}
from datasets import Dataset
dataset = Dataset.from_dict(data_samples)
from ragas import evaluate
from ragas.metrics import (faithfulness,
answer_correctness,
answer_relevancy,
context_recall,
context_precision)
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.embeddings import HuggingFaceEmbeddings
end = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.3", max_new_tokens=512)
huggingface_llm = ChatHuggingFace(llm=end, tokenizer=tokenizer)
huggingface_embeddings = HuggingFaceEmbeddings(model_name="nomic-ai/nomic-embed-text-v1.5")
metrics=[faithfulness,
answer_correctness,
answer_relevancy,
context_recall,
context_precision]
score = evaluate(dataset=dataset,
metrics=metrics,
llm=huggingface_llm,
embeddings=huggingface_embeddings,
raise_exceptions=False
)
Error trace
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[304]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
field required (type=value_error.missing)
__root__ -> 0 -> verdict
field required (type=value_error.missing))
[... the same "Failed to parse output. Returning None." warnings and identical ContextPrecisionVerifications ValidationErrors repeat for many more jobs (Job[444], Job[169], Job[309], Job[174], Job[449], Job[179], Job[314], Job[184], Job[454]) ...]
Exception raised in Job[461]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[196]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[462]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Additional context
The evaluation only returns scores for answer_correctness; all other metric values are NaN.
To resolve the "Failed to parse output" and "ValidationError" issues when using the evaluate function from the ragas library with the HuggingFaceEndpoint and Mistral-7B-Instruct-v0.3 model, you can follow these steps:
- Ensure JSON Schema Compliance: Make sure that the output from the model strictly adheres to the JSON schema expected by the PydanticOutputParser. The get_json_format_instructions function provides a template for the expected JSON format. Ensure that the model's output matches this format.
- Use the FIX_OUTPUT_FORMAT Prompt: The RagasoutputParser class has a mechanism to retry parsing the output if it fails initially. It uses the FIX_OUTPUT_FORMAT prompt to instruct the model to correct its output format. Ensure that this mechanism is properly utilized.
- Increase max_retries: If the output often fails to parse correctly, consider increasing the max_retries parameter in the aparse method to give the model more chances to correct its output.
Here is an example of how you might adjust the RagasoutputParser class to handle these issues:
class RagasoutputParser(PydanticOutputParser):
    async def aparse(  # type: ignore
        self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 3  # increase retries
    ):
        try:
            output = super().parse(result)
        except OutputParserException:
            if max_retries != 0:
                # Ask the LLM to fix its own malformed output, then retry parsing
                p_value = FIX_OUTPUT_FORMAT.format(
                    prompt=prompt.to_string(), completion=result
                )
                output = await llm.generate(p_value)
                result = output.generations[0][0].text
                return await self.aparse(result, prompt, llm, max_retries - 1)
            else:
                logger.warning("Failed to parse output. Returning None.")
                return None
        return output
By increasing the max_retries parameter, you give the model more opportunities to correct its output format, which can help mitigate the "Failed to parse output" and "ValidationError" issues [1].
Additionally, there are specific configurations or parameters for the evaluate function in the ragas library that could affect the output parsing or model response. These include:
- metrics: List of metrics to use for evaluation.
- llm: The language model to use for the metrics.
- embeddings: The embeddings to use for the metrics.
- callbacks: Lifecycle Langchain callbacks to run during evaluation.
- in_ci: Whether the evaluation is running in CI or not.
- run_config: Configuration for runtime settings like timeout and retries.
- token_usage_parser: Parser to get the token usage from the LLM result.
- raise_exceptions: Whether to raise exceptions or not.
- column_map: The column names of the dataset to use for evaluation.
These parameters can influence how the evaluation is conducted and how the results are parsed and returned [2][3].
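For instance, a run configuration with fewer workers, a longer timeout, and more retries can be passed to evaluate. A minimal sketch, where the numbers below are only illustrative, not tuned recommendations:

from ragas import evaluate
from ragas.run_config import RunConfig

# Illustrative values: fewer parallel requests, a longer per-call timeout,
# and more retries give slow or rate-limited endpoints a better chance to answer.
run_config = RunConfig(max_workers=4, timeout=180, max_retries=10, max_wait=60)

score = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=huggingface_llm,
    embeddings=huggingface_embeddings,
    run_config=run_config,
    raise_exceptions=False,
)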
For the ContextPrecisionVerifications class, the expected JSON schema is:
{
"__root__": [
{
"reason": "string",
"verdict": 0 or 1
}
]
}
Each ContextPrecisionVerification object contains the fields reason (a string) and verdict (an integer, 0 or 1) [4].
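For reference, that schema maps to a pydantic v1 root model roughly like the one below (the class and field names are taken from the error trace; the exact definition inside ragas may differ slightly):

from typing import List
from pydantic import BaseModel

class ContextPrecisionVerification(BaseModel):
    reason: str   # justification for the verdict
    verdict: int  # 0 or 1

class ContextPrecisionVerifications(BaseModel):
    __root__: List[ContextPrecisionVerification]

If the model emits a list whose items are missing either field, validation fails with exactly the "__root__ -> 0 -> reason / verdict field required" errors seen in the trace above.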
Lastly, regarding the ClientResponseError with a 429 status code, it indicates that you are hitting the rate limits of the HuggingFace API. You might need to implement rate limiting and retry logic to handle these errors gracefully. Unfortunately, I couldn't find specific details on how the HuggingFaceEndpoint class handles these errors within the repository.
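As a generic illustration (this helper is not part of ragas or langchain, and the delays are arbitrary), retry logic with exponential backoff for 429 responses could look like this:

import time

def call_with_backoff(fn, max_attempts=5, base_delay=2.0):
    # Retry fn() with exponential backoff; in practice you would catch the
    # specific rate-limit exception (e.g. a 429 ClientResponseError) rather
    # than a bare Exception.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

Within ragas itself, the run_config retry settings mentioned above serve a similar purpose, so tightening those is often the simpler option.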
To continue talking to Dosu, mention @dosu.
@g-hano were you able to resolve this issue?
@bithikaj No, I won't use Ragas anymore; I always get timeout or parse errors.
hey @bithikaj which version are you using? I would suggest upgrading to v0.2 - we made some improvements in the prompts to fix this
hey @g-hano, I'm really sorry to hear that. It's true we had many small bugs like this which hurt the experience, but we have fixed a few common ones in v0.2. If you check it out or need any help migrating, let me know. Would love your feedback 🙂
@jjmachan I am using v0.2.1 - getting many timeout errors with aws bedrock
what rate limits do you have with your AWS Bedrock plan?
could you also check out: https://docs.ragas.io/en/stable/howtos/customizations/_run_config
that should help reduce the timeout errors - could you try it out and let me know if it works? else we'll explore more
@jjmachan already added a run config - RunConfig(max_workers=16, timeout=180) - and was still getting errors on 33% of my eval dataset. Would setting timeout to None make the timeout infinite?
rate limits on AWS Bedrock for Claude Haiku are: requests per minute = 1,000, tokens per minute = 2,000,000
@bithikaj nope but I think the best way would be to reduce the max_workers - did you fix this?
Closing this as resolved using the latest versions.
Some configs need to be fixed on the LLM end.
I commented in another issue #1681 that these errors still occur for me in ragas 0.3.5 using a 7B instruct and a 14B instruct model. Also, my evaluations are a little suspicious, as cases where my context clearly contains the answer to a query are still scored really low in context precision/recall (and I'm talking about 0 or close-to-0 values).
As I mentioned in my comment on the referenced issue, "I guess there is a case study for RAGAS in the use of local models, especially Small Language Models, and their limitations when acting as judges."
@SamuTheCoder Please try with 0.3.9 and instead of metrics, use metrics.collections