Failed to parse output.
Describe the bug
Local LLMs either raise a Timeout error or fail to parse the output.
Ragas version: 0.1.15
Python version: 3.11.3
Code to Reproduce
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
import pandas as pd
df = pd.read_csv("output.csv", sep=";")
data_samples = {
'question': df['question'].tolist(),
'answer': df['answer'].tolist(),
'contexts': df['contexts'].apply(lambda x: [x] if isinstance(x, str) else x).tolist(),
'ground_truth': df['ground_truth'].tolist()
}
from datasets import Dataset
dataset = Dataset.from_dict(data_samples)
from ragas import evaluate
from ragas.metrics import (faithfulness,
answer_correctness,
answer_relevancy,
context_recall,
context_precision)
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.embeddings import HuggingFaceEmbeddings
end = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.3", max_new_tokens=512)
huggingface_llm = ChatHuggingFace(llm=end, tokenizer=tokenizer)
huggingface_embeddings = HuggingFaceEmbeddings(model_name="nomic-ai/nomic-embed-text-v1.5")
metrics=[faithfulness,
answer_correctness,
answer_relevancy,
context_recall,
context_precision]
score = evaluate(dataset=dataset,
metrics=metrics,
llm=huggingface_llm,
embeddings=huggingface_embeddings,
raise_exceptions=False
)
Error trace
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Exception raised in Job[304]: ValidationError(2 validation errors for ContextPrecisionVerifications
__root__ -> 0 -> reason
field required (type=value_error.missing)
__root__ -> 0 -> verdict
field required (type=value_error.missing))
[... the same "Failed to parse output. Returning None." warnings and identical ContextPrecisionVerifications ValidationErrors repeat for many more jobs (Job[444], Job[169], Job[309], Job[174], Job[449], Job[179], Job[314], Job[184], Job[454]) ...]
Exception raised in Job[461]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[196]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Exception raised in Job[462]: ClientResponseError(429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3'))
Additional context
The evaluation only returns scores for answer_correctness; all other metric values are NaN.
To resolve the "Failed to parse output" and "ValidationError" issues when using the evaluate function from the ragas library with the HuggingFaceEndpoint and Mistral-7B-Instruct-v0.3 model, you can follow these steps:
- Ensure JSON Schema Compliance: Make sure that the output from the model strictly adheres to the JSON schema expected by the PydanticOutputParser. The get_json_format_instructions function provides a template for the expected JSON format. Ensure that the model's output matches this format.
- Use the FIX_OUTPUT_FORMAT Prompt: The RagasoutputParser class has a mechanism to retry parsing the output if it fails initially. It uses the FIX_OUTPUT_FORMAT prompt to instruct the model to correct its output format. Ensure that this mechanism is properly utilized.
- Increase max_retries: If the output often fails to parse correctly, consider increasing the max_retries parameter in the aparse method to give the model more chances to correct its output.
Here is an example of how you might adjust the RagasoutputParser class to handle these issues:
class RagasoutputParser(PydanticOutputParser):
    async def aparse(  # type: ignore
        self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 3  # increase retries
    ):
        try:
            output = super().parse(result)
        except OutputParserException:
            if max_retries != 0:
                # Ask the LLM to fix its own malformed output, then retry parsing
                p_value = FIX_OUTPUT_FORMAT.format(
                    prompt=prompt.to_string(), completion=result
                )
                output = await llm.generate(p_value)
                result = output.generations[0][0].text
                return await self.aparse(result, prompt, llm, max_retries - 1)
            else:
                logger.warning("Failed to parse output. Returning None.")
                return None
        return output
By increasing the max_retries parameter, you give the model more opportunities to correct its output format, which can help mitigate the "Failed to parse output" and "ValidationError" issues [1].
Additionally, there are specific configurations or parameters for the evaluate function in the ragas library that could affect the output parsing or model response. These include:
- metrics: List of metrics to use for evaluation.
- llm: The language model to use for the metrics.
- embeddings: The embeddings to use for the metrics.
- callbacks: Lifecycle Langchain callbacks to run during evaluation.
- in_ci: Whether the evaluation is running in CI or not.
- run_config: Configuration for runtime settings like timeout and retries.
- token_usage_parser: Parser to get the token usage from the LLM result.
- raise_exceptions: Whether to raise exceptions or not.
- column_map: The column names of the dataset to use for evaluation.
These parameters can influence how the evaluation is conducted and how the results are parsed and returned [2][3].
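For instance, a run configuration with fewer workers, a longer timeout, and more retries can be passed to evaluate. A minimal sketch, where the numbers below are only illustrative, not tuned recommendations:

from ragas import evaluate
from ragas.run_config import RunConfig

# Illustrative values: fewer parallel requests, a longer per-call timeout,
# and more retries give slow or rate-limited endpoints a better chance to answer.
run_config = RunConfig(max_workers=4, timeout=180, max_retries=10, max_wait=60)

score = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=huggingface_llm,
    embeddings=huggingface_embeddings,
    run_config=run_config,
    raise_exceptions=False,
)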
For the ContextPrecisionVerifications class, the expected JSON schema is:
{
"__root__": [
{
"reason": "string",
"verdict": 0 or 1
}
]
}
Each ContextPrecisionVerification object contains the fields reason (a string) and verdict (an integer, 0 or 1) [4].
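For reference, that schema maps to a pydantic v1 root model roughly like the one below (the class and field names are taken from the error trace; the exact definition inside ragas may differ slightly):

from typing import List
from pydantic import BaseModel

class ContextPrecisionVerification(BaseModel):
    reason: str   # justification for the verdict
    verdict: int  # 0 or 1

class ContextPrecisionVerifications(BaseModel):
    __root__: List[ContextPrecisionVerification]

If the model emits a list whose items are missing either field, validation fails with exactly the "__root__ -> 0 -> reason / verdict field required" errors seen in the trace above.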
Lastly, regarding the ClientResponseError with a 429 status code, it indicates that you are hitting the rate limits of the HuggingFace API. You might need to implement rate limiting and retry logic to handle these errors gracefully. Unfortunately, I couldn't find specific details on how the HuggingFaceEndpoint class handles these errors within the repository.
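As a generic illustration (this helper is not part of ragas or langchain, and the delays are arbitrary), retry logic with exponential backoff for 429 responses could look like this:

import time

def call_with_backoff(fn, max_attempts=5, base_delay=2.0):
    # Retry fn() with exponential backoff; in practice you would catch the
    # specific rate-limit exception (e.g. a 429 ClientResponseError) rather
    # than a bare Exception.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

Within ragas itself, the run_config retry settings mentioned above serve a similar purpose, so tightening those is often the simpler option.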
To continue talking to Dosu, mention @dosu.
@g-hano were you able to resolve this issue?
@bithikaj No, I won't use Ragas anymore; I always get timeout or parse errors.
hey @bithikaj which version are you using? I would suggest upgrading to v0.2 - we made some improvements in the prompts to fix this
hey @g-hano, I'm really sorry to hear that. It's true we had many small bugs like this which hurt the experience, but we have fixed a few common ones in v0.2. If you check it out or need any help migrating, let me know. Would love your feedback 🙂
@jjmachan I am using v0.2.1 - getting many timeout errors with aws bedrock
what rate limits do you have with your AWS Bedrock plan?
could you also check out: https://docs.ragas.io/en/stable/howtos/customizations/_run_config
that should help reduce the timeout errors - could you try it out and let me know if it works? else we'll explore more
@jjmachan already added a run config - RunConfig(max_workers=16, timeout=180) - and was still getting errors on 33% of my eval dataset. Would setting timeout to None make the timeout infinite?
rate limits on AWS Bedrock for Claude Haiku are: requests per minute = 1,000, tokens per minute = 2,000,000
@bithikaj nope but I think the best way would be to reduce the max_workers - did you fix this?
Closing this as resolved using the latest versions.
Some configs need to be fixed on the LLM end.
I commented in another issue #1681 that these errors still occur for me in ragas 0.3.5 using a 7B instruct and a 14B instruct model. Also, my evaluations are a little suspicious, as cases where my context clearly contains the answer to a query are still scored really low in context precision/recall (and I'm talking about 0 or close-to-0 values).
As I mentioned in my comment on the referenced issue, "I guess there is a case study for RAGAS in the use of local models, especially Small Language Models, and their limitations when acting as judges."
@SamuTheCoder Please try with 0.3.9 and instead of metrics, use metrics.collections