
Invalid response format. Expected a list of dictionaries with keys 'verdict'

Open ajaychinni opened this issue 1 year ago • 15 comments

Ragas version: 0.1.1
Python version: 3.8

Describe the bug

Getting all NaN scores. I am using an open-source LLM hosted with vLLM's OpenAI-compatible server, HuggingfaceEmbeddings, and a custom sample dataset.

Code to Reproduce

Successfully ran vLLM:

python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --tensor-parallel-size 4 --port 8000 --enforce-eager --gpu-memory-utilization 0.99

from langchain_openai.chat_models import ChatOpenAI
from datasets import Dataset
inference_server_url = "http://localhost:8000/v1"

# create a LangChain ChatOpenAI client pointed at the vLLM OpenAI-compatible server
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)

from ragas.embeddings import HuggingfaceEmbeddings
hf_embeddings = HuggingfaceEmbeddings(model_name="BAAI/bge-small-en")

# My dataset looks like this
data = {
    'question': ['What did the president say about gun violence?'],
    'answer': ['I do not have access to the most recent statements made by the president. please provide me with the specific date or context of the statement you are referring to.'],
    'contexts': [['The president asked Congress to pass proven measures to reduce gun violence.']],
    'ground_truth': ['The president asked Congress to pass proven measures to reduce gun violence.']
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

from ragas import evaluate
result = evaluate(
    dataset=dataset,
    llm=chat,
    embeddings=hf_embeddings,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
df = result.to_pandas()

Error trace

Invalid response format. Expected a list of dictionaries with keys 'verdict'
Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.8/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])

ajaychinni avatar Feb 15 '24 09:02 ajaychinni

I've got the same issue with Mixtral 8x7B.

Sometimes the LLM will generate classification: [{"statement_1": "England won the 2022 ICC Men's T20 World Cup.", "reason": "From context it is clear that England defeated Pakistan to win the World Cup.", "Attributed": "1"}]

instead of the expected [{"statement_1": "England won the 2022 ICC Men's T20 World Cup.", "reason": "From context it is clear that England defeated Pakistan to win the World Cup.", "Attributed": "1"}]

or "verification": { "reason": "the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.", "verdict": "0", }

instead of {"reason": "the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.", "verdict": "0"}

stepoibm avatar Feb 15 '24 15:02 stepoibm
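
These near-miss outputs are often only a token or two away from valid JSON. As a rough illustration only (this is not part of ragas, and the function name coerce_json is made up here), the kind of post-processing that can sometimes rescue them looks like this; patching the prompts themselves, as in the fork linked further down, is the more robust fix.

import json
import re

def coerce_json(raw: str):
    """Best-effort cleanup for near-miss LLM output like the examples above:
    a leading label such as 'classification:', a trailing comma before a
    closing brace, or a single wrapping key such as 'verification'."""
    text = raw.strip()
    # Drop a leading label ("classification:", etc.) in front of the JSON itself.
    text = re.sub(r"^[A-Za-z_' ]+:\s*(?=[\[{])", "", text)
    # Remove trailing commas like '"verdict": "0", }'.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    parsed = json.loads(text)
    # Unwrap a single wrapping key, e.g. {"verification": {...}} -> {...}.
    if isinstance(parsed, dict) and len(parsed) == 1:
        inner = next(iter(parsed.values()))
        if isinstance(inner, (dict, list)):
            parsed = inner
    return parsed

print(coerce_json('classification: [{"statement_1": "...", "Attributed": "1"}]'))
print(coerce_json('{"verification": {"reason": "...", "verdict": "0", }}'))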

Hey team,

Just checking in because I'm a bit stuck. Not sure if I'm missing something obvious or if it's something you guys are already fixing up. Would appreciate any pointers or info you could share.

ajaychinni avatar Feb 19 '24 04:02 ajaychinni

adding @shahules786 here for his expertise

this doesn't seem like something from our side, mostly because the model is not following the JSON format as expected. It is a blocker, though.

jjmachan avatar Feb 19 '24 05:02 jjmachan

I've tried to use mistralai/Mixtral-8x7B-Instruct-v0.1 but I'm still encountering the same issue. Although Mixtral is comparable to GPT-3.5, the problem persists.

ajaychinni avatar Feb 19 '24 06:02 ajaychinni

@ajaychinni I've "fixed" the issues on a fork for myself: https://github.com/explodinggradients/ragas/pull/633

This should fix your "Attributed" and "verdict" issues

stepoibm avatar Feb 19 '24 14:02 stepoibm

@stepoibm Thanks for the update. I just checked your fork, but I am still getting the error; this time I checked only context_precision:

Invalid response format. Expected a list of dictionaries with keys 'verdict'. Responses [{}]

I am using TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ with max-model-len 4096

ajaychinni avatar Feb 20 '24 06:02 ajaychinni

Same problem with models like gpt2 and google/flan-t5-large from Hugging Face. My code is:

from langchain_community.llms import HuggingFaceHub

model = HuggingFaceHub(repo_id="google/flan-t5-large", huggingfacehub_api_token="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")
print(evaluate(dataset, llm=model))

I am getting:

/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])
{'answer_relevancy': nan, 'context_precision': nan, 'faithfulness': nan, 'context_recall': nan}

The same dataset works with openai.

With other models like gpt2 in HF I am getting:

Evaluating: 0%| | 0/4 [00:00<?, ?it/s]
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/usr/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/base.py", line 91, in ascore
    raise e
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/base.py", line 87, in ascore
    score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
  File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/_faithfulness.py", line 190, in _ascore
    assert isinstance(statements, dict), "Invalid JSON response"
AssertionError: Invalid JSON response

Traceback (most recent call last):
  File "/usr/local/google/home/yvirin/rag/evaluation/test_ragas.py", line 60, in <module>
    main()
  File "/usr/local/google/home/yvirin/rag/evaluation/test_ragas.py", line 56, in main
    print(evaluate(dataset, llm=model))

yan-virin avatar Feb 22 '24 20:02 yan-virin
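
A quick sanity check that can save time here (just a sketch, not one of ragas's actual prompts, and the token is a placeholder): ask the model for a trivial piece of JSON directly and look at the raw completion. If it can't manage that, the JSON-based metrics will almost certainly come back as NaN.

import json

from langchain_community.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    huggingfacehub_api_token="<your-hf-token>",  # placeholder
)

# Ask for a trivial JSON object and check whether the raw completion parses.
raw = llm.invoke('Return only JSON: {"statements": ["one short factual statement"]}')
print(repr(raw))

try:
    json.loads(raw)
    print("model produced parseable JSON")
except json.JSONDecodeError:
    print("model output is not valid JSON; expect NaN from the JSON-based metrics")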

Is there someone who was able to run RAGAS without openai api key successfully? If so, could you please share your setup?

yan-virin avatar Feb 23 '24 22:02 yan-virin

hey, do take a look at the docs here: Bring Your Own LLMs and Embeddings

that should help you out, @yan-virin. Or are you facing the JSON issue?

jjmachan avatar Feb 27 '24 06:02 jjmachan
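
For reference, the "Bring Your Own LLMs and Embeddings" setup boils down to roughly the sketch below for ragas 0.1.x. The wrapper names are taken from the 0.1.x docs as I understand them; evaluate() also accepts the LangChain objects directly, which is what the snippet at the top of this issue does.

from datasets import Dataset
from langchain_openai.chat_models import ChatOpenAI
from ragas import evaluate
from ragas.embeddings import HuggingfaceEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Minimal single-row dataset, same shape as the one earlier in the thread.
dataset = Dataset.from_dict({
    "question": ["What did the president say about gun violence?"],
    "answer": ["He asked Congress to pass measures to reduce gun violence."],
    "contexts": [["The president asked Congress to pass proven measures to reduce gun violence."]],
    "ground_truth": ["The president asked Congress to pass proven measures to reduce gun violence."],
})

# LangChain client pointed at a locally hosted OpenAI-compatible server (e.g. vLLM).
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base="http://localhost:8000/v1",
    temperature=0,
)

result = evaluate(
    dataset=dataset,
    llm=LangchainLLMWrapper(chat),
    embeddings=HuggingfaceEmbeddings(model_name="BAAI/bge-small-en"),
    metrics=[faithfulness],
)
print(result)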

@jjmachan While other models should technically work, in practice it seems like they just don't. I am wondering whether anyone in the community is actually using another model. The JSON issue is more profound than you think: other LLMs are not responding well to your prompts, and the code in the metrics needs to be adapted to each LLM separately.

yan-virin avatar Feb 27 '24 17:02 yan-virin

@yan-virin please do update if you manage to run other models successfully. I have been trying Llama 2 via ChatOllama (part of LangChain) as the LLM and OllamaEmbeddings(model="llama2") as the embedding model, and only the context_recall metric seems to be working. The other metrics give NaN as output.

Ritesh-2001 avatar Mar 05 '24 06:03 Ritesh-2001

We are using Mixtral.

The issue is that emitting the expected JSON perfectly is super fragile for other models. That's not a RAGAS issue, but a model issue.

Llama didn't work for us; only Mixtral 8x7B seems to have had limited success.

For RAGAS with Mixtral we had to fix the issues in the prompts, such as context_recall expecting a capitalized JSON key "Attributed" (which is non-standard) or giving a "classification" array OR dictionary as few-shot examples.

stepoibm avatar Mar 05 '24 06:03 stepoibm
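
For anyone who wants to see exactly which JSON keys a metric expects (the capitalized "Attributed" mentioned above, for instance), a rough way to inspect the few-shot prompts a ragas 0.1.x metric carries is sketched below. The attribute holding the prompt may be named differently across versions, so the sketch scans for anything prompt-like instead of hard-coding a name.

from ragas.metrics import context_recall

# ragas 0.1.x metrics keep their few-shot prompts as attributes on the metric
# instance; printing them shows the exact JSON keys the model must reproduce.
for name, value in vars(context_recall).items():
    if "prompt" in name.lower():
        print(name)
        print(value.to_string() if hasattr(value, "to_string") else value)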

Same issue with Bedrock models, e.g. "anthropic.claude-v2"

ogencoglu avatar Mar 05 '24 21:03 ogencoglu

I am also facing a similar issue. Any idea how to use an open LLM from Hugging Face for ragas evaluation?

larunach avatar Mar 26 '24 00:03 larunach

hey, do take a look at the docs here: Bring Your Own LLMs and Embeddings

that should help you out, @yan-virin. Or are you facing the JSON issue?

This seems to be a recurring issue. Any fix for using custom LLMs? Most of them give only NaN as output.

larunach avatar Mar 26 '24 22:03 larunach