Invalid response format. Expected a list of dictionaries with keys 'verdict'
Ragas version: 0.1.1
Python version: 3.8
Describe the bug
Getting all NaN scores. I am using an open-source LLM hosted with the vLLM OpenAI-compatible server, HuggingfaceEmbeddings, and a custom sample dataset.
Code to Reproduce
Successfully ran vLLM:
python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --tensor-parallel-size 4 --port 8000 --enforce-eager --gpu-memory-utilization 0.99
from langchain_openai.chat_models import ChatOpenAI
from datasets import Dataset

inference_server_url = "http://localhost:8000/v1"

# create a LangChain ChatOpenAI instance pointed at the vLLM server
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=5,
    temperature=0,
)
from ragas.embeddings import HuggingfaceEmbeddings
hf_embeddings = HuggingfaceEmbeddings(model_name="BAAI/bge-small-en")
# My dataset looks like this
data = {
    'question': ['What did the president say about gun violence?'],
    'answer': ['I do not have access to the most recent statements made by the president. please provide me with the specific date or context of the statement you are referring to.'],
    'contexts': [['The president asked Congress to pass proven measures to reduce gun violence.']],
    'ground_truth': ['The president asked Congress to pass proven measures to reduce gun violence.']
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate

result = evaluate(
    dataset=dataset,
    llm=chat,
    embeddings=hf_embeddings,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

df = result.to_pandas()
Error trace
Invalid response format. Expected a list of dictionaries with keys 'verdict'
Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.8/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])
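For anyone reproducing this, a quick first check (my own debugging sketch, reusing the chat and df objects from the code above; it is not part of the original report) is to count the NaN scores per metric and to probe the judge model for JSON directly. Note that the repro sets max_tokens=5, which by itself cuts the completion off long before a full JSON object can be emitted:

# Count NaN scores per metric column.
print(df.isna().sum())

# Probe the judge model directly; with max_tokens=5 the reply is truncated
# after five tokens, so it can never contain the expected JSON.
reply = chat.invoke('Return a JSON object of the form {"verdict": "1", "reason": "..."}')
print(reply.content)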
I've got the same issue with Mixtral 8x7B.
Sometimes the LLM will generate
classification: [{"statement_1": "England won the 2022 ICC Men's T20 World Cup.", "reason": "From context it is clear that England defeated Pakistan to win the World Cup.", "Attributed": "1"}]
instead of the expected
[{"statement_1": "England won the 2022 ICC Men's T20 World Cup.", "reason": "From context it is clear that England defeated Pakistan to win the World Cup.", "Attributed": "1"}]
or
"verification": { "reason": "the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.", "verdict": "0", }
instead of
{"reason": "the provided context discusses the Andes mountain range, which, while impressive, does not include Mount Everest or directly relate to the question about the world's tallest mountain.", "verdict": "0"}
Hey team,
Just checking in because I'm a bit stuck. Not sure if I'm missing something obvious or if it's something you guys are already fixing up. Would appreciate any pointers or info you could share.
adding @shahules786 here for his expertise
This does seem to be something on our side, in the sense that the model is not following the JSON format our prompts expect. It is a blocker, though.
I've tried mistralai/Mixtral-8x7B-Instruct-v0.1 but I'm still encountering the same issue. Although Mixtral is comparable to ChatGPT 3.5, the problem persists.
@ajaychinni I've "fixed" the issues on a fork for myself: https://github.com/explodinggradients/ragas/pull/633
This should fix your "Attributed" and "verdict" issues
@stepoibm Thanks for the update. I just checked your fork but I am still getting the error, and this time I checked only for "context_precision":
Invalid response format. Expected a list of dictionaries with keys 'verdict'. Responses [{}]
I am using TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ with max-model-len 4096
Same problem with models like gpt2 and google/flan-t5-large from Hugging Face. My code is:

from langchain_community.llms import HuggingFaceHub

model = HuggingFaceHub(repo_id="google/flan-t5-large", huggingfacehub_api_token="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")
print(evaluate(dataset, llm=model))

I am getting:

/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
  value = np.nanmean(self.scores[cn])
{'answer_relevancy': nan, 'context_precision': nan, 'faithfulness': nan, 'context_recall': nan}

The same dataset works with OpenAI.
With other models like gpt2 on HF I am getting:
Evaluating: 0%| | 0/4 [00:00<?, ?it/s]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 75, in run
results = self.loop.run_until_complete(self._aresults())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 63, in _aresults
raise e
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 58, in _aresults
r = await future
^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
return f.result() # May raise f.exception().
^^^^^^^^^^
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/executor.py", line 91, in wrapped_callable_async
return counter, await callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/base.py", line 91, in ascore
raise e
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/base.py", line 87, in ascore
score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/yvirin/ragas_env/lib/python3.11/site-packages/ragas/metrics/_faithfulness.py", line 190, in _ascore
assert isinstance(statements, dict), "Invalid JSON response"
AssertionError: Invalid JSON response
Traceback (most recent call last):
File "/usr/local/google/home/yvirin/rag/evaluation/test_ragas.py", line 60, in
Is there someone who was able to run RAGAS without openai api key successfully? If so, could you please share your setup?
Hey, do take a look at the docs here: Bring Your Own LLMs and Embeddings | Ragas.
That should help you out, @yan-virin. Or are you facing the JSON issue?
@jjmachan While using other models should technically work, in practice it just does not. I am wondering whether anyone in the community is actually using another model. The JSON issue is more profound than you think: other LLMs do not respond well to your prompts, and the code in the metrics needs to be adapted to each LLM separately.
@yan-virin please do update us if you manage to run other models successfully. I have been trying Llama 2 via ChatOllama (part of LangChain) as the LLM and OllamaEmbeddings(model="llama2") as the embedding model, and only the context_recall metric seems to be working; the other metrics give nan as output.
We are using Mixtral.
The issue is that emitting the expected JSON perfectly is super fragile for other models. That's not a RAGAS issue, but a model issue.
Llama didn't work for us; only Mixtral 8x7B seems to have limited success.
For RAGAS with Mixtral we had to fix the issues in the prompts, such as context_recall expecting the capitalized JSON key "Attributed" (which is non-standard) or giving a "classification" array OR dictionary in the few-shot examples.
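Along the same lines, a case-insensitive key fix-up can paper over the capitalization problem until the prompts themselves change. The sketch below is my own (the expected spellings are taken from the error messages earlier in this thread; nothing here is ragas API):

def normalize_keys(obj, expected_keys=("Attributed", "verdict", "reason")):
    """Recursively remap keys, case-insensitively, onto the spellings the metrics check for."""
    lookup = {key.lower(): key for key in expected_keys}
    if isinstance(obj, list):
        return [normalize_keys(item, expected_keys) for item in obj]
    if isinstance(obj, dict):
        return {lookup.get(key.lower(), key): normalize_keys(value, expected_keys)
                for key, value in obj.items()}
    return obj

With this, a model reply like [{"statement_1": "...", "reason": "...", "attributed": "1"}] is mapped back to the non-standard capitalized "Attributed" that context_recall looks for.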
Same issue with Bedrock models, e.g. "anthropic.claude-v2"
I am also facing a similar issue. Any idea how to use an open LLM from Hugging Face for ragas evaluation?
This seems to be a recurring issue. Any fix for using custom LLMs? Most of them give only nan as output.