Invalid response/JSON format for Evaluation
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug When I try to measure performance with the Ragas evaluator using an embedding model (BAAI/bge-large-en) and an LLM (zephyr-7b-alpha) added via Hugging Face, I get the invalid response/verdict errors below. The interesting thing is that these 3 metrics were working 16-17 hours ago with version 0.1.4.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
WARNING:ragas.metrics._context_precision:Invalid response format. Expected a list of dictionaries with keys 'verdict'
WARNING:ragas.metrics._answer_relevance:Invalid JSON response. Expected dictionary with key 'question'
WARNING:ragas.metrics._context_precision:Invalid response format. Expected a list of dictionaries with keys 'verdict'
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
WARNING:ragas.metrics._answer_relevance:Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.10/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
value = np.nanmean(self.scores[cn])
{'context_precision': nan, 'answer_relevancy': nan, 'context_recall': nan}
Ragas version: 0.1.5
Python version: 3.10.12
Code to Reproduce
My evaluate function call:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

result_2 = evaluate(
    amnesty_qa2,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        context_precision,
        # faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
result_2
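As a side note, per-sample scores make it easier to see which rows failed to parse (a minimal sketch, assuming the result object exposes to_pandas() as in the Ragas docs):
# NaN in a metric column marks the rows whose LLM output could not be parsed
df = result_2.to_pandas()
print(df[["question", "context_precision", "answer_relevancy", "context_recall"]])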
Embedding:
from langchain_community.embeddings import HuggingFaceEmbeddings

device = "cuda"
model_kwargs = {'device': device}
encode_kwargs = {'normalize_embeddings': True}
model_name = "BAAI/bge-large-en"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    multi_process=True,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
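A quick sanity check of the embedding side (a minimal sketch; embed_query is the standard LangChain embeddings method):
# bge-large-en produces 1024-dimensional vectors; a different length would point to a setup issue
vec = embeddings.embed_query("What are human rights?")
print(len(vec))  # expected: 1024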
LLM initialization:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_name = "HuggingFaceH4/zephyr-7b-alpha"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.01,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=512,
    batch_size=2,
)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
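For completeness, a quick check that the wrapped pipeline generates at all (a minimal sketch; invoke is the standard LangChain Runnable call in 0.1.x):
# Sanity check: the wrapped pipeline should return a plain string completion
print(llm.invoke("Briefly explain what a vector embedding is."))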
Error trace
Expected behavior I expected the evaluation to run without any response format errors. It worked as desired once before.
Additional context This is what my dataset looks like:
from datasets import load_dataset

amnesty_qa2 = load_dataset("explodinggradients/amnesty_qa", "english_v2", split='eval[:10%]')
Dataset({
features: ['question', 'ground_truth', 'answer', 'contexts'],
num_rows: 2
})
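For reference, a quick way to confirm the split really has the four columns the metrics read from (question, answer, contexts, ground_truth) is to print one row (minimal sketch):
# Each row should have string question/answer/ground_truth fields and a list of context strings
sample = amnesty_qa2[0]
print(sample["question"])
print(type(sample["contexts"]), len(sample["contexts"]))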
Langchain Version: 0.1.13
Hey @robuno I am not sure if the Zephyr model is good enough to produce valid JSON outputs.
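One way to probe that (a minimal sketch, not an official Ragas check; the prompt text is illustrative) is to ask the model directly for JSON and try to parse the raw completion. Note that with return_full_text=True the echoed prompt alone may break parsing:
import json

# Illustrative probe: request a bare JSON object and attempt to parse what comes back
raw = llm.invoke('Reply with only a JSON object of the form {"question": "..."} and nothing else.')
try:
    print(json.loads(raw))
except json.JSONDecodeError:
    print("Not valid JSON:", raw[:200])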
Hey @shahules786, thank you for your response. I chose the Zephyr model since it is shown in the Ragas docs and mentioned as working in one issue.
I also checked yesterday that zephyr-7b-alpha works well:
Today, I tried a few other models ("google/gemma-2b" and "HuggingFaceH4/zephyr-7b-beta"), but they gave this JSON format error as well. I'm actually interested in finding a way to use small Hugging Face models in the evaluation process, but could not find suitable models for my purpose.
Thank you for your interest; I'm open to ideas.
Same issue. Any updates on this bug?