Invalid response/JSON format for Evaluation
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug When I try to measure performance with the Ragas evaluator using an embedding model (BAAI/bge-large-en) and an LLM (zephyr-7b-alpha) added via Hugging Face, I get the invalid response/verdict errors below. The interesting thing is that these 3 metrics were working 16-17 hours ago with version 0.1.4.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
WARNING:ragas.metrics._context_precision:Invalid response format. Expected a list of dictionaries with keys 'verdict'
WARNING:ragas.metrics._answer_relevance:Invalid JSON response. Expected dictionary with key 'question'
WARNING:ragas.metrics._context_precision:Invalid response format. Expected a list of dictionaries with keys 'verdict'
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
WARNING:ragas.metrics._answer_relevance:Invalid JSON response. Expected dictionary with key 'question'
/usr/local/lib/python3.10/dist-packages/ragas/evaluation.py:276: RuntimeWarning: Mean of empty slice
value = np.nanmean(self.scores[cn])
{'context_precision': nan, 'answer_relevancy': nan, 'context_recall': nan}
Ragas version: 0.1.5
Python version: 3.10.12
Code to Reproduce
My evaluate function call:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

result_2 = evaluate(
    amnesty_qa2,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        context_precision,
        # faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
result_2
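As a side note, per-sample scores make it easier to see which rows failed to parse (a minimal sketch, assuming the result object exposes to_pandas() as in the Ragas docs):
# NaN in a metric column marks the rows whose LLM output could not be parsed
df = result_2.to_pandas()
print(df[["question", "context_precision", "answer_relevancy", "context_recall"]])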
Embedding:
from langchain_community.embeddings import HuggingFaceEmbeddings

device = "cuda"
model_kwargs = {'device': device}
encode_kwargs = {'normalize_embeddings': True}
model_name = "BAAI/bge-large-en"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    multi_process=True,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
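A quick sanity check of the embedding side (a minimal sketch; embed_query is the standard LangChain embeddings method):
# bge-large-en produces 1024-dimensional vectors; a different length would point to a setup issue
vec = embeddings.embed_query("What are human rights?")
print(len(vec))  # expected: 1024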
LLM initialization:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_name = "HuggingFaceH4/zephyr-7b-alpha"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.01,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=512,
    batch_size=2,
)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
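For completeness, a quick check that the wrapped pipeline generates at all (a minimal sketch; invoke is the standard LangChain Runnable call in 0.1.x):
# Sanity check: the wrapped pipeline should return a plain string completion
print(llm.invoke("Briefly explain what a vector embedding is."))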
Error trace
Expected behavior I expected the evaluation to run without any response format errors. It worked as desired once before.
Additional context This is what my dataset looks like:
from datasets import load_dataset

amnesty_qa2 = load_dataset("explodinggradients/amnesty_qa", "english_v2", split='eval[:10%]')
Dataset({
features: ['question', 'ground_truth', 'answer', 'contexts'],
num_rows: 2
})
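For reference, a quick way to confirm the split really has the four columns the metrics read from (question, answer, contexts, ground_truth) is to print one row (minimal sketch):
# Each row should have string question/answer/ground_truth fields and a list of context strings
sample = amnesty_qa2[0]
print(sample["question"])
print(type(sample["contexts"]), len(sample["contexts"]))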
Langchain Version: 0.1.13
Hey @robuno I am not sure if the Zephyr model is good enough to produce valid JSON outputs.
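One way to probe that (a minimal sketch, not an official Ragas check; the prompt text is illustrative) is to ask the model directly for JSON and try to parse the raw completion. Note that with return_full_text=True the echoed prompt alone may break parsing:
import json

# Illustrative probe: request a bare JSON object and attempt to parse what comes back
raw = llm.invoke('Reply with only a JSON object of the form {"question": "..."} and nothing else.')
try:
    print(json.loads(raw))
except json.JSONDecodeError:
    print("Not valid JSON:", raw[:200])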
Hey @shahules786, thank you for your response. I chose the Zephyr model since it is shown in the Ragas docs and mentioned as working in one issue.
I also checked yesterday that zephyr-7b-alpha works well:
Today, I tried a few other models ("google/gemma-2b" and "HuggingFaceH4/zephyr-7b-beta"), but they gave this JSON format error as well. I'm actually interested in finding a way to use small Hugging Face models in the evaluation process, but could not find suitable models for my purpose.
Thank you for your interest; I'm open to ideas.
Same issue. Any updates on this bug?