
RAGAS with huggingface models

Open SalvatoreRa opened this issue 1 year ago • 8 comments

Describe the bug

I tried using RAGAS with a model other than OpenAI's. In general, whatever model I use, I get this error back:

File /opt/conda/lib/python3.10/site-packages/ragas/evaluation.py:237, in evaluate(dataset, metrics, llm, embeddings, callbacks, in_ci, is_async, run_config, raise_exceptions, column_map)
    235 results = executor.results()
    236 if results == []:
--> 237     raise ExceptionInRunner()
    239 # convert results to dataset_like
    240 for i, _ in enumerate(dataset):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.

/opt/conda/lib/python3.10/site-packages/ipykernel/iostream.py:123: RuntimeWarning: coroutine 'as_completed.<locals>.sema_coro' was never awaited
  await self._event_pipe_gc()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

I worked around that with:

import nest_asyncio
nest_asyncio.apply()

However, it no longer raises an error; instead it returns: {'faithfulness': nan, 'answer_relevancy': nan, 'context_utilization': nan}
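
As a general debugging suggestion (not something from this thread), re-running with raise_exceptions=True, using the same dataset and evaluator as in the reproduction below, should surface the underlying generation or parsing error instead of silently collapsing it into nan scores:

# Sketch: surface the underlying error instead of silent nan scores
result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    raise_exceptions=True,   # let the real exception propagate
    metrics=[faithfulness],  # start with a single metric to isolate the failure
)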

Code to Reproduce

import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from langchain import HuggingFacePipeline
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization
)
from ragas import evaluate
from datasets import Dataset

import nest_asyncio
nest_asyncio.apply()

# embedding model
embedding_model = SentenceTransformer("microsoft/mpnet-base")

# evaluator
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

device = 0  # Use GPU (0 is typically the first GPU device)

pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    device=device,          # run the pipeline on the first GPU
    temperature=0.1,
    do_sample=True,
    max_new_tokens=200,
    repetition_penalty=1.1  # without this the output begins repeating
)

evaluator = HuggingFacePipeline(pipeline=pipe)

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)

# ragas
result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    raise_exceptions=False,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_utilization,
    ]
)

print(result)

Error trace
No error; it simply returns nan for every metric.

Expected behavior
It should return the evaluation metrics.

Thank you very much for your help

SalvatoreRa avatar Jul 11 '24 13:07 SalvatoreRa

Hey there, @SalvatoreRa! I'm Dosu, a bot here to assist you with any bugs, questions, or contributions. I'm here to help while you wait for a human maintainer. Let's tackle this issue together!

To resolve the issue of receiving nan for all evaluation metrics when using RAGAS with a HuggingFace model, ensure that the dataset is correctly formatted and all required columns are present and correctly named. The dataset should have the following columns: question, ground_truth, answer, and contexts. Additionally, make sure that the column names match the expected names or provide a column_map to map your dataset's column names to the expected ones.

Here is an example of how to use the evaluate function with a properly formatted dataset:

from ragas import evaluate
from datasets import Dataset

# Example dataset
data = {
    'question': ["What is the capital of France?", "Who wrote '1984'?"],
    'ground_truth': ["Paris", "George Orwell"],  # one string per row
    'answer': ["Paris", "George Orwell"],
    'contexts': [["Paris is the capital of France."], ["George Orwell wrote '1984'."]]
}

dataset = Dataset.from_dict(data)

# Evaluate the dataset
result = evaluate(dataset)
print(result)

Ensure that the dataset columns are correctly named or use the column_map parameter to map your dataset's column names to the expected ones:

column_map = {
    'question': 'your_question_column',
    'ground_truth': 'your_ground_truth_column',
    'answer': 'your_answer_column',
    'contexts': 'your_contexts_column'
}

result = evaluate(dataset, column_map=column_map)
print(result)

If the dataset is correctly formatted and the issue persists, check if the metrics, LLM, and embeddings are properly initialized and not None [1].
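
For example, with a local HuggingFace pipeline, one likely cause of nan scores (an assumption based on the code above, not something confirmed here) is passing objects ragas cannot use directly, such as a raw SentenceTransformer. A minimal sketch that wraps both the LLM and the embeddings in the LangChain/ragas wrappers might look like this:

from langchain_community.llms import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Use a LangChain-compatible embeddings object instead of a raw SentenceTransformer
embeddings = LangchainEmbeddingsWrapper(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)

# Wrap the HuggingFace pipeline so ragas sees a LangChain-style LLM
evaluator = LangchainLLMWrapper(HuggingFacePipeline(pipeline=pipe))

result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embeddings,
    metrics=[faithfulness, answer_relevancy, context_utilization],
)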

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Jul 11 '24 14:07 dosubot[bot]

I have tested with the mapping, but the results are the same. The dataset is already in the correct format:

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")

SalvatoreRa avatar Jul 12 '24 07:07 SalvatoreRa

You have to downgrade your ragas version to 0.1.7; then it should work.

TM02 avatar Jul 22 '24 06:07 TM02

I have tried that; it looks like it takes a long time and then just goes out of memory. I tried this script in Google Colab:

!pip install datasets sentence_transformers langchain ragas==0.1.7 accelerate bitsandbytes langchain-huggingface
from datasets import load_dataset
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from langchain import HuggingFacePipeline
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization
)
from ragas import evaluate
from datasets import Dataset

import nest_asyncio
nest_asyncio.apply()

# embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",  # Automatically map layers to GPU(s)
    #attn_implementation="flash_attention_2", # if you have an ampere GPU
)

# Define the text generation pipeline using HuggingFace transformers
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500, top_k=50, temperature=0.1)

# Wrap the pipeline in a HuggingFacePipeline object for the evaluator
evaluator = HuggingFacePipeline(pipeline=pipe)

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
dataset = amnesty_qa["eval"].select(range(5))

column_map = {
    'question': 'question',
    'ground_truth': 'ground_truth',
    'answer': 'answer',
    'contexts': 'contexts'
}

# ragas
result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    raise_exceptions=False,
    column_map=column_map,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_utilization,
    ]
)

print(result)

SalvatoreRa avatar Jul 23 '24 07:07 SalvatoreRa
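
One knob that sometimes helps with a single local pipeline (an assumption, not something verified in this thread) is ragas' RunConfig: lowering max_workers avoids hitting the pipeline with many concurrent requests, and a longer timeout gives slow local generation time to finish instead of timing out into nan. A rough sketch:

from ragas import RunConfig

# Limit concurrency so the local pipeline is not called in parallel,
# and allow more time per request for slow local generation.
run_config = RunConfig(timeout=600, max_workers=1)

result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    raise_exceptions=False,
    run_config=run_config,
    metrics=[faithfulness, answer_relevancy, context_utilization],
)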

Ragas with HuggingFace models doesn't work as intended because they don't handle async the way we expect. Could you try out Ollama or vLLM?

jjmachan avatar Jul 31 '24 07:07 jjmachan
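
For reference, a minimal sketch of the Ollama route might look like the following; it assumes a locally running Ollama server with the models already pulled (e.g. ollama pull mistral and ollama pull nomic-embed-text) and the langchain-community integrations installed, and is not a confirmed working example from this thread:

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# Assumes `ollama serve` is running locally and the models have been pulled
evaluator = ChatOllama(model="mistral", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embeddings,
    metrics=[faithfulness, answer_relevancy, context_utilization],
)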

Ragas with HuggingFace models doesn't work as intended because they don't handle async the way we expect. Could you try out Ollama or vLLM?

I actually tried this before, but it did not work: https://github.com/mosh98/RAG_With_Models/blob/main/evaluation/RAGAS%20DEMO.ipynb

Do you happen to have a code snippet?

SalvatoreRa avatar Jul 31 '24 09:07 SalvatoreRa

@SalvatoreRa in that example it seems to be working, right? Or am I missing something?

jjmachan avatar Aug 06 '24 06:08 jjmachan

The notebook does not work, and it does not mention any package versions. Can we at least have a requirements.txt file for the examples?

arjunsridhar9720 avatar Nov 11 '25 20:11 arjunsridhar9720