Error "Runner in Executor raised an exception" in Ragas evaluate using Ollama, producing NaN values in the DataFrame
I encountered an issue while evaluating a dataset with the ragas library using a Langchain LLM (Ollama) and Sentence Transformer embeddings: the process throws an exception during execution. Steps to Reproduce:
# Reading the DataFrame:
import pandas as pd
df = pd.read_csv("test_results_csv/test.csv")
df["contexts"] = df["contexts"].apply(lambda x: [x])
from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset[0]
Output:
{
'question': 'What are the major sources of carbohydrates in the traditional Hawaiian diet?',
'ground_truth': "The traditional Hawaiian diet was rich in carbohydrate sources primarily derived from 'uala (sweet potato), ulu (breadfruit), and kalo (taro). These foods were not only staple items but also provided a significant portion of the daily caloric intake. Sweet potatoes, breadfruit, and taro were cultivated extensively and formed the backbone of the Hawaiian nutritional intake, ensuring that the population had a steady and reliable source of energy. The high carbohydrate content of these foods supported the physical demands of daily activities and agricultural work.",
'answer': ' The majority of the diet was made up of these fiber rich carbohydrate foods.',
'contexts': ["['• Describe the different types of simple and complex carbohydrates • Describe the process of carbohydrate digestion and absorption • Describe the functions of carbohydrates in the body • Describe the body’s carbohydrate needs and how personal choices can lead to health benefits or consequences Throughout history, carbohydrates have and continue to be a major source of people’s diets worldwide. In ancient Hawai‘i the Hawaiians obtained the majority of their calories from carbohydrate rich plants like the ‘uala (sweet potato), ulu (breadfruit) and kalo (taro). For example, mashed kalo or poi was a staple to meals for Hawaiians. Research suggests that almost 78 percent of the diet was made up of these fiber rich carbohydrate foods.1 Carbohydrates are the perfect nutrient to meet your body’s nutritional needs. They nourish your brain and nervous system, provide energy to all of your cells when within proper caloric limits, and help keep your body fit and lean.', 'body fit and lean. Specifically, digestible carbohydrates provide bulk in foods, vitamins, and minerals, while 1.\\xa0Fujita R, Braun KL, Hughes CK. (2004). The traditional Hawaiian diet: a review of the literature. Pacific Health Dialogue, 11(2). http:/ [/pacifichealthdialog.org.fj/](https://file+.vscode-resource.vscode-cdn.net/pacifichealthdialog.org.fj/) Volume2011/no2/ PHD1120220p2162022120Yamada20orig.pdf. Accessed October 19, 2017. 230 | Introduction']"]
}
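Note: the sample above shows that the contexts column in the CSV already stores a stringified Python list, so the lambda wrapper above puts that whole string into a one-element list rather than producing a list of passages. If that is the case for your data (an assumption based only on the printed sample), ast.literal_eval can recover a real list of strings; a sketch under that assumption:

import ast
import pandas as pd

df = pd.read_csv("test_results_csv/test.csv")

# Hypothetical alternative to the lambda wrapper: if a cell holds a stringified
# Python list, parse it back into a real list[str]; otherwise wrap it in a list.
def to_context_list(value):
    try:
        parsed = ast.literal_eval(value)
        return parsed if isinstance(parsed, list) else [value]
    except (ValueError, SyntaxError):
        return [value]

df["contexts"] = df["contexts"].apply(to_context_list)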
Setting Up the Environment:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.llms import Ollama
llm = Ollama(model="mistrallite_Q2_K:latest", temperature=0)
embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')
langchain_llm = LangchainLLMWrapper(llm)
langchain_embeddings = LangchainEmbeddingsWrapper(embeddings)
import nest_asyncio
nest_asyncio.apply()
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,
    llm=langchain_llm,
    embeddings=langchain_embeddings,
)
res_df = result.to_pandas()
Error
Runner in Executor raised an exception
Traceback (most recent call last):
File "/Users/xxx/miniconda3/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/ragas/metrics/_context_recall.py", line 169, in _ascore
results = await self.llm.generate(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/ragas/llms/base.py", line 93, in generate
return await agenerate_text_with_retry(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
return await copy(fn, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
do = await self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
result = await action(retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
return call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/xxx/miniconda3/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
Hey there, @divrajput! I'm here to help you with the issue you've encountered. Let's work together to solve this bug and get things back on track!
To address the exception and the resulting NaN values in the DataFrame when evaluating a dataset with the ragas library using the Langchain LLM and Sentence Transformer embeddings, set the raise_exceptions parameter to False in the evaluate function. The evaluation will then return np.nan for any row that fails instead of raising an exception.
Here is the relevant part of the code where you can set this parameter:
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,  # Set this to False to handle exceptions gracefully
    llm=langchain_llm,
    embeddings=langchain_embeddings,
)
By setting raise_exceptions to False, the evaluation will handle exceptions gracefully and return np.nan for any rows that encounter issues [1].
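With raise_exceptions=False, failures show up as NaN in the result instead of stopping the run, so it is worth checking how many rows actually failed after evaluation. A small sketch using the to_pandas() output shown earlier (the column names are assumed to match the metric names):

res_df = result.to_pandas()

# Rows where at least one metric came back as NaN, i.e. that metric failed for the sample
metric_cols = ["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
failed = res_df[res_df[metric_cols].isna().any(axis=1)]
print(f"{len(failed)} of {len(res_df)} rows have at least one NaN metric")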
Additionally, ensure that your RunConfig is properly set up to handle retries and timeouts, which can help mitigate transient errors:
from ragas.run_config import RunConfig

run_config = RunConfig(
    max_retries=10,
    max_wait=60,
    thread_timeout=300,
    log_tenacity=True,
)
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,
    llm=langchain_llm,
    embeddings=langchain_embeddings,
    run_config=run_config,  # Pass the run_config here
)
This configuration will help manage retries and timeouts more effectively [2][1].
While evaluating my dataset, I am encountering NaN values for the metrics faithfulness and answer_relevancy. However, I am successfully getting values for context_precision and context_recall. I have increased the run configuration values to ensure the process has sufficient retries and wait times:
run_config = RunConfig(
    max_retries=15,
    max_wait=120,
    thread_timeout=500,
    log_tenacity=True,
)
Output:
| question | ground_truth | answer | contexts | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|
| What are the major sources of carbohydrates in... | The traditional Hawaiian diet was rich in carb... | The majority of the diet was made up of these... | [['• Describe the different types of simple a... | 1.0 | 1.0 | NaN | 0.552834 |
| Why are carbohydrates considered essential for... | Carbohydrates are essential for the human body... | Carbohydrates provide a ready source of energ... | [['Protein Necessary for tissue formation, ce... | 1.0 | 1.0 | NaN | NaN |
Request: Please advise on how to resolve the NaN values for the faithfulness and answer_relevancy metrics.
The NaN values for the faithfulness and answer_relevancy metrics are likely due to issues in the statement generation or parsing process. Specifically, the _create_statements_prompt and _create_nli_prompt methods, or the LLM's response to these prompts, might be causing the problem.
Here are a few steps to debug and resolve the issue:
- Check the LLM's Response: Ensure that the LLM is generating valid responses for the prompts created in _create_statements_prompt and _create_nli_prompt.
- Verify Statement Parsing: Ensure that the statements generated by the LLM are correctly parsed by _statements_output_parser and _faithfulness_output_parser.
- Inspect the Prompts: Verify that the prompts created in _create_statements_prompt and _create_nli_prompt are correctly formatted and contain the necessary information.
Here is a modified version of the _ascore method with additional logging to help debug the issue:
async def _ascore(
    self: t.Self, row: t.Dict, callbacks: Callbacks, is_async: bool
) -> float:
    """
    returns the NLI score for each (q, c, a) pair
    """
    assert self.llm is not None, "LLM is not set"

    # Create statements prompt
    p_value = self._create_statements_prompt(row)
    statements = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
    )
    statements = await _statements_output_parser.aparse(
        statements.generations[0][0].text, p_value, self.llm, self.max_retries
    )

    if statements is None:
        logger.warning("No statements generated by the LLM.")
        return np.nan

    statements = [item["simpler_statements"] for item in statements.dicts()]
    statements = [item for sublist in statements for item in sublist]

    assert isinstance(statements, t.List), "statements must be a list"

    # Create NLI prompt
    p_value = self._create_nli_prompt(row, statements)
    nli_result = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
        n=self._reproducibility,
    )

    nli_result_text = [
        nli_result.generations[0][i].text for i in range(self._reproducibility)
    ]
    faithfulness_list = [
        await _faithfulness_output_parser.aparse(
            text, p_value, self.llm, self.max_retries
        )
        for text in nli_result_text
    ]

    faithfulness_list = [
        faith.dicts() for faith in faithfulness_list if faith is not None
    ]

    if faithfulness_list:
        faithfulness_list = ensembler.from_discrete(
            faithfulness_list,
            "verdict",
        )
        faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
            faithfulness_list
        )
    else:
        logger.warning("No faithfulness results generated by the LLM.")
        return np.nan

    return self._compute_score(faithfulness_list)
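For the logger.warning calls above to be visible, Python logging needs to be configured in the evaluation script. A minimal sketch (the "ragas" logger name is an assumption based on the module paths in the traceback):

import logging

# Emit log records from the ragas metrics so the warnings above actually show up
logging.basicConfig(level=logging.INFO)
logging.getLogger("ragas").setLevel(logging.DEBUG)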
Additionally, for the answer_relevancy metric, ensure that the LLM is correctly generating questions and identifying noncommittal answers. Here is the relevant code snippet for answer_relevancy:
@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3

    def calculate_similarity(
        self: t.Self, question: str, generated_questions: list[str]
    ):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(
            self.embeddings.embed_documents(generated_questions)
        ).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
            question_vec, axis=1
        )
        return (
            np.dot(gen_question_vec, question_vec.T).reshape(
                -1,
            )
            / norm
        )

    def _calculate_score(
        self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict
    ) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning(
                "Invalid JSON response. Expected dictionary with key 'question'"
            )
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
        assert self.llm is not None, "LLM is not set"
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(
            prompt,
            n=self.strictness,
            callbacks=callbacks,
            is_async=is_async,
        )

        answers = [
            await _output_parser.aparse(result.text, prompt, self.llm)
            for result in result.generations[0]
        ]
        if any(answer is None for answer in answers):
            return np.nan

        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
This code includes additional logging to help identify where the process might be failing. If the LLM is not generating valid statements or NLI results, the logs will indicate this, allowing you to further investigate the issue [1][2].
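One way to narrow this down (a sketch, not something from the thread) is to re-run the evaluation on a small slice of the dataset with a single metric and raise_exceptions=True, so the underlying exception surfaces instead of being converted to NaN:

from ragas import evaluate
from ragas.metrics import faithfulness

# Evaluate only faithfulness on a couple of rows to see the actual failure
small_ds = dataset.select(range(2))
debug_result = evaluate(
    dataset=small_ds,
    metrics=[faithfulness],
    llm=langchain_llm,
    embeddings=langchain_embeddings,
    raise_exceptions=True,  # surface the exception instead of returning NaN
)
print(debug_result)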
@divrajput it is most likely due to the JSON-following capabilities of the model. Do you have any tracing tools set up?
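A quick way to test that outside of ragas (a sketch; the prompt is only illustrative) is to ask the model directly for JSON and see whether the output parses:

import json
from langchain_community.llms import Ollama

# Check whether the heavily quantized model can return parseable JSON at all
llm = Ollama(model="mistrallite_Q2_K:latest", temperature=0)
raw = llm.invoke(
    'Return ONLY a JSON object of the form {"statements": ["..."]} '
    "listing the factual claims in this sentence: Poi is made from taro."
)
try:
    print(json.loads(raw))
except json.JSONDecodeError:
    print("Model did not return valid JSON:\n", raw)

If the raw output is not valid JSON, trying a larger or less aggressively quantized model that follows structured-output instructions more reliably is a reasonable next step.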