Faithfulness always returns NaN with a local LLM
[x] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
Faithfulness is not working: it always returns NaN, and sometimes context precision does as well.
Python version: 3.11.9
Ragas version: 0.1.15
Code to Reproduce
data_samples_2 = {
    'question': ["What are the dimensions of twist?"],
    'answer': ["The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."],
    'contexts': [
        ["Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
         "The PCB shall be placed on a horizontal surface so that it rests on three corners.",
         "The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
         "The length of the diagonal of the PCB shall be measured.",
         "The twist shall be expressed in percentage terms.",
         "Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
         "The maximum twist shall be ≤ 1,5 %.",
         "The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."]
    ],
    'ground_truth': ["""Based on the information provided in the document, the key dimensions related to twist in PCBs are:
- The twist is measured as the distance between the horizontal surface and the fourth corner of the PCB when it rests on three corners.
- The length of the diagonal of the PCB is measured.
- The twist is expressed as a percentage and calculated using the following formula: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100
- The maximum allowable twist is specified as ≤ 1.5%.
- The document mentions that the procurement authority may specify a more stringent requirement for twist in the PCB definition dossier, with a typical stringent twist requirement being ≤ 0.75%. So the key dimensions are:
- The maximum distance between the fourth corner and the horizontal surface
- The length of the PCB diagonal
- The calculated percentage of twist
- The maximum allowable percentage (1.5% or potentially lower) The document also includes a diagram illustrating how twist is measured on a PCB."""]
}

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA
import os
import time
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.embeddings import HuggingFaceEmbeddings
from typing import Any, List, Optional

os.environ["CUDA_VISIBLE_DEVICES"] = "6"

# Initialize Ollama LLM
llm = Ollama(model="llama3.1:latest")

# Initialize Sentence Transformers embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # set True to compute cosine similarity
)

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

llm = LangchainLLMWrapper(llm)
embedding_model = LangchainEmbeddingsWrapper(embedding_model)

from datasets import Dataset
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate

dataset = Dataset.from_dict(data_samples_2)

score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall], llm=llm, embeddings=embedding_model)
score.to_pandas()
print(score)
Error trace

import sys
import ragas
print(f"Python version: {sys.version}")
# Ragas version
print(f"Ragas version: {ragas.__version__}")

Python version: 3.11.9 (main, Apr 24 2024, 09:31:52) [GCC 14.0.1 20240411 (Red Hat 14.0.1-0)]
Ragas version: 0.1.15
/home/user/.pyenv/versions/3.11.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Evaluating: 100% 4/4 [01:29<00:00, 20.72s/it]
Failed to parse output. Returning None.
{'faithfulness': nan, 'answer_relevancy': 0.8362, 'context_precision': 1.0000, 'context_recall': 0.4000}
Expected behavior
I want faithfulness to return a value instead of NaN.
Hey, @cpolcino! I'm here to help you out with any bugs or questions you have. I can also assist you in becoming a contributor!
The faithfulness metric returning NaN is likely due to the _compute_score method setting the score to NaN when no statements are generated from the answer. This can happen if the statements variable is None or empty after parsing the LLM's output. The relevant part of the code is:
if statements is None:
    return np.nan
To ensure that the faithfulness metric does not return NaN, you need to verify that the LLM is correctly generating statements and that these statements are being parsed successfully [1].
Additionally, there could be issues with the embeddings provided by the LangchainEmbeddingsWrapper. If the embeddings do not accurately represent the semantic content of the text, it could lead to incorrect evaluations of faithfulness [2].
Here are a few steps you can take to troubleshoot and potentially resolve the issue:
- Verify LLM Output: Ensure that the LLM is generating the expected output. You can print or log the output to check if it contains the necessary statements.
- Check Embeddings: Ensure that the embeddings are accurate and correctly represent the semantic content of the text. You might want to test with different embedding models to see if the issue persists.
- Increase Debugging Information: Add more logging or print statements in the _compute_score method to understand why statements might be None.
By following these steps, you should be able to identify and resolve the issue causing the faithfulness metric to return NaN.
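As a concrete starting point for the first step, here is a minimal sketch (not part of the original thread; the prompt wording is only illustrative) that sends a statement-extraction style request straight to the local Ollama model and prints the raw text. If the printed output is not valid JSON, the ragas statement parser will fail in the same way and the metric will fall back to NaN.

from langchain.llms import Ollama

# Probe the local model directly and inspect the raw text it returns.
probe_llm = Ollama(model="llama3.1:latest")

probe_prompt = (
    "Break the following answer into simple standalone statements and return them "
    "as a JSON array of strings, with no additional text.\n\n"
    "Answer: The limits for the mechanical characteristics of twist shall be "
    "≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."
)

# If this is not parseable JSON, the faithfulness parser will also fail.
print(probe_llm.invoke(probe_prompt))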
To continue talking to Dosu, mention @dosu.
@dosu could you give an example of the expected JSON file?
Here is an example of what I am using:
data_samples_2 = {
    'question': ["What are the dimensions of twist?"],
    'answer': ["The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."],
    'contexts': [
        ["Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
         "The PCB shall be placed on a horizontal surface so that it rests on three corners.",
         "The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
         "The length of the diagonal of the PCB shall be measured.",
         "The twist shall be expressed in percentage terms.",
         "Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
         "The maximum twist shall be ≤ 1,5 %.",
         "The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."]
    ],
    'ground_truth': ["""Based on the information provided in the document, the key dimensions related to twist in PCBs are:
The twist is measured as the distance between the horizontal surface and the fourth corner of the PCB when it rests on three corners.
The length of the diagonal of the PCB is measured.
The twist is expressed as a percentage and calculated using the following formula:
Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100
The maximum allowable twist is specified as ≤ 1.5%.
The document mentions that the procurement authority may specify a more stringent requirement for twist in the PCB definition dossier, with a typical stringent twist requirement being ≤ 0.75%.
So the key dimensions are:
The maximum distance between the fourth corner and the horizontal surface
The length of the PCB diagonal
The calculated percentage of twist
The maximum allowable percentage (1.5% or potentially lower)
The document also includes a diagram illustrating how twist is measured on a PCB."""]
}
I don't understand why it is not able to parse the answer and contexts.
The expected JSON file format for the data_samples_2 object to ensure the faithfulness metric can parse the answer and contexts correctly is as follows:
{
"response": "The answer to the question.",
"user_input": "The question to answer.",
"retrieved_contexts": [
"Context 1",
"Context 2",
"Context 3"
]
}
In your case, the data_samples_2 object should be structured like this:
{
"response": "The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm.",
"user_input": "What are the dimensions of twist?",
"retrieved_contexts": [
"Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
"The PCB shall be placed on a horizontal surface so that it rests on three corners.",
"The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
"The length of the diagonal of the PCB shall be measured.",
"The twist shall be expressed in percentage terms.",
"Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
"The maximum twist shall be ≤ 1,5 %.",
"The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."
]
}
Ensure that your data_samples_2 object follows this format to allow the faithfulness metric to parse the answer and contexts correctly [1].
To continue talking to Dosu, mention @dosu.
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
True
1
Quadro P4000
result = evaluate(
    amnesty_subset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
    llm=langchain_llm,
    embeddings=embeddings,
    raise_exceptions=False,
)
Error displaying widget: model not found
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[2]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Exception raised in Job[7]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
data_samples_2 = {
    "response": "The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm.",
    "user_input": "What are the dimensions of twist?",
    "retrieved_contexts": [
        "Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
        "The PCB shall be placed on a horizontal surface so that it rests on three corners.",
        "The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
        "The length of the diagonal of the PCB shall be measured.",
        "The twist shall be expressed in percentage terms.",
        "Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
        "The maximum twist shall be ≤ 1,5 %.",
        "The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."
    ]
}

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA
import os
import time
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.embeddings import HuggingFaceEmbeddings
from typing import Any, List, Optional

os.environ["CUDA_VISIBLE_DEVICES"] = "6"

# Initialize Ollama LLM
llm = Ollama(model="llama3.1:latest")

# Initialize Sentence Transformers embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # set True to compute cosine similarity
)

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

llm = LangchainLLMWrapper(llm)
embedding_model = LangchainEmbeddingsWrapper(embedding_model)

from datasets import Dataset
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate

dataset = Dataset.from_dict(data_samples_2)

score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall], llm=llm, embeddings=embedding_model)
score.to_pandas()
print(score)
/home/user/.pyenv/versions/3.11.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_1468528/3475142547.py in <cell line: 0>()
     36
     37
---> 38 dataset = Dataset.from_dict(data_samples_2)
     39
     40 score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall], llm=llm, embeddings=embedding_model)

~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
    966         arrow_typed_mapping[col] = data
    967         mapping = arrow_typed_mapping
--> 968         pa_table = InMemoryTable.from_pydict(mapping=mapping)
    969         if info is None:
    970             info = DatasetInfo()
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
756 datasets.table.Table
757 """
--> 758 return cls(pa.Table.from_pydict(*args, **kwargs))
759
760 @classmethod
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyarrow/table.pxi in pyarrow.lib._Tabular.from_pydict()
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_arrays()
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.validate()
~/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Column 1 named user_input expected length 140 but got length 33
With your correction I get this error.
@dosu
The ArrowInvalid: Column 1 named user_input expected length 140 but got length 33 error occurs because the columns in your data_samples_2 dictionary have mismatched lengths. To resolve this, ensure that all columns have the same length.
Here's how you can adjust your data_samples_2 object to ensure all columns have the same length:
from datasets import Dataset
data_samples_2 = {
'question': ["What are the dimensions of twist?"],
'answer': ["The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."],
'contexts': [
["Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
"The PCB shall be placed on a horizontal surface so that it rests on three corners.",
"The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
"The length of the diagonal of the PCB shall be measured.",
"The twist shall be expressed in percentage terms.",
"Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
"The maximum twist shall be ≤ 1,5 %.","The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."]
],
'ground_truth': ["""Based on the information provided in the document, the key dimensions related to twist in PCBs are:
1. The twist is measured as the distance between the horizontal surface and the fourth corner of the PCB when it rests on three corners.
2. The length of the diagonal of the PCB is measured.
3. The twist is expressed as a percentage and calculated using the following formula:
Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100
4. The maximum allowable twist is specified as ≤ 1.5%.
5. The document mentions that the procurement authority may specify a more stringent requirement for twist in the PCB definition dossier, with a typical stringent twist requirement being ≤ 0.75%.
So the key dimensions are:
- The maximum distance between the fourth corner and the horizontal surface
- The length of the PCB diagonal
- The calculated percentage of twist
- The maximum allowable percentage (1.5% or potentially lower)
The document also includes a diagram illustrating how twist is measured on a PCB."""]
}
# Ensure all columns have the same length
min_length = min(len(data_samples_2[key]) for key in data_samples_2)
for key in data_samples_2:
data_samples_2[key] = data_samples_2[key][:min_length]
# Convert to HF dataset
dataset = Dataset.from_dict(data_samples_2)
This code snippet ensures that all columns in data_samples_2 have the same length by truncating them to the minimum length found among the columns. This should resolve the ArrowInvalid error [1].
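For what it's worth, here is a minimal sketch of the shape Dataset.from_dict expects (my own illustration, not part of the answer above): the ArrowInvalid arises because a bare string is treated as a sequence of characters, so every column must be a list with one entry per sample, and contexts must be a list of lists.

from datasets import Dataset

# One evaluation row: every column is a length-1 list, and 'contexts' holds
# one inner list of context strings. Equal column lengths avoid ArrowInvalid.
single_row = {
    "question": ["What are the dimensions of twist?"],
    "answer": ["The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."],
    "contexts": [[
        "Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
        "The maximum twist shall be ≤ 1,5 %.",
    ]],
    "ground_truth": ["The maximum allowable twist is specified as ≤ 1.5%."],
}

assert len({len(v) for v in single_row.values()}) == 1  # all columns the same length
dataset = Dataset.from_dict(single_row)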
To continue talking to Dosu, mention @dosu.
@dosu
/home/user/.pyenv/versions/3.11.9/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Evaluating: 100% 4/4 [01:54<00:00, 38.62s/it]
Failed to parse output. Returning None.
{'faithfulness': nan, 'answer_relevancy': 0.8362, 'context_precision': 1.0000, 'context_recall': 0.8333}
The faithfulness metric returning NaN indicates that no statements were generated from the answer. This can happen if the statements variable is None or empty after parsing the generated statements. The relevant part of the code is:
if num_statements:
score = faithful_statements / num_statements
else:
logger.warning("No statements were generated from the answer.")
score = np.nan
To ensure that the faithfulness metric does not return NaN, you need to verify that the LLM is correctly generating statements and that these statements are being parsed successfully. The method responsible for generating statements from the LLM's output is _create_statements_prompt:
def _create_statements_prompt(self, row: t.Dict) -> PromptValue:
assert self.sentence_segmenter is not None, "sentence_segmenter is not set"
text, question = row["response"], row["user_input"]
sentences = self.sentence_segmenter.segment(text)
sentences = [
sentence for sentence in sentences if sentence.strip().endswith(".")
]
sentences = "\n".join([f"{i}:{x}" for i, x in enumerate(sentences)])
prompt_value = self.statement_prompt.format(
question=question, answer=text, sentences=sentences
)
return prompt_value
Ensure that the LLM's output is in the expected format and that the sentence_segmenter is correctly segmenting the text into sentences. The output must be in JSON format and follow the structure defined by the StatementsAnswers and StatementFaithfulnessAnswers models.
Here is an example of the required format for the LONG_FORM_ANSWER_PROMPT:
{
"question": "Who was Albert Einstein and what is he best known for?",
"answer": "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.",
"sentences": [
"0:He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time.",
"1:He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics."
],
"analysis": [
{
"sentence_index": 0,
"simpler_statements": [
"Albert Einstein was a German-born theoretical physicist.",
"Albert Einstein is recognized as one of the greatest and most influential physicists of all time."
]
},
{
"sentence_index": 1,
"simpler_statements": [
"Albert Einstein was best known for developing the theory of relativity.",
"Albert Einstein also made important contributions to the development of the theory of quantum mechanics."
]
}
]
}
Ensure that your LLM's output follows this format to allow the faithfulness metric to parse the answer and contexts correctly [1].
To continue talking to Dosu, mention @dosu.
@dosu "This can happen if the statements variable is None or empty after parsing the generated statements." Can you explain this part in more detail? I created data_samples_1 by copying the answer from my own LLM into the answer field. Is this correct?
The statements variable being None or empty after parsing the generated statements is likely due to the _statements_output_parser.aparse method returning None. This can happen if the parsing fails or the generated text does not contain the expected format.
Here is the relevant code snippet:
statements = await _statements_output_parser.aparse(
statements_gen.generations[0][0].text, p_value, self.llm, self.max_retries
)
if statements is None:
return np.nan
If _statements_output_parser.aparse returns None, the statements variable will be None, leading to the observed issue. Ensure that the generated text adheres to the expected JSON format as defined by the StatementsAnswers model. If the text does not match this format, the parser will fail, resulting in None [1][2].
Additionally, the aparse method handles errors by catching the OutputParserException. If an exception is caught and max_retries is not zero, it attempts to fix the output format using a predefined prompt (FIX_OUTPUT_FORMAT) and retries the parsing process. If max_retries is zero, it logs a warning and returns None [3].
Ensure that your LLM's output follows the expected format and that the max_retries parameter is set appropriately to handle any parsing failures.
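A minimal sketch of that last point, assuming the faithfulness metric object exposes the max_retries attribute referenced in the snippets above (worth checking against your installed version):

from ragas.metrics import faithfulness

# Allow the parser a few repair attempts before it gives up and returns None
# (which surfaces as NaN). Attribute name assumed from the snippets above.
faithfulness.max_retries = 3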
To continue talking to Dosu, mention @dosu.
@dosu "This can happen if the statements variable is None or empty after parsing the generated statements." Can you explain this part in more detail? I created data_samples_1 by copying the answer from my own LLM into the answer field. Is this correct?
The issue with the statements variable being None or empty after parsing the generated statements is likely due to the _statements_output_parser.aparse method failing to parse the LLM's output correctly. This can happen if the output does not match the expected JSON format.
Expected JSON Format for faithfulness Metric
The faithfulness metric expects the LLM's output to be a JSON object that matches the StatementFaithfulnessAnswers model. This model contains a list of StatementFaithfulnessAnswer objects, each with the following fields:
- statement: The original statement, word-by-word.
- reason: The reason for the verdict.
- verdict: The verdict (0/1) of the faithfulness.
Here is an example of the expected structure:
[
{
"statement": "John is majoring in Biology.",
"reason": "John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
"verdict": 0
},
{
"statement": "John is taking a course on Artificial Intelligence.",
"reason": "The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
"verdict": 0
},
{
"statement": "John is a dedicated student.",
"reason": "The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
"verdict": 1
},
{
"statement": "John has a part-time job.",
"reason": "There is no information given in the context about John having a part-time job.",
"verdict": 0
}
]
How to Ensure Correct Parsing
- Verify LLM Output: Ensure that the LLM's output is in the expected JSON format. You can print or log the output to check if it contains the necessary fields (statement, reason, verdict).
- Check Parsing Logic: The _statements_output_parser.aparse method attempts to parse the result string. If parsing fails, it retries up to a specified number of times (max_retries). If all retries fail, it returns None.
Here is the relevant part of the aparse method:
class RagasoutputParser(PydanticOutputParser):
async def aparse(self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 1):
try:
output = super().parse(result)
except OutputParserException:
if max_retries != 0:
p_value = FIX_OUTPUT_FORMAT.format(prompt=prompt.to_string(), completion=result)
output = await llm.generate(p_value)
result = output.generations[0][0].text
return await self.aparse(result, prompt, llm, max_retries - 1)
else:
logger.warning("Failed to parse output. Returning None.")
return None
return output
Steps to Troubleshoot
- Print LLM Output: Before parsing, print the LLM's output to ensure it matches the expected JSON format.
- Adjust LLM Output: If the output does not match, adjust the LLM's prompt or post-process the output to ensure it conforms to the expected structure (see the sketch after this list).
- Set max_retries: Ensure that max_retries is set to a reasonable value to allow for retries in case of initial parsing failures.
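One way to apply the "Adjust LLM Output" step with a local Ollama model is to ask the server to constrain the output to JSON. This is a sketch under the assumption that the langchain Ollama wrapper's format parameter is honoured by llama3.1; it is not something suggested above:

from langchain.llms import Ollama
from ragas.llms import LangchainLLMWrapper

# Assumption: format="json" asks the Ollama server to force valid JSON output,
# which makes it much less likely that the ragas output parser fails.
json_llm = LangchainLLMWrapper(Ollama(model="llama3.1:latest", format="json"))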
By following these steps, you can ensure that the statements variable is correctly populated, allowing the faithfulness metric to compute the score without returning NaN [1][2].
To continue talking to Dosu, mention @dosu.
@dosu give me the code for the troubleshooting
To troubleshoot the issue with the faithfulness metric returning NaN, you need to ensure that the statements variable is correctly populated and parsed. Here is the relevant code snippet and steps to help you debug the issue:
Code Snippet for Troubleshooting
- Ensure statements is not None:

statements = await _statements_output_parser.aparse(
    statements_gen.generations[0][0].text, p_value, self.llm, self.max_retries
)
if statements is None:
    return np.nan

- Check the faithfulness_list:

faithfulness_list = [
    await _faithfulness_output_parser.aparse(
        text, p_value, self.llm, self.max_retries
    )
    for text in nli_result_text
]
faithfulness_list = [
    faith.dicts() for faith in faithfulness_list if faith is not None
]
if faithfulness_list:
    faithfulness_list = ensembler.from_discrete(
        faithfulness_list,
        "verdict",
    )
    faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
        faithfulness_list
    )
else:
    return np.nan
return self._compute_score(faithfulness_list)
Steps to Troubleshoot
- Print LLM Output: Before parsing, print the LLM's output to ensure it matches the expected JSON format.

print(statements_gen.generations[0][0].text)

- Verify JSON Format: Ensure that the LLM's output is in the expected JSON format. The output should look like this:

[
    {
        "statement": "John is majoring in Biology.",
        "reason": "John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
        "verdict": 0
    },
    {
        "statement": "John is taking a course on Artificial Intelligence.",
        "reason": "The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
        "verdict": 0
    },
    {
        "statement": "John is a dedicated student.",
        "reason": "The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
        "verdict": 1
    },
    {
        "statement": "John has a part-time job.",
        "reason": "There is no information given in the context about John having a part-time job.",
        "verdict": 0
    }
]

- Check Parsing Logic: Ensure that the _statements_output_parser.aparse method is correctly parsing the output. If parsing fails, it retries up to a specified number of times (max_retries). If all retries fail, it returns None.

class RagasoutputParser(PydanticOutputParser):
    async def aparse(self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 1):
        try:
            output = super().parse(result)
        except OutputParserException:
            if max_retries != 0:
                p_value = FIX_OUTPUT_FORMAT.format(prompt=prompt.to_string(), completion=result)
                output = await llm.generate(p_value)
                result = output.generations[0][0].text
                return await self.aparse(result, prompt, llm, max_retries - 1)
            else:
                logger.warning("Failed to parse output. Returning None.")
                return None
        return output

- Set max_retries: Ensure that max_retries is set to a reasonable value to allow for retries in case of initial parsing failures.
By following these steps, you can ensure that the statements variable is correctly populated, allowing the faithfulness metric to compute the score without returning NaN [1][2].
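In addition, a quick way to see the real error instead of a silent NaN is to rerun the evaluation with only the failing metric and raise_exceptions=True. This sketch reuses the dataset, llm and embedding_model objects from the reproduction code above:

from ragas import evaluate
from ragas.metrics import faithfulness

# Let exceptions propagate instead of being converted to NaN, so the original
# parsing or timeout error is printed with a full traceback.
score = evaluate(
    dataset,
    metrics=[faithfulness],
    llm=llm,
    embeddings=embedding_model,
    raise_exceptions=True,
)
print(score)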
To continue talking to Dosu, mention @dosu.
@dosu
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA
import os
import time
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.embeddings import HuggingFaceEmbeddings
from typing import Any, List, Optional

os.environ["CUDA_VISIBLE_DEVICES"] = "6"

# Initialize Ollama LLM
llm = Ollama(model="llama3.1:latest")

# Initialize Sentence Transformers embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # set True to compute cosine similarity
)

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

llm = LangchainLLMWrapper(llm)
embedding_model = LangchainEmbeddingsWrapper(embedding_model)

from datasets import Dataset
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate

dataset = Dataset.from_dict(data_samples_2)

score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall], llm=llm, embeddings=embedding_model)
score.to_pandas()
print(score)

data_samples_2 = {
    'question': ["What are the dimensions of twist?"],
    'answer': ["The limits for the mechanical characteristics of twist shall be ≤ 1.5% for board thickness < 1.6 mm and ≤ 1.1% for board thickness ≥ 1.6 mm."],
    'contexts': [
        ["Twist shall be measured in conformance with the test method 2.4.22c from IPC-TM-650.",
         "The PCB shall be placed on a horizontal surface so that it rests on three corners.",
         "The distance between the horizontal surface and the fourth corner of the PCB shall be measured as specified in Figure 9-3.",
         "The length of the diagonal of the PCB shall be measured.",
         "The twist shall be expressed in percentage terms.",
         "Twist shall be calculated as follows: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100",
         "The maximum twist shall be ≤ 1,5 %.",
         "The procurement authority may specify a more stringent requirement for twist in the PCB definition dossier."]
    ],
    'ground_truth': ["""Based on the information provided in the document, the key dimensions related to twist in PCBs are:
- The twist is measured as the distance between the horizontal surface and the fourth corner of the PCB when it rests on three corners.
- The length of the diagonal of the PCB is measured.
- The twist is expressed as a percentage and calculated using the following formula: Twist [%] = max distance [mm] / (2 x length of PCB diagonal [mm]) x 100
- The maximum allowable twist is specified as ≤ 1.5%.
- The document mentions that the procurement authority may specify a more stringent requirement for twist in the PCB definition dossier, with a typical stringent twist requirement being ≤ 0.75%. So the key dimensions are:
- The maximum distance between the fourth corner and the horizontal surface
- The length of the PCB diagonal
- The calculated percentage of twist
- The maximum allowable percentage (1.5% or potentially lower) The document also includes a diagram illustrating how twist is measured on a PCB."""]
}
I can only modify these two cells; what can I do to solve the problem with faithfulness?
To solve the problem with the faithfulness metric returning NaN, you need to ensure that the statements variable is correctly populated and parsed. Here are the steps to troubleshoot and resolve the issue:
Steps to Troubleshoot
- Print LLM Output: Before parsing, print the LLM's output to ensure it matches the expected JSON format.

print(statements_gen.generations[0][0].text)

- Verify JSON Format: Ensure that the LLM's output is in the expected JSON format. The output should look like this:

[
    {
        "statement": "John is majoring in Biology.",
        "reason": "John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.",
        "verdict": 0
    },
    {
        "statement": "John is taking a course on Artificial Intelligence.",
        "reason": "The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.",
        "verdict": 0
    },
    {
        "statement": "John is a dedicated student.",
        "reason": "The context states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.",
        "verdict": 1
    },
    {
        "statement": "John has a part-time job.",
        "reason": "There is no information given in the context about John having a part-time job.",
        "verdict": 0
    }
]

- Check Parsing Logic: Ensure that the _statements_output_parser.aparse method is correctly parsing the output. If parsing fails, it retries up to a specified number of times (max_retries). If all retries fail, it returns None.

class RagasoutputParser(PydanticOutputParser):
    async def aparse(self, result: str, prompt: PromptValue, llm: BaseRagasLLM, max_retries: int = 1):
        try:
            output = super().parse(result)
        except OutputParserException:
            if max_retries != 0:
                p_value = FIX_OUTPUT_FORMAT.format(prompt=prompt.to_string(), completion=result)
                output = await llm.generate(p_value)
                result = output.generations[0][0].text
                return await self.aparse(result, prompt, llm, max_retries - 1)
            else:
                logger.warning("Failed to parse output. Returning None.")
                return None
        return output

- Set max_retries: Ensure that max_retries is set to a reasonable value to allow for retries in case of initial parsing failures.
Example Code for Troubleshooting
Here is an example code snippet to help you troubleshoot the issue:
import json
import inspect
import numpy as np
import logging
# Ensure logging is set up
logging.basicConfig(level=logging.DEBUG)
# Print the LLM output
print("LLM Output:")
print(statements_gen.generations[0][0].text)
# Parse the LLM output
statements = await _statements_output_parser.aparse(
statements_gen.generations[0][0].text, p_value, self.llm, self.max_retries
)
# Check if statements is None
if statements is None:
logging.warning("Failed to parse statements. Returning NaN.")
return np.nan
# Ensure statements is not a coroutine before calling dicts()
if inspect.iscoroutine(statements):
statements = await statements
# Add error handling and logging
if not hasattr(statements, "dicts"):
logging.error(f"Unexpected type for statements: {type(statements)}")
logging.error(f"Statements content: {statements}")
raise AttributeError(
f"'statements' object of type {type(statements)} has no attribute 'dicts'"
)
# Process the statements
statements = [item["simpler_statements"] for item in statements.dicts()]
statements = [item for sublist in statements for item in sublist]
# Continue with the rest of the faithfulness metric calculation
faithfulness_list = [
await _faithfulness_output_parser.aparse(
text, p_value, self.llm, self.max_retries
)
for text in nli_result_text
]
faithfulness_list = [
faith.dicts() for faith in faithfulness_list if faith is not None
]
if faithfulness_list:
faithfulness_list = ensembler.from_discrete(
faithfulness_list,
"verdict",
)
faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
faithfulness_list
)
else:
return np.nan
return self._compute_score(faithfulness_list)
By following these steps and using the provided code snippet, you can ensure that the statements variable is correctly populated, allowing the faithfulness metric to compute the score without returning NaN [1][2].
To continue talking to Dosu, mention @dosu.
@dosu
How do I solve this error?
Exception raised in Job[7]: TimeoutError()
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[3]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[1]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
To solve the TimeoutError exceptions raised in multiple jobs, you can modify the wrap_callable_with_index method in the Executor class to handle TimeoutError specifically. Here is an example of how you can do this:
@dataclass
class Executor:
# ... other fields and methods ...
def wrap_callable_with_index(self, callable: t.Callable, counter):
async def wrapped_callable_async(*args, **kwargs):
result = np.nan
try:
result = await callable(*args, **kwargs)
except MaxRetriesExceeded as e:
# this only for testset generation v2
logger.warning(f"max retries exceeded for {e.evolution}")
except TimeoutError as e:
logger.error(f"TimeoutError in Job[{counter}]: {str(e)}")
if self.raise_exceptions:
raise e
except Exception as e:
if self.raise_exceptions:
raise e
else:
exec_name = type(e).__name__
exec_message = str(e)
logger.error(
"Exception raised in Job[%s]: %s(%s)",
counter,
exec_name,
exec_message,
exc_info=False,
)
return counter, result
return wrapped_callable_async
This modification will log the TimeoutError specifically and re-raise it if self.raise_exceptions is set to True. This way, you can handle TimeoutError exceptions separately from other exceptions [1].
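If the TimeoutError simply comes from the local model being slow, another option (a sketch assuming ragas 0.1.x's RunConfig, which exposes timeout and max_retries) is to pass a more generous run configuration to evaluate, reusing the objects from the earlier snippet:

from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Assumption: RunConfig(timeout=...) bounds each LLM call; local models often
# need far more time than the default before the executor raises TimeoutError.
run_config = RunConfig(timeout=600, max_retries=10)

result = evaluate(
    amnesty_subset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
    llm=langchain_llm,
    embeddings=embeddings,
    run_config=run_config,
    raise_exceptions=False,
)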
To continue talking to Dosu, mention @dosu.
I'm closing this since it is a duplicate of #1170. We will get this sorted as quickly as possible.