faithfulness_score: nan
[ ] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
The faithfulness_score is always nan.
Code Examples

from agent import *
from datasets import Dataset
import json
import random
Load the text
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://baike.baidu.com/item/%E7%BA%BD%E7%BA%A6/6230")
loader.requests_kwargs = {'verify': False}
data = loader.load()
print(data)
Create the vector index
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders([loader])
Create the QA chain
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI

llm = ChatOpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
    verbose=True,
)
question = "纽约市的名字是怎么得来的?"  # "How did New York City get its name?"
result = qa_chain({"query": question})
result["result"]
print("========= chain result ==========")
print(result)
result['ground_truths'] = "纽约市的名字“纽约”来源于荷兰战败后将新阿姆斯特丹割让给英国的事件。"  # "The name 'New York' comes from the Dutch ceding New Amsterdam to the English after their defeat."

from ragas.metrics import faithfulness, answer_relevancy, context_relevancy, context_recall
from ragas.langchain.evalchain import RagasEvaluatorChain
make eval chains
eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}
evaluate
for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(result)[score_name]}")
Additional context
Entering new RetrievalQA chain...
Finished chain.
faithfulness_score: nan
answer_relevancy_score: 0.8500799716619457
context_relevancy_score: 0.016666666666666666
context_recall_score: 1.0
The faithfulness_score is always nan in your evaluation because metrics like faithfulness require the source_documents to be present in the result. Ensure that the source_documents are included in the result when running the evaluation.
Here are some key points to check:
- Ensure source_documents are included in the result:

  result = qa_chain({"query": question})
  if not result.get("source_documents"):
      raise ValueError("Source documents are missing in the result.")
- Check the segmentation and statement generation: the faithfulness metric uses a sentence segmenter to break the response down into simpler statements. Ensure that this segmentation is working correctly.

  from some_sentence_segmenter import segmenter  # Replace with actual import
  sentences = segmenter.segment(result["result"])
  sentences = [sentence for sentence in sentences if sentence.strip().endswith(".")]
- Verify the NLI (Natural Language Inference) step: the metric uses NLI to judge the faithfulness of the statements against the context. Ensure that the NLI model is correctly set up and generating valid results.

  from some_nli_model import nli_model  # Replace with actual import
  contexts_str = " ".join([doc.page_content for doc in result["source_documents"]])
  statements_str = " ".join(sentences)
  p_value = f"Context: {contexts_str}\nStatements: {statements_str}"
  nli_result = nli_model.generate(p_value)
- Evaluate faithfulness:

  eval_result = eval_chains["faithfulness"](result)
  print(f"faithfulness_score: {eval_result['faithfulness_score']}")
By ensuring these steps are correctly implemented and the necessary data is present, you should be able to avoid nan values in your faithfulness_score [1][2].
This seems to be a problem with Chinese text.
from datasets import Dataset
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness

data_samples = {
'question': ['When was the first super bowl?'],
'answer': ['The first superbowl was held on Jan 15, 1967'],
'retrieved_contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,']],
}
dataset = Dataset.from_dict(data_samples)
metrics = [
faithfulness
]
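# azure_configs is assumed to be defined elsewhere as a dict holding the Azure
# endpoint ("base_url") and the chat/embedding deployment and model names.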
azure_model = AzureChatOpenAI(
openai_api_version="2023-05-15",
azure_endpoint=azure_configs["base_url"],
azure_deployment=azure_configs["model_deployment"],
model=azure_configs["model_name"],
validate_base_url=False,
)
# init the embeddings for answer_relevancy, answer_correctness and answer_similarity
azure_embeddings = AzureOpenAIEmbeddings(
openai_api_version="2023-05-15",
azure_endpoint=azure_configs["base_url"],
azure_deployment=azure_configs["embedding_deployment"],
model=azure_configs["embedding_name"],
)
result = evaluate(
dataset, metrics=metrics, llm=azure_model, embeddings=azure_embeddings
)
print(result.to_pandas())
Even for the given example, I'm getting faithfulness as NaN:
Evaluating: 0%| | 0/1 [00:00<?, ?it/s]No statements were generated from the answer.
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.67s/it]
question ... faithfulness
0 When was the first super bowl? ... NaN
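The "No statements were generated from the answer" warning points at the statement-generation step. One way to narrow it down is to run the sentence segmenter that faithfulness relies on directly against the answer. This is only a debugging sketch, assuming get_segmenter from ragas.metrics.base accepts a language name and returns a pysbd-style segmenter with a segment() method:

from ragas.metrics.base import get_segmenter

# Inspect what the default (English) segmenter extracts from the answer
# before the LLM turns those sentences into statements.
segmenter = get_segmenter(language="english", clean=False)
print(segmenter.segment(data_samples["answer"][0]))

If this prints sensible sentences, the NaN more likely comes from the LLM's statement-generation output than from the segmentation itself.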
I'm sorry to see this result. I have the same problem here. Does anyone have a solution?
@shahules786 @jjmachan Can someone help here, please?
You can try changing the LLM and embeddings; I used ZhipuAI and the results were much better. My Chinese dataset has 147 rows: there are 12 nulls when I use ZhipuAI, but only 16 non-null scores when I use OpenAI. I'm not sure whether this is related to the Chinese language, but it's not as if OpenAI can't compute this metric at all, which really confuses me. A certain number of nulls is acceptable (as with ZhipuAI), but fewer than 10% non-null scores with OpenAI basically means this metric is not meaningful for my dataset.
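If you want to try a different evaluator LLM, any LangChain chat model can be passed to evaluate, just like the Azure model above. A minimal sketch, assuming your langchain_community version ships a ChatZhipuAI wrapper and reusing the dataset and faithfulness metric from the earlier example (the model name and API key are placeholders, and parameter names may differ between versions):

from langchain_community.chat_models import ChatZhipuAI
from ragas import evaluate
from ragas.metrics import faithfulness

# Placeholder model name and key; check your installed version for the exact parameters.
zhipu_llm = ChatZhipuAI(model="glm-4", api_key="your-zhipuai-api-key")

result = evaluate(dataset, metrics=[faithfulness], llm=zhipu_llm)
print(result.to_pandas())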
I just solved my problem. I saw output like this in the terminal:
No statements were generated from the answer
I checked the issues and found this one: https://github.com/explodinggradients/ragas/issues/1383
It is indeed a problem with Chinese, which is why there were only a few "nan" values when I used ZhipuAI. The problem was solved after adding the following lines before evaluate:
from ragas.experimental.metrics._faithfulness import FaithfulnessExperimental
from ragas.metrics.base import get_segmenter
faithfulness = FaithfulnessExperimental()
faithfulness.sentence_segmenter = get_segmenter(language="chinese", clean=False)
Maybe you can give it a try.
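For completeness, here is roughly how the patched metric plugs back into the evaluation from the Azure example earlier in this thread; a sketch that reuses the dataset, azure_model, and azure_embeddings objects defined above:

from ragas import evaluate

# Re-run the evaluation with the faithfulness instance configured above,
# whose sentence_segmenter now handles Chinese.
result = evaluate(dataset, metrics=[faithfulness], llm=azure_model, embeddings=azure_embeddings)
print(result.to_pandas())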