Why is `answer_relevancy` always 0?
environment: python 3.9.20, datasets 3.0.1, langchain 0.3.3, langchain-community 0.3.2, langchain-core 0.3.10, langchain-openai 0.2.2, langchain-text-splitters 0.3.0, ragas 0.0.22
code:

```python
from operator import itemgetter
import warnings
import pandas as pd
from datasets import Dataset
from dotenv import load_dotenv
from langchain_community.chat_models import ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts.chat import (
    ChatPromptTemplate, SystemMessagePromptTemplate,
    HumanMessagePromptTemplate, MessagesPlaceholder,
)
from langchain_text_splitters import MarkdownHeaderTextSplitter

load_dotenv('.env')
warnings.filterwarnings('ignore')

# Read the source markdown and split it on headers
content_path = r"data/test.md"
with open(content_path, "r", encoding='utf-8') as f:
    page_content = f.read()
markdown_document = page_content
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Chat model: Qwen2 served behind an OpenAI-compatible endpoint
chat = ChatOpenAI(model="Qwen2", temperature=0.3, openai_api_key="xxxx",
                  openai_api_base='xxxx', stop=['<|im_end|>'])

# System prompt: "You are a conversational assistant; answer the user's question
# based on the document content."
system_prompt = SystemMessagePromptTemplate.from_template('你是一个对话助手,基于文档内容回答用户问题')
# User prompt: "Answer the question based on the document content below: {context} / Question: {query}"
user_prompt = HumanMessagePromptTemplate.from_template('''
基于下面的文档内容回答问题:
{context}
问题: {query}
''')
full_chat_prompt = ChatPromptTemplate.from_messages(
    [system_prompt, MessagesPlaceholder(variable_name="chat_history"), user_prompt])
# The rendered prompt follows the ChatML format, roughly:
#   <|im_start|>system
#   你是一个对话助手. (You are a conversational assistant.)<|im_end|>
#   ...
#   <|im_start|>user
#   仅基于下面的文本回答问题: (Answer the question based only on the text below:)
#   {context}
#   Question: {query}<|im_end|>
#   <|im_start|>assistant
#   ......<|im_end|>

# Init embedding model
embedding_model_name = r"embedding\bge-large-zh-v1.5"  # raw string keeps the backslash literal
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}
embed_model = HuggingFaceEmbeddings(model_name=embedding_model_name,
                                    model_kwargs=embedding_model_kwargs,
                                    encode_kwargs=embedding_encode_kwargs)

# Load the FAISS index and build the RAG chain
vector_load = FAISS.load_local('test.faiss', embed_model, allow_dangerous_deserialization=True)
faiss_retriever = vector_load.as_retriever(search_type="similarity", search_kwargs={"k": 20})
chat_chain = {
    "context": itemgetter("query") | faiss_retriever,
    "query": itemgetter("query"),
    "chat_history": itemgetter("chat_history"),
} | full_chat_prompt | chat

chat_history = []
questions = ["算法编排是什么?"]  # "What is algorithm orchestration?"
# Ground truth: "Algorithm orchestration is the process of defining components on the algorithm
# canvas by drag-and-drop, implementing data flow and computation for different business scenarios."
ground_truths = [["算法编排是可通过拖拉拽的方式在算法画布上定义组件,实现不同的业务场景下的数据流转和计算的过程。"]]
answers = []
contexts = []
for question in questions:
    answers.append(chat_chain.invoke({'query': question, 'chat_history': chat_history}).content)
    contexts.append([splits.page_content for splits in faiss_retriever.get_relevant_documents(question)])

data = {"question": questions, "answer": answers, "contexts": contexts, "ground_truths": ground_truths}
print(data)
dataset = Dataset.from_dict(data)
print(dataset)

# Point the ragas metrics at the custom LLM and embeddings
from ragas.llms.langchain import LangchainLLM
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy,
                           context_recall, context_precision)

rag_llm = LangchainLLM(llm=chat)
faithfulness.llm = rag_llm
faithfulness.embeddings = embed_model
context_recall.llm = rag_llm
context_recall.embeddings = embed_model
context_precision.llm = rag_llm
context_precision.embeddings = embed_model
answer_relevancy.llm = rag_llm
answer_relevancy.embeddings = embed_model

from ragas import evaluate

result = evaluate(
    dataset=dataset,
    metrics=[
        # faithfulness,
        answer_relevancy,
        # context_relevancy,
        # context_recall,
        # context_precision,
    ],
)
print(result)

pd.set_option("display.max_colwidth", None)
df = result.to_pandas()
print(df)
df.to_csv("data/test.csv", index=False, encoding='utf_8_sig')
```
outcome:

```text
evaluating with [answer_relevancy]
100%|██████████| 1/1 [00:03<00:00,  3.20s/it]
{'answer_relevancy': 0.0000}
   question  ...  answer_relevancy
0  算法编排是什么?  ...               0.0

[1 rows x 5 columns]

Process finished with exit code 0
```
Why is `answer_relevancy` always 0? What should I do? Please help me, thank you!
The `answer_relevancy` is always 0 because the score calculation includes a multiplication by `int(not committal)`. If `committal` is `True` (meaning at least one answer is noncommittal), `int(not committal)` becomes 0, resulting in a score of 0.
Here's the relevant part of the code:
```python
committal = np.any([answer.noncommittal for answer in answers])
score = cosine_sim.mean() * int(not committal)
```
If `committal` is `True`, the score will always be 0 due to the multiplication [1].
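To make that zeroing concrete, here is a toy reproduction of the quoted logic (my own sketch with made-up values and a hypothetical stand-in class, not the ragas source):

```python
import numpy as np

# Hypothetical stand-in for one item of the judge LLM's parsed output;
# only the field the quoted snippet reads is modeled here.
class GeneratedItem:
    def __init__(self, question: str, noncommittal: int):
        self.question = question
        self.noncommittal = noncommittal

# Suppose the judge generated one question but flagged the answer noncommittal
answers = [GeneratedItem("算法编排是什么?", noncommittal=1)]
cosine_sim = np.array([0.92])  # made-up similarity vs. the original question

committal = np.any([answer.noncommittal for answer in answers])
score = cosine_sim.mean() * int(not committal)
print(score)  # 0.0: one noncommittal verdict zeroes the whole score
```

So even when the generated questions are highly similar to the original, a single `noncommittal=1` verdict from the judge LLM drives the metric to 0.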
To continue talking to Dosu, mention @dosu.
So what should I do? @Dosu
To resolve the issue of `answer_relevancy` always being 0, follow these steps:
- **Check Noncommittal Flag**: Ensure that the `noncommittal` flag is set correctly. If all answers are marked as noncommittal, the score will be zero due to the multiplication by `int(not committal)`.
- **Verify Embeddings**: Make sure the embeddings are correctly initialized and used. The `calculate_similarity` function relies on embeddings to compute cosine similarity. Incorrect embeddings will lead to a failed similarity calculation (a quick sanity check is sketched right after this list).
- **Inspect Generated Questions**: Ensure that the generated questions are not empty. If they are, the score will be set to `np.nan`.
- **Review LLM Configuration**: Confirm that the language model (`llm`) is properly configured and capable of generating relevant questions.
- **Adjust Strictness**: Consider adjusting the `strictness` parameter, which affects the number of questions generated per answer.
- **Understand Relevancy Calculation**: The `answer_relevancy` score is calculated as the mean cosine similarity between the original `user_input` and generated questions based on the `response`. Ensure that the embeddings for both the generated questions and the original question are accurate [1][2].
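For the "Verify Embeddings" step, a quick sanity check could look like this (my own sketch reusing `embed_model`, `questions`, and `answers` from the script above; this is plain numpy, not a ragas API):

```python
import numpy as np

# Embed the original question and the model's answer with the same
# HuggingFace model ragas was given, then compare them directly.
q_vec = np.array(embed_model.embed_query(questions[0]))
a_vec = np.array(embed_model.embed_query(answers[0]))
cos = float(q_vec @ a_vec / (np.linalg.norm(q_vec) * np.linalg.norm(a_vec)))
print(cos)  # a relevant answer should land clearly above 0
```

If this number looks reasonable, the embeddings are probably fine and the zero score is more likely caused by the noncommittal flag.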
To continue talking to Dosu, mention @dosu.
> Check Noncommittal Flag

How do I check the `noncommittal` flag?
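One rough way to check it (my own sketch, not a ragas API; the prompt wording is only an approximation of what `answer_relevancy` sends to the judge) is to probe the judge LLM directly and read its raw reply:

```python
# Ask the judge LLM the same kind of question ragas asks internally and
# inspect the raw reply; `chat` and `answers` come from the script above.
probe = (
    "Generate a question for the given answer and identify if the answer is "
    "noncommittal. Give 'noncommittal' as 1 if the answer is noncommittal "
    "and 0 if it is committal. Reply strictly in JSON with the keys "
    "'question' and 'noncommittal'.\n\n"
    f"Answer: {answers[0]}"
)
print(chat.invoke(probe).content)
```

If the reply flags the answer as noncommittal, or is not valid JSON at all, that would explain a hard 0 from the metric.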