Try Increasing max_tokens and try again
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
I want to know why the output token limit is being exceeded while evaluation is in progress. What might be causing this, and what is the possible fix?
Ragas version: 0.3.9
Python version: 3.11.0
Code to Reproduce

```python
from libraries import *

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
))

evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
))


def Ragas_Evaluation(dataset):
    column_map = {
        "question": "Test Cases",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = ResponseRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    metrics = [
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummarizationScore(llm=evaluator_llm),
    ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )
    return result
```
Error trace: `Try Increasing max_tokens and try again.`
Expected behavior: I need to know why max_tokens is getting exceeded.
Can you try using the new metrics collections?
`from ragas.metrics.collections import ...`
There's a max_tokens config you can use. Check out llm_factory.
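For example, with the wrapper-based setup above, one way to raise the output budget is to set max_tokens on the underlying Azure client before wrapping it. This is only a minimal sketch: the parameter names follow LangChain's AzureChatOpenAI, the 4096 value is an assumed starting point, and the llm_factory route has its own config that should be checked in its docs.

```python
from langchain_openai import AzureChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Raise the per-call completion budget; 2048-4096 is usually enough
# for the LLM-judged metrics discussed below.
evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    max_tokens=4096,  # assumed starting point; adjust to your deployment's limits
))
```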
I will try this. But I need to know why it is throwing that error even though I am not expecting any huge output.

> I need to know why it is throwing that error even though I am not expecting any huge output.
The default of 1024 is too small for any metric that generates reasoning/explanations, uses structured output (JSON), or requires few-shot examples in the prompt. Multi-step metrics multiply the problem:

- Faithfulness needs 1024 for statements + 1024 for NLI = 2048+ total
- ResponseRelevancy involves embedding + LLM evaluation

The newer metrics are a little more optimised. Start with the metrics collections and at least 2048 tokens.
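To make that arithmetic concrete (the call counts below follow the explanation above and are rough estimates, not measured values):

```python
default_max_tokens = 1024

# Faithfulness runs two generation steps per row (statement extraction, then
# NLI verdicts), so at the default budget it already needs:
faithfulness_budget = 2 * default_max_tokens   # 2048+ output tokens per row

# A safer per-call starting point, per the suggestion above:
suggested_max_tokens = 2048
```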
I am passing a dataset that contains the columns question, ground_truth, contexts, answer, and reference_contexts.
Each column is described below:

- question: the user prompt
- ground_truth: the actual answer from the document
- contexts: the full document content
- answer: the summary from the endpoint
- reference_contexts: the list of chunks and document links from the endpoint

My question is: while calculating the metrics, is the data passed one row at a time, or is the complete dataset passed at once?
For example, if I have 12 rows in my dataset and each row has values in the columns above, is each row sent for evaluation on its own, or are all 12 rows sent at once?
> while calculating the metrics, is the data passed one row at a time, or is the complete dataset passed at once?

Row-by-row.
So, row by row, the data is sent to each metric and the LLM generates the complete reasoning and explanation, which is why it takes more tokens to finish the evaluation process and produce the scores. Is this correct?
Also, if it is row-by-row, each row will have a maximum limit of 16,383 tokens if we use gpt-4o, right?
Yes, you're right. However, if it's not done async on a per-row basis, it won't give correct eval results.
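Conceptually, the per-row pattern looks something like the sketch below. This is a simplified illustration, not the Ragas implementation; score_row, score_dataset, and the metric's ascore method here are stand-ins.

```python
import asyncio

# Simplified illustration of per-row, per-metric async scoring (not Ragas internals).
async def score_row(row, metrics):
    # Each metric makes its own LLM call(s) for this single row,
    # so every call has its own max_tokens budget.
    return {metric.name: await metric.ascore(row) for metric in metrics}

async def score_dataset(rows, metrics):
    # Rows are scored concurrently, but each LLM call still sees one row at a time.
    return await asyncio.gather(*(score_row(row, metrics) for row in rows))
```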
So, I want to know what could be the reason that the max_tokens limit of 16,384 is getting exceeded when it is processing row-by-row?
It shouldn't be 16,384. I believe it should be between 2048 and 4096 🤔
Can you show me your code where it's taking more than that?
```python
from libraries import *

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
    max_tokens=16383,
))

evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
))


def Ragas_Evaluation(dataset):
    column_map = {
        "question": "Test Cases",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = ResponseRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    metrics = [
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummarizationScore(llm=evaluator_llm),
    ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )
    return result
```

This is the code.
I am sending the whole document content in the contexts field. Is that the reason it is throwing the error? If that is the issue, shouldn't the error be about input tokens? Why is the issue with output tokens?
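As a quick diagnostic, you can count how many input tokens one row's context actually uses with tiktoken (a standalone sketch; `o200k_base` is the encoding used by gpt-4o, and `context_text` is a placeholder for one row's contexts value):

```python
import tiktoken

# Diagnostic sketch: measure the input-token size of a single row's context.
enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o

context_text = "...full document content for one row..."  # placeholder
print(len(enc.encode(context_text)), "input tokens in the context alone")
```

A very large context mostly consumes input tokens, but it can also push the judge to emit a longer structured output (for example, more extracted statements for Faithfulness), which eats into the completion budget.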
@AkashKandra we're not supporting evaluate anymore. Please use the latest metrics.collections with the llm_factory approach, as mentioned before.
It's not just a structural change; it's an architecture shift.
metrics.collections does not support ResponseRelevancy, SummarizationScore, answer_relevancy, faithfulness, context_recall, context_precision. Is there any other package for these metrics?

> metrics.collections does not support ResponseRelevancy, SummarizationScore, answer_relevancy, faithfulness, context_recall, context_precision. Is there any other package for these metrics?
ResponseRelevancy is merged into AnswerRelevancy since they're similar.
SummarizationScore is now SummaryScore.
Faithfulness, ContextRecall, and ContextPrecision are still there.
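For example, the renamed equivalents can all be imported from the collections module (this mirrors the import list shared later in the thread):

```python
from ragas.metrics.collections import (
    Faithfulness,
    AnswerRelevancy,      # replaces ResponseRelevancy
    ContextRecall,
    ContextPrecision,
    ContextUtilization,
    ContextEntityRecall,
    SummaryScore,         # replaces SummarizationScore
)
```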
```
TypeError: All metrics must be initialised metric objects, e.g: metrics=[BleuScore(), AspectCritic()]
```

Why am I getting this error when I have initialized all the metrics as objects?

> TypeError: All metrics must be initialised metric objects, e.g: metrics=[BleuScore(), AspectCritic()]
> Why am I getting this error when I have initialized all the metrics as objects?
AspectCritic is removed in collections. Can you show me your code? Perhaps it's still on an older version.
It is showing BleuScore and AspectCritic as examples; I have never used them.
This is my updated code:
```python
from libraries import *

openai_client = AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
    max_tokens=16383,
)

embedding_client = AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
)

evaluator_llm = llm_factory("gpt-4o", client=openai_client)
evaluator_embeddings = embedding_factory("openai", "text-embedding-3-small", client=embedding_client)


def Ragas_Evaluation(dataset):
    column_map = {
        "question": "Question",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = AnswerRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    # metrics = [
    #     Faithfulness(llm=evaluator_llm),
    #     ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
    #     ContextRecall(llm=evaluator_llm),
    #     ContextPrecision(llm=evaluator_llm),
    #     ContextUtilization(llm=evaluator_llm),
    #     ContextEntityRecall(llm=evaluator_llm),
    #     NoiseSensitivity(llm=evaluator_llm),
    #     SummarizationScore(llm=evaluator_llm),
    # ]

    metrics = [
        Faithfulness(llm=evaluator_llm),
        AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummaryScore(llm=evaluator_llm),
    ]

    # metrics = [
    #     Faithfulness(llm=evaluator_llm),
    # ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )
    # result = {'faithfulness': 0.7917, 'answer_relevancy': 0.7515, 'context_recall': 0.6500,
    #           'context_precision': 0.7000, 'context_utilization': 0.8000,
    #           'context_entity_recall': 0.2964, 'noise_sensitivity(mode=relevant)': 0.3502,
    #           'summary_score': 0.5697}
    return result
```
evaluate is deprecated in collections. Can you share the imports part?
```python
from ragas import (
    SingleTurnSample,
    EvaluationDataset,
    evaluate,
)
from ragas.embeddings import LangchainEmbeddingsWrapper, embedding_factory
from ragas.llms import LangchainLLMWrapper, llm_factory, InstructorLLM
from ragas.metrics.collections import (
    Faithfulness,
    AnswerRelevancy,
    ContextRecall,
    ContextPrecision,
    ContextUtilization,
    ContextEntityRecall,
    NoiseSensitivity,
    SummaryScore,
)
from ragas.testset import TestsetGenerator
```
If evaluate is deprecated, then which evaluation function should I use to evaluate the metrics for my dataset?

> If evaluate is deprecated, then which evaluation function should I use to evaluate the metrics for my dataset?
The structure has changed in recent versions. You won't need evaluate anymore; it'll be removed soon.
For starters, I'd recommend beginning with the quickstart, then checking out all the available metrics as per your needs. Follow the docs. You can also check out the examples and tutorials.
We're in the middle of updating the rest of the docs in the meantime. Soon it'll all be cleaned up.
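To give a sense of the direction the quickstart describes, scoring happens per row on metric instances rather than through evaluate. The sketch below is only an illustration of that shape; the method name (ascore) and field names (user_input, response, retrieved_contexts) are assumptions to verify against the quickstart for your Ragas version, and evaluator_llm is the llm_factory instance from the code above.

```python
import asyncio
from ragas.metrics.collections import Faithfulness

# Sketch only: method and field names are assumptions; check the quickstart.
faithfulness = Faithfulness(llm=evaluator_llm)

async def main():
    score = await faithfulness.ascore(
        user_input="What does the policy say about refunds?",   # hypothetical row
        response="Refunds are issued within 30 days.",
        retrieved_contexts=["...full document content..."],
    )
    print(score)

asyncio.run(main())
```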
Okay, thank you!
It seems the issue was answered, closing this now.