
Try Increasing max_tokens and try again

Open AkashKandra opened this issue 1 month ago • 23 comments

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

I want to know why the output tokens are being exceeded while evaluation is in progress. What might be causing this, and what is the possible fix?

Ragas version: 0.3.9
Python version: 3.11.0

Code to Reproduce

from libraries import *

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
))

evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
))

def Ragas_Evaluation(dataset):

    column_map = {
        "question": "Test Cases",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = ResponseRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    metrics = [
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummarizationScore(llm=evaluator_llm),
    ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )

    return result

Error trace

Try Increasing max_tokens and try again.

Expected behavior

I need to know the reason the max_tokens limit is getting exceeded.

AkashKandra avatar Nov 19 '25 12:11 AkashKandra

Can you try using the new metrics collections?

from ragas.metrics.collections import ...

There's a max_tokens config you can use. Check out llm_factory.
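
Roughly along these lines; whether llm_factory accepts max_tokens directly (or under another config name) depends on your installed version, so treat the keyword below as an assumption and check the llm_factory signature:

from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

# Sketch only: the max_tokens keyword is assumed, not confirmed; check the
# llm_factory docstring in your ragas version for the exact config name.
evaluator_llm = llm_factory(
    "gpt-4o",
    client=openai_client,  # your Azure OpenAI client
    max_tokens=4096,       # raise the output-token budget above the default
)

faithfulness = Faithfulness(llm=evaluator_llm)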

anistark avatar Nov 19 '25 12:11 anistark

I will try this. But I need to know why it throws that error even though I am not expecting a large output.

AkashKandra avatar Nov 19 '25 12:11 AkashKandra

I will try this. But I need to know why it throws that error even though I am not expecting a large output.

The default of 1024 is too small for any metric that generates reasoning/explanations, uses structured output (JSON), or requires few-shot examples in the prompt. Multi-step metrics multiply the problem:

  • Faithfulness needs 1024 for statements + 1024 for NLI = 2048+ total
  • ResponseRelevancy involves embedding + LLM evaluation

The newer metrics are a little more optimised. Start with the metrics collections and at least 2048 tokens for starters.
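
If you stay on the LangchainLLMWrapper approach for now, the quickest mitigation is raising max_tokens on the model you wrap. A minimal sketch based on your snippet (2048 to 4096 is a reasonable starting point):

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
    max_tokens=4096,  # raise the per-call output budget above the ~1024 default
))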

anistark avatar Nov 19 '25 12:11 anistark

I am giving a dataset that contains columns named question, ground truth, contexts, answer, and reference contexts.

Each column is described below:

  • question: the user prompt
  • ground truth: the actual answer from the document
  • context: the full document content
  • answer: the summary from the endpoint
  • reference context: the list of chunks and document links from the endpoint

My question is: while calculating the metrics, is it passing one row at a time or the complete dataset at once?

For example, if I have 12 rows in my dataset and each row has values in the above columns, is each row sent individually for evaluation, or are all 12 rows sent at once?

AkashKandra avatar Nov 20 '25 07:11 AkashKandra

while calculating the metrics, is it passing one row at a time or the complete dataset at once?

row-by-row

anistark avatar Nov 20 '25 09:11 anistark

So the data is sent row by row to each metric, and the LLM generates the complete reasoning and explanation for each row. Is that why it takes more tokens to finish the evaluation process and give the scores? Is this correct?

AkashKandra avatar Nov 21 '25 05:11 AkashKandra

Also, if it is row by row, each row will have a max limit of 16,383 tokens if we use gpt-4o, right?

AkashKandra avatar Nov 21 '25 05:11 AkashKandra

Yes, you're right. However, if it's not done asynchronously per row, it won't give correct eval results.

anistark avatar Nov 21 '25 05:11 anistark

So, I want to know why the max_tokens limit of 16,384 is getting exceeded when it processes row by row?

AkashKandra avatar Nov 21 '25 10:11 AkashKandra

It shouldn't need 16,384. I believe it should be between 2048 and 4096 🤔

Can you show me your code where it's taking more than that?

anistark avatar Nov 21 '25 11:11 anistark

from libraries import *

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
    max_tokens=16383,
))

evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
))

def Ragas_Evaluation(dataset):

    column_map = {
        "question": "Test Cases",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = ResponseRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    metrics = [
        Faithfulness(llm=evaluator_llm),
        ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummarizationScore(llm=evaluator_llm),
    ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )

    return result

This is the code

AkashKandra avatar Nov 21 '25 11:11 AkashKandra

I am sending the whole document content in the context field. Is that why it is throwing the error? If so, shouldn't the error be about input tokens? Why is the issue with output tokens?

AkashKandra avatar Nov 21 '25 11:11 AkashKandra

@AkashKandra we're not supporting evaluate anymore. Please use the latest metrics.collections with the llm_factory approach, as mentioned before.

It's not just a structural change; it's an architecture shift.

anistark avatar Nov 21 '25 11:11 anistark

metrics.collections does not support ResponseRelevancy, SummarizationScore, answer_relevancy, faithfulness, context_recall, or context_precision. Is there any other package for these metrics?

AkashKandra avatar Nov 25 '25 04:11 AkashKandra

metrics.collections does not support ResponseRelevancy, SummarizationScore, answer_relevancy, faithfulness, context_recall, or context_precision. Is there any other package for these metrics?

ResponseRelevancy is merged into AnswerRelevancy since they're similar. SummarizationScore is now SummaryScore. Faithfulness, ContextRecall, and ContextPrecision are still there.
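
In code, the renames map roughly like this (verify the exact exports in your installed version):

# Old name (ragas.metrics)             -> New name (ragas.metrics.collections)
# ResponseRelevancy / answer_relevancy -> AnswerRelevancy
# SummarizationScore                   -> SummaryScore
# Faithfulness, ContextRecall, ContextPrecision keep their names
from ragas.metrics.collections import (
    AnswerRelevancy,
    SummaryScore,
    Faithfulness,
    ContextRecall,
    ContextPrecision,
)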

anistark avatar Nov 25 '25 05:11 anistark

TypeError: All metrics must be initialised metric objects, e.g: metrics=[BleuScore(), AspectCritic()]

Why am I getting this error when I initialized all metrics as objects?

AkashKandra avatar Nov 26 '25 07:11 AkashKandra

TypeError: All metrics must be initialised metric objects, e.g: metrics=[BleuScore(), AspectCritic()]

Why am I getting this error when I initialized all metrics as objects?

AspectCritic is removed in collections. Can you show me your code? Perhaps it's still on an older version.

anistark avatar Nov 26 '25 07:11 anistark

It is showing BleuScore and AspectCritic as examples in the error message. I have never used them.

This is my updated code

from libraries import *

openai_client = AzureChatOpenAI(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    openai_api_version=azure_config["api_version"],
    validate_base_url=False,
    n=1,
    max_tokens=16383,
)

embedding_client = AzureOpenAIEmbeddings(
    openai_api_key=api_key,
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
    openai_api_version=azure_config["api_version"],
)

evaluator_llm = llm_factory("gpt-4o", client=openai_client)

evaluator_embeddings = embedding_factory("openai", "text-embedding-3-small", client=embedding_client)

def Ragas_Evaluation(dataset):

    column_map = {
        "question": "Question",
        "ground_truth": "Ground Truth",
        "contexts": "Context",
        "answer": "AI_Response",
        "reference_contexts": "Reference_Contexts",
    }

    evaluator_metric = AnswerRelevancy(
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        strictness=1,
    )

    # metrics = [
    #     Faithfulness(llm=evaluator_llm),
    #     ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
    #     ContextRecall(llm=evaluator_llm),
    #     ContextPrecision(llm=evaluator_llm),
    #     ContextUtilization(llm=evaluator_llm),
    #     ContextEntityRecall(llm=evaluator_llm),
    #     NoiseSensitivity(llm=evaluator_llm),
    #     SummarizationScore(llm=evaluator_llm),
    # ]
    metrics = [
        Faithfulness(llm=evaluator_llm),
        AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=1),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
        ContextUtilization(llm=evaluator_llm),
        ContextEntityRecall(llm=evaluator_llm),
        SummaryScore(llm=evaluator_llm),
    ]
    # metrics = [
    #     Faithfulness(llm=evaluator_llm),
    # ]

    result = evaluate(
        dataset,
        column_map=column_map,
        show_progress=True,
        metrics=metrics,
        embeddings=evaluator_embeddings,
        raise_exceptions=True,
    )
    # result = {'faithfulness': 0.7917, 'answer_relevancy': 0.7515, 'context_recall': 0.6500,
    #           'context_precision': 0.7000, 'context_utilization': 0.8000,
    #           'context_entity_recall': 0.2964, 'noise_sensitivity(mode=relevant)': 0.3502,
    #           'summary_score': 0.5697}

    return result

AkashKandra avatar Nov 26 '25 07:11 AkashKandra

evaluate is deprecated in collections. Can you share the imports part?

anistark avatar Nov 26 '25 07:11 anistark

from ragas import (
    SingleTurnSample,
    EvaluationDataset,
    evaluate,
)
from ragas.embeddings import LangchainEmbeddingsWrapper, embedding_factory
from ragas.llms import LangchainLLMWrapper, llm_factory, InstructorLLM
from ragas.metrics.collections import (
    Faithfulness,
    AnswerRelevancy,
    ContextRecall,
    ContextPrecision,
    ContextUtilization,
    ContextEntityRecall,
    NoiseSensitivity,
    SummaryScore,
)
from ragas.testset import TestsetGenerator

AkashKandra avatar Nov 26 '25 07:11 AkashKandra

If evaluate is deprecated, then which evaluation function should I use to evaluate the metrics for my dataset?

AkashKandra avatar Nov 26 '25 07:11 AkashKandra

If evaluate is deprecated, then which evaluation function should I use to evaluate the metrics for my dataset?

The structure has changed in recent versions. You won't need evaluate anymore. It'll be removed soon.

For starters, I would recommend starting from the quickstart, then checking out all the available metrics as per your need. Follow the docs. You can also check out the examples and tutorials.

We're in the middle of updating the rest of the docs in the meantime. Soon it'll all be cleaned up.
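
For orientation, the per-metric flow looks roughly like the sketch below. The ascore() call and its keyword names here are assumptions rather than the confirmed API, so take the exact signatures from the quickstart and the metric reference for your installed version:

import asyncio

from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

# Sketch only: ascore() and its keyword arguments are assumed, not confirmed;
# check the quickstart for the exact per-metric scoring API in your version.
evaluator_llm = llm_factory("gpt-4o", client=openai_client)  # client from your snippet above
metric = Faithfulness(llm=evaluator_llm)

# One example row, using the column names from your dataset
row = {
    "Question": "What is the refund policy?",
    "AI_Response": "Refunds are issued within 30 days.",
    "Context": "...full document content...",
}

async def main():
    return await metric.ascore(
        user_input=row["Question"],
        response=row["AI_Response"],
        retrieved_contexts=[row["Context"]],  # contexts are passed as a list of strings
    )

score = asyncio.run(main())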

anistark avatar Nov 26 '25 10:11 anistark

Okay, thank you!

AkashKandra avatar Nov 28 '25 07:11 AkashKandra

It seems the issue was answered, closing this now.

github-actions[bot] avatar Dec 02 '25 00:12 github-actions[bot]