
Ragas evaluation

Open · Senthselvi opened this issue 1 year ago • 27 comments

Faithfulness is not callable

Senthselvi avatar Sep 12 '24 19:09 Senthselvi

To resolve the issue where "Faithfulness is not callable" in the Ragas evaluation, you need to ensure that you are instantiating the Faithfulness class correctly. The error suggests that you might be trying to call an instance of Faithfulness as if it were a function.

Here is the correct way to instantiate and use the Faithfulness class:

faithfulness = Faithfulness()

Make sure you are not trying to call faithfulness as a function. For example, if you have something like this:

score = faithfulness()

You should change it to:

# Assuming you have a sample and callbacks defined
score = await faithfulness._single_turn_ascore(sample, callbacks)

Ensure that you are using the methods provided by the Faithfulness class correctly [1].
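
For reference, here is a minimal end-to-end sketch using the pre-built faithfulness metric instance from ragas 0.1.x (the dataset values are illustrative, and it assumes an evaluator LLM is available, e.g. via an OpenAI key or an explicit llm=/embeddings= pair):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness  # pre-built metric instance, not the Faithfulness class

data_samples = Dataset.from_dict({
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts': [['The First AFL–NFL World Championship Game was played on January 15, 1967.']],
    'ground_truth': ['The first superbowl was held on January 15, 1967'],
})

result = evaluate(data_samples, metrics=[faithfulness])
print(result)              # aggregate score, e.g. {'faithfulness': ...}
print(result.to_pandas())  # per-row scores as a DataFrame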

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 12 '24 19:09 dosubot[bot]

Ok got it

Senthselvi avatar Sep 12 '24 19:09 Senthselvi

Why is an empty DataFrame returned?

llm, embed_model = init_llm_ollama(base_url, bearer_token)

def evaluate(
    dataset: List[LLMTestCase],
    metrics: List[Callable],
    llm: llm,
    embeddings: embed_model,
    raise_exceptions: bool = True,
) -> pd.DataFrame:
    results = []

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)
    return results_df

Sample data

data = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [
        ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference'],
    ],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}
data_samples = Dataset.from_dict(data)

Example usage

test_cases = [
    LLMTestCase(
        question=question,
        answer=answer,
        contexts=context,
        ground_truth=ground_truth,
    )
    for question, answer, ground_truth, context in zip(
        data_samples['question'],
        data_samples['answer'],
        data_samples['ground_truth'],
        data_samples['contexts'],
    )
]

List of metrics functions

metrics = [
    answer_relevancy,
    # context_precision,
    # faithfulness,
    # context_recall
]

Evaluate

async def main():
    results_df = evaluate(
        dataset=test_cases,
        metrics=[Faithfulness],
        llm=llm,
        embeddings=embed_model,
        raise_exceptions=False,
    )

    print(results_df)
    results_df.to_csv(r'C:\TestingBot\Scripts\V1\myfile.csv', sep=',')

Run the async main function

asyncio.run(main())

Senthselvi avatar Sep 13 '24 07:09 Senthselvi

An Empty DataFrame is returned when evaluating the dataset using the evaluate function because the dataset specified by dataset_name does not exist in LangSmith. The function raises a ValueError if the dataset is not found, which indicates that the dataset must be present in LangSmith for the evaluation to proceed [1].

Additionally, the evaluate function in the Ragas framework requires the dataset to be in a specific format with columns such as question, contexts, answer, and ground_truth. If the dataset does not conform to this format or if the required columns are missing, the function will raise a ValueError, leading to an empty or non-existent result [2].

Ensure that your dataset is correctly formatted and exists in LangSmith. Also, make sure you have set the necessary environment variables, such as OPENAI_API_KEY, if you are using metrics that require access to external APIs like OpenAI.
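
As a rough sketch (assuming init_llm_ollama returns ragas-compatible llm/embeddings wrappers), you can call ragas.evaluate directly instead of the custom evaluate() stub above, which never appends anything to results and therefore always yields an empty DataFrame:

from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

result = evaluate(
    dataset=data_samples,               # the datasets.Dataset built from the dict above
    metrics=[answer_relevancy, faithfulness],
    llm=llm,                            # assumed ragas-compatible wrapper from init_llm_ollama
    embeddings=embed_model,
    raise_exceptions=True,              # surface failures instead of silently returning NaN
)
results_df = result.to_pandas()
print(results_df)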

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 13 '24 07:09 dosubot[bot]

Does Ragas support the llama3.1:8b model? I am getting NaN output.

Senthselvi avatar Sep 14 '24 18:09 Senthselvi

Common reasons for getting NaN outputs in the Ragas evaluation framework include:

  1. Exceptions during metric evaluation: If any of the metrics fail during evaluation and raise_exceptions is set to False, the evaluation will return np.nan for the row that failed.
  2. Validation failures: If the required columns for the metrics are missing or if the columns are of the wrong format, it can lead to NaN outputs.
  3. Unsupported sample type: If the sample type is unsupported, it can cause issues in the evaluation process, potentially leading to NaN outputs.
  4. Initialization issues: If the language model (LLM) or embeddings are not properly initialized, it can result in NaN outputs for metrics that depend on them [1].

Regarding support for the llama3.1:8b model, I couldn't find specific information in the repository. However, you can refer to the example code for integrating LlamaIndex with Ragas, which might help you set up and evaluate your model correctly [2].
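
If the NaNs come from the evaluator model setup itself, one common pattern (a sketch only; the model name, base URL, and langchain_community dependency are assumptions) is to wrap a local llama3.1:8b served by Ollama with the LangChain integrations and pass the wrappers to evaluate:

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate
from ragas.metrics import faithfulness

# Assumed local Ollama endpoint; adjust to your deployment
evaluator_llm = LangchainLLMWrapper(ChatOllama(model="llama3.1:8b", base_url="http://localhost:11434"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="llama3.1:8b", base_url="http://localhost:11434"))

result = evaluate(
    dataset=data_samples,
    metrics=[faithfulness],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    raise_exceptions=True,   # see the real exception instead of a NaN row
)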

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 18:09 dosubot[bot]

Python version: 3.11, Ragas: 0.1.16

Senthselvi avatar Sep 14 '24 18:09 Senthselvi

Please share the correct evaluation metrics.

Senthselvi avatar Sep 14 '24 18:09 Senthselvi

The correct evaluation metrics to use with Ragas version 0.1.16 in Python 3.11 are:

  1. context_precision
  2. context_recall
  3. faithfulness
  4. answer_relevancy
  5. context_utilization
  6. context_entity_recall
  7. noise_sensitivity
  8. summarization_score
  9. answer_semantic_similarity
  10. answer_correctness

You can import and use these metrics as follows:

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
)

These metrics measure various aspects of your Retrieval Augmented Generation (RAG) systems:

  1. Faithfulness - Measures the factual consistency of the answer to the context based on the question.
  2. Context_precision - Measures how relevant the retrieved context is to the question, conveying the quality of the retrieval pipeline.
  3. Answer_relevancy - Measures how relevant the answer is to the question.
  4. Context_recall - Measures the retriever's ability to retrieve all necessary information required to answer the question.
  5. Context_utilization - Evaluates how effectively the retrieved context is used in generating the answer.
  6. Context_entity_recall - Measures the recall of entities in the retrieved context.
  7. Noise_sensitivity - Assesses the robustness of the system to noisy inputs.
  8. Summarization_score - Evaluates the quality of summaries generated by the system.
  9. Answer_semantic_similarity - Measures the semantic similarity between the generated answer and the expected answer.
  10. Answer_correctness - Evaluates the correctness of the generated answer.

To run the evaluation on your dataset with these metrics, you can use the evaluate function provided by Ragas:

from ragas import evaluate
from datasets import Dataset

# Example dataset
dataset = Dataset.from_dict({
    'question': ["What is the capital of France?"],
    'contexts': [["Paris is the capital of France."]],
    'answer': ["Paris"],
    'ground_truth': [["Paris"]]
})

# Run evaluation
result = evaluate(dataset)
print(result)

This will give you a comprehensive view of your RAG system's performance based on the specified metrics [1][2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 18:09 dosubot[bot]

Error while evaluating: verify that the data type in your code aligns with the expected pyarrow data types.

Senthselvi avatar Sep 14 '24 18:09 Senthselvi

To ensure that your data types align with the expected pyarrow data types in the Ragas evaluation framework, verify that your dataset columns have the following types:

  • question: list<string>
  • contexts: list<list<string>>
  • answer: list<string>
  • ground_truth: list<list<string>>

Here's an example of how to prepare your dataset with the correct data types:

from datasets import Dataset

# Example dataset
data = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
                 ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': [['The first superbowl was held on January 15, 1967'], ['The New England Patriots have won the Super Bowl a record six times']]
}

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(data)

# Verify the data types
print(dataset)

# Example usage with Ragas evaluation
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
)

# List of metrics
metrics = [
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_utilization,
    context_entity_recall,
    noise_sensitivity,
    summarization_score,
    answer_semantic_similarity,
    answer_correctness,
]

# Run evaluation
result = evaluate(dataset, metrics=metrics)
print(result)

Ensure that your dataset conforms to these data types to avoid errors during evaluation [1][2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 18:09 dosubot[bot]

'dict' object has no attribute 'rename_columns'

Senthselvi avatar Sep 14 '24 19:09 Senthselvi

You are getting the error "'dict' object has no attribute 'rename_columns'" because the rename_columns method is being called on a dictionary object instead of a Dataset object. The rename_columns method is a method of the Dataset class from the datasets library, not a dictionary.

In the provided code, the rename_columns method is used in the convert_v1_to_v2_dataset and convert_v2_to_v1_dataset functions, which expect a Dataset object as their input. If you pass a dictionary to these functions, you will encounter this error.

Here is the relevant code snippet:

def convert_v1_to_v2_dataset(dataset: Dataset) -> Dataset:
    columns_map = {v: k for k, v in REQUIRED_COLS_v1.items() if v in dataset.features}
    return dataset.rename_columns(columns_map)

def convert_v2_to_v1_dataset(dataset: Dataset) -> Dataset:
    columns_map = {k: v for k, v in REQUIRED_COLS_v1.items() if k in dataset.features}
    return dataset.rename_columns(columns_map)

Additionally, you can refer to the remap_column_names function from the ragas library, which also demonstrates how to remap column names in a Dataset object:

def remap_column_names(dataset: Dataset, column_map: dict[str, str]) -> Dataset:
    """
    Remap the column names in case dataset uses different column names
    """
    inverse_column_map = {v: k for k, v in column_map.items()}
    return dataset.rename_columns(inverse_column_map)

Make sure that you are passing a Dataset object to these functions, not a dictionary [1][2].
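
For illustration, here is a minimal sketch of the difference (the column mapping used here is hypothetical):

from datasets import Dataset

data = {
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'contexts': [['The First AFL–NFL World Championship Game was played on January 15, 1967.']],
    'ground_truth': ['The first superbowl was held on January 15, 1967'],
}

# data.rename_columns(...) would fail: plain dicts have no such method
dataset = Dataset.from_dict(data)                              # convert to a Dataset first
dataset = dataset.rename_columns({'question': 'user_input'})   # rename_columns works on Dataset objects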

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 19:09 dosubot[bot]

AttributeError: 'LLMTestCase' object has no attribute 'items'

data = LLMTestCase(
    question=query,
    answer=actual_output,
    ground_truth="Vector databases store and provide access to unstructured data",
    contexts=retrieval_context,
)

Senthselvi avatar Sep 14 '24 21:09 Senthselvi

The AttributeError: 'LLMTestCase' object has no attribute 'items' suggests that somewhere in your code, an LLMTestCase object is being treated as a dictionary. This typically happens when a method or function expects a dictionary but receives an LLMTestCase object instead.

To resolve this, ensure that you are correctly using the LLMTestCase class and not mistakenly treating it as a dictionary. Here is an example of how to correctly instantiate and use an LLMTestCase object:

# Assuming LLMTestCase is defined correctly
data = LLMTestCase(
    question=query,
    answer=actual_output,
    ground_truth="Vector databases store and provide access to unstructured data",
    contexts=retrieval_context,
)

# Example usage
print(data.question)  # Accessing attributes of LLMTestCase
print(data.answer)
print(data.ground_truth)
print(data.contexts)

If you need to convert an LLMTestCase object to a dictionary for some reason, you can manually create the dictionary:

data_dict = {
    'question': data.question,
    'answer': data.answer,
    'ground_truth': data.ground_truth,
    'contexts': data.contexts,
}

# Now you can use data_dict as a dictionary

Ensure that any function or method expecting a dictionary receives data_dict instead of the LLMTestCase object directly. This should resolve the AttributeError.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 21:09 dosubot[bot]

ValueError: Dataset feature "contexts" should be of type Sequence[string], got <class 'datasets.features.features.Sequence'>

data = {
    'question': [query],  # List with a single question
    'answer': [actual_output],  # List with a single answer
    'contexts': [retrieval_context],  # List with a single list of contexts (retrieval_context should be a list of strings)
    'ground_truth': [
        "Vector databases store and provide access to unstructured data, such as text or images, in the form of their vector embeddings. "
        "Because similar objects are close together in vector space, the similarity of data objects can be calculated based on the distance between the data object’s vector embeddings."
    ]  # List with a single ground truth
}

Senthselvi avatar Sep 14 '24 21:09 Senthselvi

To resolve the ValueError: "Dataset feature 'contexts' should be of type Sequence[string], got <class 'datasets.features.features.Sequence'>", you need to ensure that the contexts field is correctly specified as a sequence of strings. Here is how you can adjust your dataset creation to meet the expected format:

  1. Ensure that retrieval_context is a list of strings.
  2. Explicitly define the feature types when creating the dataset.

Here is the corrected code snippet:

from datasets import Dataset, Features, Sequence, Value

# Ensure retrieval_context is a list of strings
retrieval_context = [
    "Context 1",
    "Context 2",
    # Add more contexts as needed
]

# Define your data
data = {
    'question': [query],  # List with a single question
    'answer': [actual_output],  # List with a single answer
    'contexts': [retrieval_context],  # List with a single list of contexts
    'ground_truth': [
        "Vector databases store and provide access to unstructured data, such as text or images, in the form of their vector embeddings. "
        "Because similar objects are close together in vector space, the similarity of data objects can be calculated based on the distance between the data object’s vector embeddings."
    ]  # List with a single ground truth
}

# Define the features explicitly (per-example types: plain strings, and a list of strings for contexts)
features = Features({
    'question': Value('string'),
    'answer': Value('string'),
    'contexts': Sequence(Value('string')),
    'ground_truth': Value('string')
})

# Create the dataset
dataset = Dataset.from_dict(data, features=features)

# Verify the dataset
print(dataset)

This code ensures that the contexts field is recognized as a sequence of strings, which should resolve the ValueError [1][2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 22:09 dosubot[bot]

KeyError('summary')
Exception raised in Job[2]: TypeError(CustomOllama.agenerate_text() got an unexpected keyword argument 'n')

Senthselvi avatar Sep 14 '24 22:09 Senthselvi

You are encountering a KeyError('summary') and a TypeError stating that CustomOllama.agenerate_text() got an unexpected keyword argument 'n'.

  1. KeyError('summary'):

    • Ensure that the data structure you are accessing contains the 'summary' key. The expected keys are summary and questions [1][2].
  2. TypeError(CustomOllama.agenerate_text() got an unexpected keyword argument 'n'):

    • The CustomOllama class's agenerate_text method does not support the n argument. Update the method signature to include the n argument:
async def agenerate_text(
    self,
    prompt: PromptValue,
    n: int = 1,
    temperature: t.Optional[float] = None,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.List = [],
) -> LLMResult:
    # Your implementation here

This should resolve the issues you are facing [3][4].
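
A rough sketch of an agenerate_text that accepts the extra keyword arguments ragas forwards (this is not the official wrapper; the Ollama endpoint, the httpx dependency, and the class shape are assumptions):

import typing as t
import httpx
from langchain_core.outputs import Generation, LLMResult
from langchain_core.prompt_values import PromptValue

class CustomOllama:
    def __init__(self, model_name: str, base_url: str = "http://localhost:11434"):
        self.model_name = model_name
        self.base_url = base_url

    async def agenerate_text(
        self,
        prompt: PromptValue,
        n: int = 1,                              # accept the kwargs ragas passes
        temperature: t.Optional[float] = None,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        generations = []
        async with httpx.AsyncClient(base_url=self.base_url, timeout=120.0) as client:
            for _ in range(n):                   # honour n by requesting n completions
                resp = await client.post(
                    "/api/generate",
                    json={"model": self.model_name, "prompt": prompt.to_string(), "stream": False},
                )
                resp.raise_for_status()
                generations.append(Generation(text=resp.json()["response"]))
        return LLMResult(generations=[generations])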

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 14 '24 22:09 dosubot[bot]

AttributeError('CustomOllama' object has no attribute 'model_name')

Senthselvi avatar Sep 15 '24 06:09 Senthselvi

The AttributeError: 'CustomOllama' object has no attribute 'model_name' indicates that the CustomOllama class is missing the model_name attribute. To resolve this, you need to ensure that the CustomOllama class includes the model_name attribute.

Here is an example of how you might define the CustomOllama class with the model_name attribute:

import typing as t
from langchain_core.outputs import LLMResult

class CustomOllama:
    def __init__(self, model_name: str, other_params: dict):
        self.model_name = model_name
        # Initialize other attributes as needed
        for key, value in other_params.items():
            setattr(self, key, value)

    async def agenerate_text(
        self,
        prompt: str,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.List = [],
    ) -> LLMResult:
        # Your implementation here
        pass

    # Add other methods as needed

Ensure that when you instantiate the CustomOllama class, you provide the model_name:

ollama = CustomOllama(model_name="llama3.1:8b", other_params={})

This should resolve the AttributeError by ensuring that the model_name attribute is properly defined and accessible in the CustomOllama class.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 15 '24 06:09 dosubot[bot]

Object of type PromptValue is not JSON serializable (Ragas evaluation)

Senthselvi avatar Sep 18 '24 18:09 Senthselvi

To resolve the "Object of type PromptValue is not JSON serializable" error in Ragas evaluation, you need to ensure that the PromptValue object is converted to a JSON-serializable format before attempting to serialize it. This can be done by implementing a custom to_dict method in the PromptValue class and using it during serialization.

Here is an example of how you can modify the PromptValue class to include a to_dict method:

import typing as t
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.prompt_values import PromptValue as BasePromptValue

class PromptValue(BasePromptValue):
    prompt_str: str

    def to_messages(self) -> t.List[BaseMessage]:
        """Return prompt as a list of Messages."""
        return [HumanMessage(content=self.to_string())]

    def to_string(self) -> str:
        return self.prompt_str

    def to_dict(self) -> t.Dict[str, t.Any]:
        """Convert the PromptValue to a dictionary."""
        return {"prompt_str": self.prompt_str}

Then, when you need to serialize the PromptValue object, you can use the to_dict method:

import json

prompt_value = PromptValue(prompt_str="Your prompt string here")
json_data = json.dumps(prompt_value.to_dict())

This ensures that the PromptValue object is converted to a JSON-serializable dictionary before serialization [1].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 18 '24 18:09 dosubot[bot]

Mean answer_correctness score: N/A

Senthselvi avatar Sep 19 '24 05:09 Senthselvi

The mean answer_correctness score is showing as N/A because the function _ascore returns np.nan when the answers variable is None. This happens if the _output_parser.aparse method fails to parse the result text.

answers = await _output_parser.aparse(
    result_text, p_value, self.llm, self.max_retries
)
if answers is None:
    return np.nan

Ensure that the result text conforms to the expected format and schema to avoid parsing failures [1].
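
One way to see the underlying parsing failure (a debugging sketch, assuming the dataset and evaluator wrappers from the earlier sketches) is to re-run the metric with raise_exceptions=True so the exception propagates instead of being swallowed into NaN:

from ragas import evaluate
from ragas.metrics import answer_correctness

result = evaluate(
    dataset=data_samples,
    metrics=[answer_correctness],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    raise_exceptions=True,   # let the JSON/schema parsing error surface
)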

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 19 '24 05:09 dosubot[bot]

Error in str.on_chain_start callback: AttributeError("'str' object has no attribute 'ignore_chain'")
AttributeError in AnswerSimilarity callback: 'str' object has no attribute 'raise_error'

Senthselvi avatar Sep 19 '24 10:09 Senthselvi

The errors you're encountering indicate that a string object is being used where a callback object is expected. Specifically, the str object does not have the attributes ignore_chain and raise_error, which are expected in the callback objects.

Resolving the Errors

  1. Error in str.on_chain_start callback: AttributeError("'str' object has no attribute 'ignore_chain'"):

    • Ensure that the on_chain_start method is being called on a valid callback object, not a string. The callback object should have the ignore_chain attribute.
  2. AttributeError in AnswerSimilarity callback: 'str' object has no attribute 'raise_error':

    • Ensure that the AnswerSimilarity callback is correctly instantiated and used. The callback object should have the raise_error attribute.

Correct Usage of Callbacks

Here is an example of how to correctly instantiate and use callback objects in the Ragas framework:

from ragas.callbacks import CallbackManagerForChainRun, CallbackManagerForChainGroup

# Example callback manager instantiation
callback_manager = CallbackManagerForChainRun(
    handlers=[],
    inheritable_handlers=[],
    parent_run_id=None,
    tags=[],
    inheritable_tags=[],
    metadata={},
    inheritable_metadata={}
)

# Ensure you are using the correct callback objects
callbacks = [callback_manager]

# Example usage in evaluation
result = await evaluate(
    dataset=test_cases,
    metrics=[Faithfulness],
    llm=llm,
    embeddings=embed_model,
    raise_exceptions=False,
    callbacks=callbacks
)

Attributes of Callback Objects

Ensure that your callback objects have the necessary attributes:

  • ignore_chain
  • raise_error

These attributes are part of the expected structure of callback objects in the Ragas framework [1][2].

Example of a Custom Callback

If you need to create a custom callback, ensure it includes the required attributes:

class CustomCallback:
    def __init__(self):
        self.ignore_chain = False
        self.raise_error = True

    def on_chain_start(self, *args, **kwargs):
        # Your implementation here
        pass

    def on_chain_end(self, *args, **kwargs):
        # Your implementation here
        pass

    def on_chain_error(self, *args, **kwargs):
        # Your implementation here
        pass

By ensuring that your callback objects are correctly instantiated and include the necessary attributes, you should be able to resolve the AttributeError issues.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 19 '24 10:09 dosubot[bot]

Looks like a syntax issue. With the current state, try: from ragas.metrics import Faithfulness and then faithfulness_metric = Faithfulness()

Or check out the new llm_factory.
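
A minimal sketch of that suggestion (assuming an evaluator LLM is available via llm_factory, which defaults to an OpenAI model and therefore needs an API key configured):

from ragas.metrics import Faithfulness
from ragas.llms import llm_factory

faithfulness_metric = Faithfulness(llm=llm_factory())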

anistark avatar Oct 30 '25 12:10 anistark

It seems the issue was answered, closing this now.

github-actions[bot] avatar Nov 04 '25 00:11 github-actions[bot]