
Significance of docstore in TestsetGenerator.

Open adityamity opened this issue 1 year ago • 9 comments

[ ] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question: What is the use of docstore in TestsetGenerator? How will it affect the evaluation process if I use it versus if I do not use it?

Code Examples

@dataclass
class TestsetGenerator:
    generator_llm: BaseRagasLLM
    critic_llm: BaseRagasLLM
    embeddings: BaseRagasEmbeddings
    docstore: DocumentStore

    @classmethod
    def from_llama_index(
        cls,
        generator_llm: LlamaindexLLM,
        critic_llm: LlamaindexLLM,
        embeddings: LlamaIndexEmbeddings,
        docstore: t.Optional[DocumentStore] = None,
        run_config: RunConfig = RunConfig(),
    ) -> "TestsetGenerator":
        generator_llm_model = LlamaIndexLLMWrapper(generator_llm)
        critic_llm_model = LlamaIndexLLMWrapper(critic_llm)
        embeddings_model = LlamaIndexEmbeddingsWrapper(embeddings)
        keyphrase_extractor = KeyphraseExtractor(llm=generator_llm_model)
        if docstore is None:
            from langchain.text_splitter import TokenTextSplitter

            splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
            docstore = InMemoryDocumentStore(
                splitter=splitter,
                embeddings=embeddings_model,
                extractor=keyphrase_extractor,
                run_config=run_config,
            )
        return cls(
            generator_llm=generator_llm_model,
            critic_llm=critic_llm_model,
            embeddings=embeddings_model,
            docstore=docstore,
        )

Additional context: When I generate the testset using TestsetGenerator.from_llama_index, I found that for many of the questions the ground_truth comes out as 'The answer to given question is not present in context'. Please also help me understand where Ragas gets the ground_truth and contexts from.

adityamity avatar Sep 14 '24 19:09 adityamity

The significance of docstore in TestsetGenerator is that it is used to store and manage the documents during the test set generation process. If you do not use docstore, the TestsetGenerator will create an InMemoryDocumentStore by default, which handles document storage and management internally.

Using a custom docstore allows you to have more control over how documents are stored, retrieved, and managed, which can be beneficial for specific use cases or optimizations.
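
For instance, if the default 1024-token chunks are too coarse for your documents, you can build the docstore yourself and pass it in. The sketch below mirrors the default construction shown further down, changing only the splitter settings; it assumes generator_llm, critic_llm, and embeddings are your LlamaIndex LLMs and embedding model, the chunk size is illustrative, and the exact wrapper import paths may differ slightly between ragas 0.1.x releases.

from langchain.text_splitter import TokenTextSplitter
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.extractor import KeyphraseExtractor
from ragas.llms import LlamaIndexLLMWrapper
from ragas.embeddings import LlamaIndexEmbeddingsWrapper

# Wrap the LlamaIndex LLM/embeddings the same way from_llama_index does internally
generator_llm_model = LlamaIndexLLMWrapper(generator_llm)
embeddings_model = LlamaIndexEmbeddingsWrapper(embeddings)

# Custom chunking (values are illustrative, not recommendations)
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

custom_docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=embeddings_model,
    extractor=KeyphraseExtractor(llm=generator_llm_model),
)

generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
    docstore=custom_docstore,  # used instead of the default InMemoryDocumentStore
)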

When generating the test set using TestsetGenerator.from_llama_index, Ragas determines the ground_truth and contexts by chunking the provided documents and adding them to the docstore. The docstore then manages these documents and provides the necessary data for the evaluation process. The InMemoryDocumentStore specifically handles the splitting of documents into smaller nodes, embedding these nodes, and extracting keyphrases, which are then used to calculate document similarities and set relationships between nodes.

Here is the relevant code snippet:

@classmethod
def from_llama_index(
    cls,
    generator_llm: LlamaindexLLM,
    critic_llm: LlamaindexLLM,
    embeddings: LlamaIndexEmbeddings,
    docstore: t.Optional[DocumentStore] = None,
    run_config: RunConfig = RunConfig(),
) -> "TestsetGenerator":
    generator_llm_model = LlamaIndexLLMWrapper(generator_llm)
    critic_llm_model = LlamaIndexLLMWrapper(critic_llm)
    embeddings_model = LlamaIndexEmbeddingsWrapper(embeddings)
    keyphrase_extractor = KeyphraseExtractor(llm=generator_llm_model)
    if docstore is None:
        from langchain.text_splitter import TokenTextSplitter
        splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
        docstore = InMemoryDocumentStore(
            splitter=splitter,
            embeddings=embeddings_model,
            extractor=keyphrase_extractor,
            run_config=run_config,
        )
    return cls(
        generator_llm=generator_llm_model,
        critic_llm=critic_llm_model,
        embeddings=embeddings_model,
        docstore=docstore,
    )

And the method for generating the test set:

def generate_with_llamaindex_docs(
    self,
    documents: t.Sequence[LlamaindexDocument],
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    distributions = distributions or {}
    # chunk documents and add to docstore
    self.docstore.add_documents(
        [Document.from_llamaindex_document(doc) for doc in documents]
    )
    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        run_config=run_config,
        raise_exceptions=raise_exceptions,
    )

The InMemoryDocumentStore class handles the addition of documents by splitting them into smaller nodes, embedding these nodes, and extracting keyphrases. It also calculates similarities between nodes and sets relationships, which are crucial for the evaluation process [1][2].
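
Conceptually, the flow inside add_documents looks roughly like the simplified sketch below. This is only an illustration of the steps described above, not the actual ragas implementation; the method names on the splitter, embeddings, and extractor objects are placeholders.

# Simplified illustration of the docstore ingestion steps -- not ragas source code
def add_documents_sketch(documents, splitter, embeddings, extractor):
    nodes = []
    for doc in documents:
        # 1. Split each document into smaller chunks ("nodes")
        for chunk in splitter.split_text(doc.page_content):
            nodes.append({"text": chunk, "metadata": doc.metadata})

    for node in nodes:
        # 2. Embed every node so related chunks can be found by similarity
        node["embedding"] = embeddings.embed_query(node["text"])
        # 3. Extract keyphrases that later seed question generation
        node["keyphrases"] = extractor.extract(node["text"])

    # 4. Pairwise similarities between node embeddings link related nodes;
    #    these links are what multi-context and reasoning evolutions rely on
    return nodes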


dosubot[bot] avatar Sep 14 '24 19:09 dosubot[bot]

@dosu please help me understand why, for many of the questions, I am getting 'ground_truth' as 'The answer to given question is not present in context' even though the context is available in the 'contexts' column.

adityamity avatar Sep 14 '24 19:09 adityamity

You are getting 'ground_truth' as 'The answer to given question is not present in context' for many questions because the question_answer_prompt is designed to output this verdict if the answer is not explicitly present in the context. This is evident from the examples in the prompt, such as:

{
    "context": """The novel "Pride and Prejudice" by Jane Austen revolves around the character Elizabeth Bennet and her family. The story is set in the 19th century in rural England and deals with issues of marriage, morality, and misconceptions.""",
    "question": "What year was 'Pride and Prejudice' published?",
    "answer": AnswerFormat.parse_obj(
        {
            "answer": "The answer to given question is not present in context",
            "verdict": "-1",
        }
    ).dict(),
}

In this example, the context does not contain the publication year of "Pride and Prejudice," leading to the verdict '-1'. If your context does not explicitly contain the information required to answer the question, the prompt will output that the answer is not present.

Additionally, if you are using a RAG (Retrieval-Augmented Generation) system, the ground_truth column can be ignored for online evaluations since RAG systems often perform reference-free evaluations. This means that the system does not rely on pre-defined ground-truth answers and instead evaluates the generated answers based on metrics like faithfulness, answer relevancy, and context precision [1][2][3][4].
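
If you simply want to inspect or drop those rows before running the evaluation, a small pandas filter on the generated testset works. This is an illustrative snippet, assuming testset is the TestsetGenerator output; the sentinel string matches the prompt example above.

df = testset.to_pandas()

# Rows where the generator could not ground an answer in the sampled context
missing = df["ground_truth"].str.contains(
    "not present in context", case=False, na=False
)
print(f"{missing.sum()} of {len(df)} questions have no grounded answer")

# Keep only the well-grounded rows for downstream evaluation
clean_df = df[~missing].reset_index(drop=True)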


dosubot[bot] avatar Sep 14 '24 19:09 dosubot[bot]

@dosu from where and how does Ragas create 'Contexts' using TestSetGenerator?

adityamity avatar Sep 14 '24 20:09 adityamity

Ragas creates 'Contexts' using TestsetGenerator by initializing it with the necessary components such as language models, embeddings, and a document store. This is done using the from_langchain or from_llama_index class methods. Here is the relevant code:

@dataclass
class TestsetGenerator:
    generator_llm: BaseRagasLLM
    critic_llm: BaseRagasLLM
    embeddings: BaseRagasEmbeddings
    docstore: DocumentStore

    @classmethod
    def from_langchain(
        cls,
        generator_llm: LangchainLLM,
        critic_llm: LangchainLLM,
        embeddings: LangchainEmbeddings,
        docstore: t.Optional[DocumentStore] = None,
        run_config: RunConfig = RunConfig(),
        chunk_size: int = 1024,
    ) -> "TestsetGenerator":
        generator_llm_model = LangchainLLMWrapper(generator_llm)
        critic_llm_model = LangchainLLMWrapper(critic_llm)
        embeddings_model = LangchainEmbeddingsWrapper(embeddings)

        keyphrase_extractor = KeyphraseExtractor(llm=generator_llm_model)
        if docstore is None:
            from langchain.text_splitter import TokenTextSplitter

            splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
            docstore = InMemoryDocumentStore(
                splitter=splitter,
                embeddings=embeddings_model,
                extractor=keyphrase_extractor,
                run_config=run_config,
            )
        return cls(
            generator_llm=generator_llm_model,
            critic_llm=critic_llm_model,
            embeddings=embeddings_model,
            docstore=docstore,
        )

    @classmethod
    def from_llama_index(
        cls,
        generator_llm: LlamaindexLLM,
        critic_llm: LlamaindexLLM,
        embeddings: LlamaIndexEmbeddings,
        docstore: t.Optional[DocumentStore] = None,
        run_config: RunConfig = RunConfig(),
    ) -> "TestsetGenerator":
        generator_llm_model = LlamaIndexLLMWrapper(generator_llm)
        critic_llm_model = LlamaIndexLLMWrapper(critic_llm)
        embeddings_model = LlamaIndexEmbeddingsWrapper(embeddings)
        keyphrase_extractor = KeyphraseExtractor(llm=generator_llm_model)
        if docstore is None:
            from langchain.text_splitter import TokenTextSplitter

            splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
            docstore = InMemoryDocumentStore(
                splitter=splitter,
                embeddings=embeddings_model,
                extractor=keyphrase_extractor,
                run_config=run_config,
            )
        return cls(
            generator_llm=generator_llm_model,
            critic_llm=critic_llm_model,
            embeddings=embeddings_model,
            docstore=docstore,
        )

This code shows how TestsetGenerator is initialized with the necessary components to create contexts [1].


dosubot[bot] avatar Sep 14 '24 20:09 dosubot[bot]

@dosu Let me rephrase my question. To evaluate my RAG application built on LlamaIndex, I am using RAGAS. To evaluate using RAGAS, we first need to generate the testset using TestsetGenerator with the code below:


from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)
# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=5,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

This testset contains columns 'question', 'contexts', 'ground_truth' ,'evolution_type','metadata','episode_done'.

When I ran the evaluation on the testset using the code below:

from ragas.integrations.llama_index import evaluate
ds = testset.to_dataset()

ds_dict = ds.to_dict()
result = evaluate(
    query_engine=query_engine,
    metrics=metrics,
    dataset=ds_dict,
    llm=evaluator_llm,
    embeddings=OpenAIEmbedding(),
)

I found that for many questions the ground_truth comes out as 'The answer to given question is not present in context', but the answer column has a correct answer. Is RAGAS not able to generate the ground_truth properly, or are the contexts not being fetched properly? How do I handle such cases?

adityamity avatar Sep 14 '24 21:09 adityamity

The ground_truth is coming as 'The answer to given question is not present in context' for many questions because the contexts retrieved by the query engine do not contain the expected answer. This discrepancy occurs even though the answer column has a correct answer.

To handle such cases, ensure that the contexts retrieved by the query engine are relevant and contain the information needed to answer the question. You can improve the retrieval process by:

  1. Improving the quality of the embeddings: Ensure that the embeddings used for document retrieval are of high quality and accurately represent the content.
  2. Tuning the retrieval parameters: Adjust parameters such as similarity_top_k to retrieve more relevant contexts.
  3. Enhancing the document chunking strategy: Ensure that documents are chunked in a way that preserves the context needed to answer questions. A short sketch of points 2 and 3 follows this list.

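As an illustration of points 2 and 3 above, here is a minimal LlamaIndex sketch; the chunk size, overlap, and top-k values are placeholders to tune for your corpus, not recommendations.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# (3) Chunk documents so each node keeps enough surrounding context
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
vector_index = VectorStoreIndex.from_documents(documents, transformations=[parser])

# (2) Retrieve more candidate contexts per question
query_engine = vector_index.as_query_engine(similarity_top_k=5)
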
Additionally, you can use the Ragas framework to objectively measure and tweak the performance of your QueryEngine. Here is a guide on how to integrate Ragas with LlamaIndex and evaluate your QueryEngine:

  1. Load the documents:

    from llama_index.core import SimpleDirectoryReader
    
    documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()
    
  2. Initialize the TestsetGenerator:

    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    generator_llm = OpenAI(model="gpt-3.5-turbo-16k")
    critic_llm = OpenAI(model="gpt-4")
    embeddings = OpenAIEmbedding()
    
    generator = TestsetGenerator.from_llama_index(
        generator_llm=generator_llm,
        critic_llm=critic_llm,
        embeddings=embeddings,
    )
    
  3. Generate the testset:

    testset = generator.generate_with_llamaindex_docs(
        documents,
        test_size=5,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    df = testset.to_pandas()
    
  4. Build the QueryEngine:

    from llama_index.core import VectorStoreIndex
    
    vector_index = VectorStoreIndex.from_documents(documents)
    query_engine = vector_index.as_query_engine()
    
  5. Evaluate the QueryEngine:

    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    from ragas.metrics.critique import harmfulness
    
    metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        harmfulness,
    ]
    
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    evaluator_llm = OpenAI(model="gpt-3.5-turbo")
    
    ds = testset.to_dataset()
    ds_dict = ds.to_dict()
    
    from ragas.integrations.llama_index import evaluate
    
    result = evaluate(
        query_engine=query_engine,
        metrics=metrics,
        dataset=ds_dict,
        llm=evaluator_llm,
        embeddings=OpenAIEmbedding(),
    )
    
    result.to_pandas()
    

By following these steps, you can ensure that the contexts retrieved are relevant to the questions, thereby avoiding the issue of 'The answer to given question is not present in context' [1][2][3].


dosubot[bot] avatar Sep 14 '24 21:09 dosubot[bot]

@dosu I am using AzureAISearchVectorStore; what changes do I need to make to handle the above scenario?

adityamity avatar Sep 14 '24 21:09 adityamity

To handle the scenario where the ground_truth is coming as 'The answer to given question is not present in context' while using AzureAISearchVectorStore, you need to ensure that the contexts retrieved are relevant and contain the necessary information. Here are the steps to integrate AzureAISearchVectorStore and improve the retrieval process:

  1. Initialize Azure OpenAI Models and Embeddings:

    azure_configs = {
        "base_url": "https://<your-endpoint>.openai.azure.com/",
        "model_deployment": "your-deployment-name",
        "model_name": "your-model-name",
        "embedding_deployment": "your-deployment-name",
        "embedding_name": "text-embedding-ada-002",  # most likely
    }
    
    from langchain_openai.chat_models import AzureChatOpenAI
    from langchain_openai.embeddings import AzureOpenAIEmbeddings
    
    azure_model = AzureChatOpenAI(
        openai_api_version="2023-05-15",
        azure_endpoint=azure_configs["base_url"],
        azure_deployment=azure_configs["model_deployment"],
        model=azure_configs["model_name"],
        validate_base_url=False,
    )
    
    azure_embeddings = AzureOpenAIEmbeddings(
        openai_api_version="2023-05-15",
        azure_endpoint=azure_configs["base_url"],
        azure_deployment=azure_configs["embedding_deployment"],
        model=azure_configs["embedding_name"],
    )
    
  2. Initialize the TestsetGenerator with Azure Models and Embeddings:

    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    
    generator = TestsetGenerator.from_langchain(
        generator_llm=azure_model,
        critic_llm=azure_model,
        embeddings=azure_embeddings,
    )
    
  3. Generate the Testset:

    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=5,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    df = testset.to_pandas()
    
  4. Build the QueryEngine with AzureAISearchVectorStore: Ensure that your QueryEngine is properly configured to use AzureAISearchVectorStore for context retrieval. This might involve setting up the vector store and ensuring it uses the Azure embeddings.

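    The snippet below is an illustrative sketch only: the endpoint, key, index name, and field keys are placeholders, and the exact AzureAISearchVectorStore arguments may differ between llama-index versions.

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
    from llama_index.vector_stores.azureaisearch import (
        AzureAISearchVectorStore,
        IndexManagement,
    )

    # LlamaIndex-native Azure embeddings for the index (the LangChain embeddings
    # above are for the ragas TestsetGenerator); values reuse the placeholder config
    embed_model = AzureOpenAIEmbedding(
        model=azure_configs["embedding_name"],
        deployment_name=azure_configs["embedding_deployment"],
        azure_endpoint=azure_configs["base_url"],
        api_key="<your-azure-openai-key>",
        api_version="2023-05-15",
    )

    index_client = SearchIndexClient(
        endpoint="https://<your-search-service>.search.windows.net",
        credential=AzureKeyCredential("<your-search-admin-key>"),
    )

    vector_store = AzureAISearchVectorStore(
        search_or_index_client=index_client,
        index_name="ragas-eval-index",  # placeholder index name
        id_field_key="id",
        chunk_field_key="chunk",
        embedding_field_key="embedding",
        metadata_string_field_key="metadata",
        doc_id_field_key="doc_id",
        embedding_dimensionality=1536,  # matches text-embedding-ada-002
        index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    )

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, embed_model=embed_model
    )
    query_engine = vector_index.as_query_engine(similarity_top_k=3)
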
  5. Evaluate the QueryEngine:

    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    from ragas.metrics.critique import harmfulness
    
    metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        harmfulness,
    ]
    
    evaluator_llm = AzureChatOpenAI(
        openai_api_version="2023-05-15",
        azure_endpoint=azure_configs["base_url"],
        azure_deployment=azure_configs["model_deployment"],
        model=azure_configs["model_name"],
        validate_base_url=False,
    )
    
    ds = testset.to_dataset()
    ds_dict = ds.to_dict()
    
    from ragas.integrations.llama_index import evaluate
    
    result = evaluate(
        query_engine=query_engine,
        metrics=metrics,
        dataset=ds_dict,
        llm=evaluator_llm,
        embeddings=azure_embeddings,
    )
    
    result.to_pandas()
    

By following these steps, you can ensure that the contexts retrieved are relevant to the questions, thereby avoiding the issue of 'The answer to given question is not present in context' [1].


dosubot[bot] avatar Sep 14 '24 21:09 dosubot[bot]

docstore no longer exists in recent versions. Check out KnowledgeGraph and the latest docs for current usage.

anistark avatar Nov 03 '25 07:11 anistark

It seems the issue was answered, closing this now.

github-actions[bot] avatar Nov 07 '25 00:11 github-actions[bot]