
BadRequestError when generating testset using TestsetGenerator from_llama_index.

Open adityamity opened this issue 1 year ago • 5 comments

I am creating a testset from my nodes with the code below:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_llama_index(
    generator_llm=llm_model,
    critic_llm=llm_model,
    embeddings=embed_model,
)

testset = generator.generate_with_llamaindex_docs(
    nodes,
    test_size=10,
    distributions={
        "simple": 0.5,
        "multi_context": 0.4,
        "reasoning": 0.1,
    },
)

However, I am getting the error below:

BadRequestError: Error code: 400 - {'error': {'message': "Sorry! We've encountered an issue with repetitive patterns in your prompt. Please try again with a different prompt.", 'type': 'invalid_request_error', 'param': 'prompt', 'code': 'invalid_prompt'}}

adityamity avatar Sep 13 '24 03:09 adityamity

To resolve the BadRequestError (error code 400) when generating a testset with TestsetGenerator.from_llama_index, ensure that all required parameters are correctly provided and valid. Here is an example of how to use the from_llama_index method:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.llms import LlamaindexLLM
from ragas.embeddings import LlamaIndexEmbeddings
from ragas.testset.docstore import DocumentStore

# Initialize your LlamaindexLLM, LlamaIndexEmbeddings, and DocumentStore instances
generator_llm = LlamaindexLLM(api_key="your_api_key")
critic_llm = LlamaindexLLM(api_key="your_api_key")
embeddings = LlamaIndexEmbeddings(api_key="your_api_key")
docstore = DocumentStore()

# Create the TestsetGenerator instance
generator = TestsetGenerator.from_llama_index(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
    docstore=docstore
)

# Now you can use the generator to generate testsets
nodes = [...]  # Your list of Node instances
test_size = 10  # Example test size

testset = generator.generate_with_llamaindex_docs(
    nodes=nodes,
    test_size=test_size,
    distributions={
        "simple": 0.5,
        "multi_context": 0.4,
        "reasoning": 0.1
    }
)

Ensure that the LlamaindexLLM, LlamaIndexEmbeddings, and DocumentStore instances are correctly initialized and that the nodes list contains valid Node instances. Additionally, verify that the distributions values sum to 1.0 [1].
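As a quick sanity check on that last point, you can assert the weights sum to 1.0 before calling the generator (a minimal sketch, reusing the distributions dict from above):

distributions = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
total = sum(distributions.values())
assert abs(total - 1.0) < 1e-9, f"distributions must sum to 1.0, got {total}"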


dosubot[bot] avatar Sep 13 '24 04:09 dosubot[bot]

@dosu I am using it in the same manner; below is how my nodes look:

[image: screenshot of the nodes]

When I run my code, it shows me the status below:

embedding nodes: 95%|█████████▌| 19/20 [00:02<00:00, 6.64it/s]

but every time it gets stuck at 92-95%, irrespective of the number of nodes.

adityamity avatar Sep 13 '24 04:09 adityamity

Your code likely gets stuck at 92-95% when embedding nodes using TestsetGenerator with NodeWithScore objects due to the embedding process itself. The embedding process can be computationally intensive and may involve network calls to external services, which can cause delays or timeouts.

In the provided code, the embedding process is handled by the doc_extractor.embed method:

nodes = doc_extractor.embed(
    nodes,
    ["page_content", "summary"],
    {
        "page_content": [
            NodeLevel.LEVEL_1,
            NodeLevel.LEVEL_2,
            NodeLevel.LEVEL_3,
        ],
        "summary": [NodeLevel.LEVEL_0],
    },
)

This method call is likely where the bottleneck occurs. The embedding process involves generating embeddings for the page_content and summary fields of the nodes, which can be time-consuming, especially if the dataset is large or if the embedding service is slow.

To address this issue, you might consider the following approaches:

  1. Optimize the Embedding Process: Ensure that the embedding service is efficient and can handle the load. If you are using an external API, check for rate limits and optimize the number of requests.
  2. Parallel Processing: If not already implemented, consider parallelizing the embedding process to distribute the load across multiple threads or processes.
  3. Batch Processing: Process the nodes in smaller batches to avoid overwhelming the embedding service (a minimal sketch follows this list).
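Here is a minimal sketch of the batching idea. The embed_in_batches helper and its batch_size are illustrative, not part of the ragas API, and it assumes a LangChain-style embeddings object with an embed_documents method:

from typing import List

def embed_in_batches(texts: List[str], embedder, batch_size: int = 32) -> List[List[float]]:
    """Embed texts in small batches so one slow or failed call doesn't stall the whole run."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Assumes a LangChain-style embeddings object; swap in your client's call here
        vectors.extend(embedder.embed_documents(batch))
        print(f"embedded {min(start + batch_size, len(texts))}/{len(texts)}")
    return vectors

Smaller batches also make it obvious which chunk of text the process stalls on, which helps when the hang is caused by a single oversized or pathological node.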

Additionally, you can use a custom embedding class like FakeEmbeddings from the ragas framework, which can help in testing and debugging by providing a controlled environment for embeddings:

import os
import pickle
import typing as t

# Assuming the LangChain base Embeddings interface, which ragas builds on
from langchain_core.embeddings import Embeddings

class FakeEmbeddings(Embeddings):
    def __init__(self):
        # Load pre-computed embeddings from a pickle file shipped alongside this module
        path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "test_embs.pkl")
        with open(path, "rb") as f:
            self.embeddings: dict[str, t.Any] = pickle.load(f)

    def _get_embedding(self, text: str) -> t.List[float]:
        # Return the cached vector if we have one, otherwise a 768-dim zero vector
        if text in self.embeddings:
            return self.embeddings[text]
        return [0] * 768

    def embed_documents(self, texts: t.List[str]) -> t.List[t.List[float]]:
        return [self._get_embedding(text) for text in texts]

    def embed_query(self, text: str) -> t.List[float]:
        return self._get_embedding(text)

    async def aembed_query(self, text: str) -> t.List[float]:
        return self._get_embedding(text)

Using FakeEmbeddings can help you simulate the embedding process and identify potential bottlenecks without relying on external services [1][2].
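For instance, a quick smoke test (assuming the test_embs.pkl cache file sits next to the class definition; the sample strings are placeholders):

fake = FakeEmbeddings()
vectors = fake.embed_documents(["chunk one", "chunk two"])
print(len(vectors), len(vectors[0]))  # expect: 2 768 (zero vectors for unseen texts)

If the pipeline still hangs with deterministic local embeddings like these, the bottleneck is not the embedding service and you should look at the nodes themselves.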


dosubot[bot] avatar Sep 13 '24 04:09 dosubot[bot]

@dosu I am getting an error:

from ragas.testset.docstore import DocumentStore
docstore = DocumentStore()

TypeError: Can't instantiate abstract class DocumentStore with abstract methods add_documents, add_nodes, get_node, get_random_nodes, get_similar

adityamity avatar Sep 13 '24 06:09 adityamity

To resolve the TypeError: Can't instantiate abstract class DocumentStore with abstract methods add_documents, add_nodes, get_node, get_random_nodes, get_similar, you need to create a concrete subclass of DocumentStore that implements all the abstract methods. Here is an example of how you can do this:

import random
from typing import List, Sequence, Union

from ragas.testset.docstore import Document, DocumentStore, Node

class MyDocumentStore(DocumentStore):
    def __init__(self):
        self.documents = {}

    def add_documents(self, docs: Sequence[Document], show_progress=True):
        for doc in docs:
            self.documents[doc.doc_id] = doc

    def add_nodes(self, nodes: Sequence[Node], show_progress=True):
        for node in nodes:
            self.documents[node.doc_id] = node

    def get_node(self, node_id: str) -> Node:
        return self.documents[node_id]

    def get_random_nodes(self, k=1) -> List[Node]:
        return random.sample(list(self.documents.values()), k)

    def get_similar(
        self, node: Node, threshold: float = 0.7, top_k: int = 3
    ) -> Union[List[Document], List[Node]]:
        # Implement your similarity logic here (e.g. cosine similarity over embeddings)
        return []

# Now you can create an instance of MyDocumentStore
docstore = MyDocumentStore()

This subclass MyDocumentStore provides concrete implementations for all the abstract methods in DocumentStore, allowing you to instantiate it without errors [1].
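As a quick check that the subclass works (the nodes variable here stands in for whatever Sequence[Node] your pipeline produces):

docstore.add_nodes(nodes)  # nodes: your Sequence[Node] from earlier
sampled = docstore.get_random_nodes(k=2)
print([n.doc_id for n in sampled])

Note that get_similar above is only a stub returning an empty list; the testset generator relies on it to find related context, so you will want a real similarity implementation before using this store for generation.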


dosubot[bot] avatar Sep 13 '24 06:09 dosubot[bot]