
Generate test data for 1 pdf

Open wanjeakshay opened this issue 1 year ago • 13 comments

I have loaded a PDF as a document and now want to generate test data from it, but an error is occurring.


ExceptionInRunner                         Traceback (most recent call last)
in <cell line: 1>()
      4 )
      5
----> 6 testset = generator.generate_with_langchain_docs(
      7     documents,
      8     test_size=10,

2 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    204         distributions = distributions or {}
    205         # chunk documents and add to docstore
--> 206         self.docstore.add_documents(
    207             [Document.from_langchain_document(doc) for doc in documents]
    208         )

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_documents(self, docs, show_progress)
    212             for d in self.splitter.transform_documents(docs)
    213         ]
--> 214         self.add_nodes(nodes, show_progress=show_progress)
    215
    216     def add_nodes(self, nodes: t.Sequence[Node], show_progress=True):

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_nodes(self, nodes, show_progress)
    251         results = executor.results()
    252         if not results:
--> 253             raise ExceptionInRunner()
    254
    255         for i, n in enumerate(nodes):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.

The code I am running:

with get_openai_callback() as cb:
    generator = TestsetGenerator.from_langchain(
        generator_llm=azure_model,
        critic_llm=azure_model,
        embeddings=azure_embeddings,
    )

    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=10,
        raise_exceptions=True,  # Change this to True to see the original exception
        with_debugging_logs=False,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )

    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")

wanjeakshay avatar Sep 17 '24 07:09 wanjeakshay

To generate test data from a loaded PDF document using the generate_with_langchain_docs method without the run aborting on ExceptionInRunner, set the raise_exceptions parameter to False. Ragas will then show a warning message instead of raising, though note this only suppresses the error; the underlying job failure still needs to be diagnosed. Here is an example of how you can modify your code:

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.callbacks import get_openai_callback
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Load the PDF(s); any loader that yields LangChain Document objects works
loader = DirectoryLoader(
    "./path_to_your_pdfs/", use_multithreading=True, silent_errors=True, sample_size=1
)
documents = loader.load()

# Initialize your Azure model and embeddings
azure_model = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    azure_deployment="your-deployment-name",
    model="your-model-name",
    validate_base_url=False,
)

azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    azure_deployment="your-deployment-name",
    model="text-embedding-ada-002",
)

# Create the test set generator
generator = TestsetGenerator.from_langchain(
    generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
)

# Generate the test set, tracking token usage with an OpenAI callback
with get_openai_callback() as cb:
    test_data = generator.generate_with_langchain_docs(
        documents=documents,
        test_size=10,  # Example test size
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
        with_debugging_logs=False,
        is_async=True,
        raise_exceptions=False,  # Set to False to show a warning instead of raising
        run_config=None,  # Or provide a specific run configuration if needed
    )

# Convert to a pandas DataFrame if needed
test_data_df = test_data.to_pandas()

print(f"Total Tokens: {cb.total_tokens}")
print(f"Completion Tokens: {cb.completion_tokens}")
print(f"Total Cost (USD): ${cb.total_cost}")

By setting raise_exceptions to False, the method will handle exceptions internally and prevent the ExceptionInRunner error from propagating [1][2].
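One environment check worth making: the traceback paths (`dist-packages`, `<cell line: …>`) indicate a notebook environment such as Colab, and a frequent cause of ExceptionInRunner there is an already-running asyncio event loop interfering with the background executor. A minimal stdlib sketch to detect that situation (helper name is illustrative, not part of the Ragas API):

```python
import asyncio

def in_running_event_loop() -> bool:
    """True if called while an asyncio event loop is running (e.g. Jupyter/Colab)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True

print(in_running_event_loop())  # False in a plain script
```

In a plain script this prints False; inside a notebook it typically prints True, in which case applying `nest_asyncio.apply()` before calling the generator is a commonly suggested workaround.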

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 17 '24 07:09 dosubot[bot]

@dosu still getting an error


ExceptionInRunner                         Traceback (most recent call last)
in <cell line: 7>()
      5
      6 # Generate the test set
----> 7 test_data = generator.generate_with_langchain_docs(
      8     documents=documents,
      9     test_size=10,  # Example test size

2 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    204         distributions = distributions or {}
    205         # chunk documents and add to docstore
--> 206         self.docstore.add_documents(
    207             [Document.from_langchain_document(doc) for doc in documents]
    208         )

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_documents(self, docs, show_progress)
    212             for d in self.splitter.transform_documents(docs)
    213         ]
--> 214         self.add_nodes(nodes, show_progress=show_progress)
    215
    216     def add_nodes(self, nodes: t.Sequence[Node], show_progress=True):

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_nodes(self, nodes, show_progress)
    251         results = executor.results()
    252         if not results:
--> 253             raise ExceptionInRunner()
    254
    255         for i, n in enumerate(nodes):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.

wanjeakshay avatar Sep 17 '24 07:09 wanjeakshay

To resolve the ExceptionInRunner error when generating test data using the generate_with_langchain_docs method, ensure that the test_data_rows list is not empty after the execution of the exec.results() method. This error is raised if test_data_rows is empty, indicating that no valid test data was generated.

Here are some steps to troubleshoot and resolve the issue:

  1. Ensure Valid Documents: Verify that the documents provided to the generate_with_langchain_docs method are valid and properly formatted.

  2. Check Distributions: Ensure that the distributions parameter sums to 1.0.

  3. Initialize Docstore Correctly: Make sure that the docstore is correctly initialized and contains the necessary documents.

  4. Set Up Evolution Processes: Verify that the evolution processes are correctly set up and can generate valid data.
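The first two checks above can be scripted before calling the generator; here is a minimal sketch (the helper name and the `SimpleNamespace` stand-in are illustrative, not part of the Ragas API):

```python
import math
from types import SimpleNamespace

def preflight_check(documents, distributions):
    """Collect problems that commonly lead to empty executor results."""
    problems = []
    if not documents:
        problems.append("no documents loaded")
    for i, doc in enumerate(documents):
        text = (getattr(doc, "page_content", "") or "").strip()
        if not text:
            problems.append(f"document {i} has empty page_content")
    total = sum(distributions.values())
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        problems.append(f"distributions sum to {total}, expected 1.0")
    return problems

# Stand-in for a LangChain Document; real code would pass LCDocument objects
docs = [SimpleNamespace(page_content="Some extractable PDF text.")]
print(preflight_check(docs, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}))  # []
```

An empty problem list does not guarantee generation will succeed, but a non-empty one points at the two failure modes listed above.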

Here is an example of how you might call the generate_with_langchain_docs method:

from langchain_core.documents import Document as LCDocument
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.exceptions import ExceptionInRunner  # caught below

# Assuming you have initialized the necessary components
generator_llm = azure_model  # Your LangchainLLM instance
critic_llm = azure_model  # Your LangchainLLM instance
embeddings = azure_embeddings  # Your LangchainEmbeddings instance
documents = [...]  # Your list of LCDocument instances

# Initialize the TestsetGenerator
testset_generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings
)

# Generate test data
try:
    test_dataset = testset_generator.generate_with_langchain_docs(
        documents=documents,
        test_size=10,  # Example test size
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
        with_debugging_logs=True,
        is_async=True,
        raise_exceptions=True
    )
except ExceptionInRunner:
    print("Failed to generate test data.")

Ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid [1].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 17 '24 07:09 dosubot[bot]

@dosu still getting same error


ExceptionInRunner                         Traceback (most recent call last)
in <cell line: 5>()
      3 )
      4
----> 5 testset = generator.generate_with_langchain_docs(
      6     documents,
      7     test_size=10,

2 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    204         distributions = distributions or {}
    205         # chunk documents and add to docstore
--> 206         self.docstore.add_documents(
    207             [Document.from_langchain_document(doc) for doc in documents]
    208         )

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_documents(self, docs, show_progress)
    212             for d in self.splitter.transform_documents(docs)
    213         ]
--> 214         self.add_nodes(nodes, show_progress=show_progress)
    215
    216     def add_nodes(self, nodes: t.Sequence[Node], show_progress=True):

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_nodes(self, nodes, show_progress)
    251         results = executor.results()
    252         if not results:
--> 253             raise ExceptionInRunner()
    254
    255         for i, n in enumerate(nodes):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.

wanjeakshay avatar Sep 17 '24 08:09 wanjeakshay

To resolve the ExceptionInRunner error when generating test data using the generate_with_langchain_docs method, ensure the following:

  1. Valid Documents: Verify that the documents provided are valid Document objects from the langchain_core.documents module.
  2. Distributions Sum to 1.0: Ensure that the distributions parameter sums to 1.0.
  3. Correctly Initialized Docstore: Make sure the docstore is correctly initialized and contains the necessary documents.
  4. Set Up Evolution Processes: Verify that the evolution processes are correctly set up and can generate valid data.

Here is an example of how you might call the generate_with_langchain_docs method:

from langchain_core.documents import Document as LCDocument
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.exceptions import ExceptionInRunner  # caught below

# Assuming you have initialized the necessary components
generator_llm = azure_model  # Your LangchainLLM instance
critic_llm = azure_model  # Your LangchainLLM instance
embeddings = azure_embeddings  # Your LangchainEmbeddings instance
documents = [...]  # Your list of LCDocument instances

# Initialize the TestsetGenerator
testset_generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings
)

# Generate test data
try:
    test_dataset = testset_generator.generate_with_langchain_docs(
        documents=documents,
        test_size=10,  # Example test size
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
        with_debugging_logs=True,
        is_async=True,
        raise_exceptions=True
    )
except ExceptionInRunner:
    print("Failed to generate test data.")

Ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid [1].

Additionally, the ExceptionInRunner is raised if the executor does not return any results after attempting to extract keyphrases for nodes that initially lacked them. Ensure that the keyphrase extraction process is functioning correctly and that the documents contain extractable content [2].
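The "extractable content" condition can be tested before handing a PDF to Ragas; the sketch below uses a heuristic character threshold that is an assumption for illustration, not a Ragas requirement:

```python
def has_extractable_text(page_texts, min_chars=200):
    """Heuristic: scanned/image-only PDFs often extract as empty or near-empty text."""
    total = sum(len(t.strip()) for t in page_texts)
    return total >= min_chars

# page_texts would come from a PDF loader, e.g. [page.get_text() for page in fitz.open(path)]
sample_pages = ["First page with real sentences. " * 10, ""]
print(has_extractable_text(sample_pages))  # True
```

If this returns False for your PDF, the file is likely image-only and needs OCR before test-set generation.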

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 17 '24 08:09 dosubot[bot]

@dosu can you give me the list of installed libraries and the supported langchain version?

wanjeakshay avatar Sep 17 '24 08:09 wanjeakshay

@dosu can we pass pdf file for test data generation?

wanjeakshay avatar Sep 18 '24 12:09 wanjeakshay

Yes, you can pass a PDF file for test data generation. First, you need to load the PDF as a document using a suitable PDF loader. Here's an example of how to do this:

  1. Install Required Libraries: Ensure you have the necessary libraries installed. You might need PyMuPDF or pdfminer.six for PDF processing.

    pip install pymupdf pdfminer.six
    
  2. Load PDF and Generate Test Data: Use the following code to load a PDF file and generate test data:

    from langchain_core.documents import Document as LCDocument
    from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
    from langchain_community.callbacks import get_openai_callback
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from ragas.exceptions import ExceptionInRunner  # caught below
    import fitz  # PyMuPDF
    
    # Function to load PDF and convert to LCDocument
    def load_pdf_as_document(file_path):
        doc = fitz.open(file_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return LCDocument(page_content=text)  # LangChain documents use `page_content`, not `content`
    
    # Load your PDF file
    pdf_path = "path_to_your_pdf_file.pdf"
    document = load_pdf_as_document(pdf_path)
    documents = [document]
    
    # Initialize your Azure model and embeddings
    azure_model = AzureChatOpenAI(
        openai_api_version="2023-05-15",
        azure_endpoint="https://<your-endpoint>.openai.azure.com/",
        azure_deployment="your-deployment-name",
        model="your-model-name",
        validate_base_url=False,
    )
    
    azure_embeddings = AzureOpenAIEmbeddings(
        openai_api_version="2023-05-15",
        azure_endpoint="https://<your-endpoint>.openai.azure.com/",
        azure_deployment="your-deployment-name",
        model="text-embedding-ada-002",
    )
    
    # Create the test set generator
    generator = TestsetGenerator.from_langchain(
        generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
    )
    
    # Generate the test set, tracking token usage with an OpenAI callback
    with get_openai_callback() as cb:
        try:
            testset = generator.generate_with_langchain_docs(
                documents=documents,
                test_size=10,
                distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
                with_debugging_logs=True,
                is_async=True,
                raise_exceptions=True,
            )
        except ExceptionInRunner:
            print("Failed to generate test data.")

    # Print token usage
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
    

This code snippet demonstrates how to load a PDF file, convert it into a Document object, and then generate test data using the generate_with_langchain_docs method. Ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 18 '24 12:09 dosubot[bot]

@wanjeakshay were you able to resolve this? I have followed all the guidance above but still getting the same error.

asanyaga avatar Sep 19 '24 10:09 asanyaga

@asanyaga No, still getting the same error for the PDF. I converted the file to text and that works, but the total cost of the test data generation was 400k tokens for a 3-page txt file.

wanjeakshay avatar Sep 20 '24 09:09 wanjeakshay

@dosu how can I pass text as input and create documents for synthetic test data generation?

wanjeakshay avatar Sep 20 '24 09:09 wanjeakshay

To pass text as input and create documents for generating synthetic text data, you can follow these steps:

  1. Load Documents: Use the LlamaIndex document loader to load documents from a directory.
  2. Set Metadata: Ensure each document has a metadata dictionary with a filename key.
  3. Generate Synthetic Data: Use Ragas' TestsetGenerator to generate synthetic Question/Context/Ground_Truth samples.

Here is the complete code to achieve this:

import os
from llama_index import SimpleDirectoryReader  # on llama-index >= 0.10: from llama_index.core import SimpleDirectoryReader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Load documents from directory
dir_path = "./your-directory"
reader = SimpleDirectoryReader(dir_path)
documents = reader.load_data()

# Ensure each document has a 'filename' in its metadata
for document in documents:
    document.metadata['filename'] = document.metadata.get('source', 'unknown')

# Initialize the generator with OpenAI models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# Generate synthetic test set
testset = generator.generate_with_llamaindex_docs(documents, test_size=10, distributions=distributions)

# Export the results to a Pandas DataFrame
df = testset.to_pandas()

This code will load your documents, set the necessary metadata, generate synthetic data, and export the results to a Pandas DataFrame. By using Ragas' evolutionary generation paradigm, you can create diverse and complex question types such as reasoning, conditioning, and multi-context, ensuring a comprehensive evaluation of your RAG pipeline [1][2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 20 '24 09:09 dosubot[bot]

Hi @dosu thank you for sharing the complete code, and I'm testing it but still got the same error without any additional useful info for debugging.

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
try:
    testset = generator.generate_with_langchain_docs(
        documents=documents[:50], 
        test_size=10, 
        distributions=distributions, 
        with_debugging_logs=True,
        is_async=True,
        raise_exceptions=True
    ) 
    testset.to_pandas()
    print(testset)
except Exception as e:
    print(e)

and the only difference is that I'm using generate_with_langchain_docs, could you please provide more guidance on how to resolve this issue? Thanks.

Josh-Chen-CanaDREAM avatar Oct 02 '24 17:10 Josh-Chen-CanaDREAM