ragas
Generate test data for 1 pdf
I have loaded a PDF as a document and now want to generate test data from it, but an error occurs.
ExceptionInRunner                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    204         distributions = distributions or {}
    205         # chunk documents and add to docstore
--> 206         self.docstore.add_documents(
    207             [Document.from_langchain_document(doc) for doc in documents]
    208         )

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_documents(self, docs, show_progress)
    212             for d in self.splitter.transform_documents(docs)
    213         ]
--> 214         self.add_nodes(nodes, show_progress=show_progress)
    215
    216     def add_nodes(self, nodes: t.Sequence[Node], show_progress=True):

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in add_nodes(self, nodes, show_progress)
    251         results = executor.results()
    252         if not results:
--> 253             raise ExceptionInRunner()
    254
    255         for i, n in enumerate(nodes):

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass raise_exceptions=False incase you want to show only a warning message instead.
The code I am running:

with get_openai_callback() as cb:
    generator = TestsetGenerator.from_langchain(
        generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
    )
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=10,
        raise_exceptions=True,  # change this to True to see the original exception
        with_debugging_logs=False,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
To generate test data from a loaded PDF document using the generate_with_langchain_docs method without encountering the ExceptionInRunner error, you should set the raise_exceptions parameter to False. This will handle exceptions internally and prevent the error from propagating. Here is an example of how you can modify your code:
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
# Assuming `documents` is a list of LCDocument objects created from the loaded PDF
loader = DirectoryLoader(
"./path_to_your_pdfs/", use_multithreading=True, silent_errors=True, sample_size=1
)
documents = loader.load()
# Initialize your Azure model and embeddings
azure_model = AzureChatOpenAI(
openai_api_version="2023-05-15",
azure_endpoint="https://<your-endpoint>.openai.azure.com/",
azure_deployment="your-deployment-name",
model="your-model-name",
validate_base_url=False,
)
azure_embeddings = AzureOpenAIEmbeddings(
openai_api_version="2023-05-15",
azure_endpoint="https://<your-endpoint>.openai.azure.com/",
azure_deployment="your-deployment-name",
model="text-embedding-ada-002",
)
# Create the test set generator
generator = TestsetGenerator.from_langchain(
generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
)
from langchain_community.callbacks import get_openai_callback

# Generate the test set, tracking token usage with the OpenAI callback
with get_openai_callback() as cb:
    test_data = generator.generate_with_langchain_docs(
        documents=documents,
        test_size=10,  # example test size
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
        with_debugging_logs=False,
        is_async=True,
        raise_exceptions=False,  # set to False to warn instead of raising ExceptionInRunner
        run_config=None,  # or provide a specific run configuration if needed
    )

# Convert to a pandas DataFrame if needed
test_data_df = test_data.to_pandas()

print(f"Total Tokens: {cb.total_tokens}")
print(f"Completion Tokens: {cb.completion_tokens}")
print(f"Total Cost (USD): ${cb.total_cost}")
Setting raise_exceptions to False makes the method log a warning instead of raising ExceptionInRunner. Note that this suppresses the symptom rather than fixing the underlying failure, so the generated test set may still be empty [1][2].
To continue talking to Dosu, mention @dosu.
@dosu still getting an error
(Same ExceptionInRunner traceback as in the original report.)
To resolve the ExceptionInRunner error when generating test data using the generate_with_langchain_docs method, ensure that the executor actually produces results: the error is raised when executor.results() returns an empty list, indicating that no valid test data was generated.
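As an illustration of why this happens, here is a small standard-library sketch (not ragas code) of an executor whose workers all fail: when the exceptions are swallowed, no results are collected, which is exactly the empty-results condition that makes ragas raise ExceptionInRunner.

```python
from concurrent.futures import ThreadPoolExecutor

def job(i):
    # Simulate a failing worker, e.g. an embedding or LLM call that errors out
    raise ValueError("embedding call failed")

results = []
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(job, i) for i in range(3)]
    for fut in futures:
        try:
            results.append(fut.result())
        except ValueError:
            pass  # swallowed, analogous to raise_exceptions=False

# Every job failed, so no results were collected
print(len(results))  # 0
```

This is why the real fix is to surface and debug the worker exception (e.g. with raise_exceptions=True) rather than only silencing it.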
Here are some steps to troubleshoot and resolve the issue:
- Ensure valid documents: verify that the documents passed to the generate_with_langchain_docs method are valid, properly formatted LangChain Document objects.
- Check distributions: ensure that the distributions parameter sums to 1.0.
- Initialize the docstore correctly: make sure the docstore is correctly initialized and contains the necessary documents.
- Set up evolution processes: verify that the evolution processes are correctly set up and can generate valid data.
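For the distributions check in particular, a quick sanity assertion before calling the generator can catch weight typos early. This sketch uses plain placeholder strings in place of the ragas evolution objects:

```python
import math

# Hypothetical distribution; plain strings stand in for the ragas evolutions
distributions = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}

total = sum(distributions.values())
assert math.isclose(total, 1.0), f"distribution weights sum to {total}, expected 1.0"
print("distributions OK")
```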
Here is an example of how you might call the generate_with_langchain_docs method:
from langchain_core.documents import Document as LCDocument
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.exceptions import ExceptionInRunner
# Assuming you have initialized the necessary components
generator_llm = azure_model # Your LangchainLLM instance
critic_llm = azure_model # Your LangchainLLM instance
embeddings = azure_embeddings # Your LangchainEmbeddings instance
documents = [...] # Your list of LCDocument instances
# Initialize the TestsetGenerator
testset_generator = TestsetGenerator.from_langchain(
generator_llm=generator_llm,
critic_llm=critic_llm,
embeddings=embeddings
)
# Generate test data
try:
test_dataset = testset_generator.generate_with_langchain_docs(
documents=documents,
test_size=10, # Example test size
distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
with_debugging_logs=True,
is_async=True,
raise_exceptions=True
)
except ExceptionInRunner:
print("Failed to generate test data.")
Ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid [1].
@dosu still getting same error
(Same ExceptionInRunner traceback as in the original report.)
To resolve the ExceptionInRunner error when generating test data using the generate_with_langchain_docs method, ensure the following:
- Valid documents: verify that the documents provided are valid Document objects from the langchain_core.documents module.
- Distributions sum to 1.0: ensure that the distributions parameter sums to 1.0.
- Correctly initialized docstore: make sure the docstore is correctly initialized and contains the necessary documents.
- Evolution processes set up: verify that the evolution processes are correctly set up and can generate valid data.
The example in the previous reply applies unchanged here; ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid [1].
Additionally, the ExceptionInRunner is raised if the executor does not return any results after attempting to extract keyphrases for nodes that initially lacked them. Ensure that the keyphrase extraction process is functioning correctly and that the documents contain extractable content [2].
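A cheap pre-check along these lines (a sketch, not a ragas API) is to confirm that each chunk of extracted text is non-empty before handing documents to the generator; scanned or image-only PDFs often yield empty or whitespace-only pages with nothing for the keyphrase extractor to work with:

```python
# Simulated page texts extracted from a PDF; a scanned page yields no text
pages = ["Real content about the topic.", "   ", ""]

usable = [p for p in pages if p.strip()]
print(f"{len(usable)} of {len(pages)} pages have extractable text")
```

If most pages come back empty, the fix is in the PDF loading step (e.g. OCR), not in the generator settings.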
@dosu can you give me a list of the installed libraries and the supported LangChain version?
@dosu can we pass a PDF file for test data generation?
Yes, you can pass a PDF file for test data generation. First, you need to load the PDF as a document using a suitable PDF loader. Here's an example of how to do this:
- Install required libraries: ensure you have the necessary libraries installed. You might need PyMuPDF or pdfminer.six for PDF processing:
  pip install pymupdf pdfminer.six
- Load the PDF and generate test data: use the following code to load a PDF file and generate test data:
from langchain_core.documents import Document as LCDocument
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.callbacks import get_openai_callback
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.exceptions import ExceptionInRunner
import fitz  # PyMuPDF

# Function to load a PDF and convert it to an LCDocument
def load_pdf_as_document(file_path):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return LCDocument(page_content=text)  # Document takes `page_content`, not `content`

# Load your PDF file
pdf_path = "path_to_your_pdf_file.pdf"
document = load_pdf_as_document(pdf_path)
documents = [document]

# Initialize your Azure model and embeddings
azure_model = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    azure_deployment="your-deployment-name",
    model="your-model-name",
    validate_base_url=False,
)
azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",
    azure_deployment="your-deployment-name",
    model="text-embedding-ada-002",
)

# Create the test set generator
generator = TestsetGenerator.from_langchain(
    generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings
)

# Generate the test set, tracking token usage
with get_openai_callback() as cb:
    try:
        testset = generator.generate_with_langchain_docs(
            documents=documents,
            test_size=10,
            distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
            with_debugging_logs=True,
            is_async=True,
            raise_exceptions=True,
        )
    except ExceptionInRunner:
        print("Failed to generate test data.")

# Print results
print(f"Total Tokens: {cb.total_tokens}")
print(f"Completion Tokens: {cb.completion_tokens}")
print(f"Total Cost (USD): ${cb.total_cost}")
This code snippet demonstrates how to load a PDF file, convert it into a Document object, and then generate test data using the generate_with_langchain_docs method. Ensure that all components (generator_llm, critic_llm, embeddings, and documents) are correctly initialized and valid.
@wanjeakshay were you able to resolve this? I have followed all the guidance above but still getting the same error.
@asanyaga No, still getting the same error for the PDF. I converted the file to text and that works, but the total cost of this test data generation was about 400k tokens for a 3-page txt file.
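For scale, using the common rough rule of thumb of about four characters per token for English text (an approximation, not a real tokenizer), a 3-page text file is only a few thousand tokens. Most of the 400k tokens reported above therefore comes from the many generation and critique LLM calls ragas makes per evolved sample, not from the document itself:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

# Assume roughly 3 pages at ~4000 characters per page
sample_text = "x" * 12000
print(estimate_tokens(sample_text))  # 3000
```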
@dosu how can I pass text as input and create documents for synthetic test data generation?
To pass text as input and create documents for generating synthetic text data, you can follow these steps:
- Load documents: use the LlamaIndex document loader to load documents from a directory.
- Set metadata: ensure each document has a metadata dictionary with a filename key.
- Generate synthetic data: use Ragas' TestsetGenerator to generate synthetic Question/Context/Ground_Truth samples.
Here is the complete code to achieve this:
import os
from llama_index import SimpleDirectoryReader  # on llama-index >= 0.10, import from llama_index.core instead
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key"
# Load documents from directory
dir_path = "./your-directory"
reader = SimpleDirectoryReader(dir_path)
documents = reader.load_data()
# Ensure each document has a 'filename' in its metadata
for document in documents:
document.metadata['filename'] = document.metadata.get('source', 'unknown')
# Initialize the generator with OpenAI models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()
generator = TestsetGenerator.from_langchain(
generator_llm,
critic_llm,
embeddings
)
# Change resulting question type distribution
distributions = {
simple: 0.5,
multi_context: 0.4,
reasoning: 0.1
}
# Generate synthetic test set
testset = generator.generate_with_llamaindex_docs(documents, test_size=10, distributions=distributions)
# Export the results to a Pandas DataFrame
df = testset.to_pandas()
This code will load your documents, set the necessary metadata, generate synthetic data, and export the results to a Pandas DataFrame. By using Ragas' evolutionary generation paradigm, you can create diverse and complex question types such as reasoning, conditioning, and multi-context, ensuring a comprehensive evaluation of your RAG pipeline [1][2][3].
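To see how the distributions map onto test_size, here is a hypothetical back-of-envelope split (plain strings stand in for the ragas evolution objects, and simple proportional rounding stands in for ragas' actual allocation logic):

```python
test_size = 10
distributions = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

# Each evolution type receives a share of test_size proportional to its weight
counts = {name: round(test_size * w) for name, w in distributions.items()}
print(counts)  # {'simple': 5, 'multi_context': 4, 'reasoning': 1}
```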
Hi @dosu, thank you for sharing the complete code. I'm testing it but still get the same error, without any additional useful info for debugging.
# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()
generator = TestsetGenerator.from_langchain(
generator_llm,
critic_llm,
embeddings
)
# Change resulting question type distribution
distributions = {
simple: 0.5,
multi_context: 0.4,
reasoning: 0.1
}
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
try:
testset = generator.generate_with_langchain_docs(
documents=documents[:50],
test_size=10,
distributions=distributions,
with_debugging_logs=True,
is_async=True,
raise_exceptions=True
)
testset.to_pandas()
print(testset)
except Exception as e:
print(e)
and the only difference is that I'm using generate_with_langchain_docs. Could you please provide more guidance on how to resolve this issue? Thanks.