BadRequestError: Unsupported data type when using the generator for testset generation
I have checked the https://github.com/explodinggradients/ragas/blob/main/docs/howtos/integrations/llamaindex.ipynb and related resources and couldn't resolve my bug.
When I try to generate the testset from the text file using generator, it gives the following error:
BadRequestError: Unsupported data type
Ragas version: 0.1.21
Python version: 3.12.2
Code to Reproduce

```python
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
```
Error trace

```
BadRequestError                           Traceback (most recent call last)
Cell In[17], line 2
      1 # generate testset
----> 2 testset = generator.generate_with_llamaindex_docs(
      3     documents,
      4     test_size=50,
      5     distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
      6 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\generator.py:179, in TestsetGenerator.generate_with_llamaindex_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    177 distributions = distributions or {}
    178 # chunk documents and add to docstore
--> 179 self.docstore.add_documents(
    180     [Document.from_llamaindex_document(doc) for doc in documents]
    181 )
    183 return self.generate(
    184     test_size=test_size,
    185     distributions=distributions,
   (...)
    189     raise_exceptions=raise_exceptions,
    190 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\docstore.py:214, in InMemoryDocumentStore.add_documents(self, docs, show_progress)
    209 # split documents with self.splitter into smaller nodes
    ...
   (...)
   1641     retries_taken=retries_taken,
   1642 )
```
Expected behavior
It should generate the testset: question and answer pairs.
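For context, the report does not show how `documents` was created; a minimal sketch of loading a text file with llama_index before calling `generate_with_llamaindex_docs` could look like this (the loader and file path are assumptions, not taken from the report):

```python
from llama_index.core import SimpleDirectoryReader

# Hypothetical loading step: read the source text file(s) into
# llama_index Document objects before handing them to Ragas.
documents = SimpleDirectoryReader(input_files=["my_notes.txt"]).load_data()
```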
Hey there, @Jothimalar1997! I'm here to help you with bugs, questions, and more. Let's tackle this issue together!
The `BadRequestError: Unsupported data type` error usually occurs when the documents are not in the expected format. For the `generate_with_llamaindex_docs` function, ensure that your documents are loaded correctly using a loader from the `llama_index` library, such as `SemanticScholarReader`.
Here's a quick guide to ensure your documents are in the correct format:

- **Load Documents Properly**: Use a loader like `SemanticScholarReader` to load your documents. Here's an example:

  ```python
  from llama_index.core import download_loader

  def get_documents():
      SemanticScholarReader = download_loader("SemanticScholarReader")
      loader = SemanticScholarReader()
      query_space = "large language models"
      documents = loader.load_data(query=query_space, limit=10)
      return documents

  documents = get_documents()
  ```

- **Pass Documents to the Function**: Ensure that the documents are passed correctly to the `generate_with_llamaindex_docs` function:

  ```python
  testset = generator.generate_with_llamaindex_docs(
      documents=documents,
      test_size=50,
      distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
  )
  ```

- **Check Document Transformation**: Ensure that the `Document.from_llamaindex_document` method is correctly transforming your documents. If there are specific requirements for this transformation, make sure they are met.
If you've followed these steps and the error persists, double-check the document structure and ensure all dependencies are correctly installed and compatible with your Ragas version [1].
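One way to double-check the document structure is a small sanity check along these lines (a sketch; the `ragas.testset.docstore` import path is assumed for ragas 0.1.x, and `documents` is the list loaded earlier):

```python
from llama_index.core import Document as LlamaDocument
from ragas.testset.docstore import Document as RagasDocument  # import path assumed for ragas 0.1.x

# Every item should be a llama_index Document with non-empty text
for doc in documents:
    assert isinstance(doc, LlamaDocument), f"unexpected type: {type(doc)}"
    assert doc.text and doc.text.strip(), "document has no extractable text"

# Convert one document by hand to see whether the transformation itself succeeds
sample = RagasDocument.from_llamaindex_document(documents[0])
print(type(sample), sample.page_content[:200])
```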
To continue talking to Dosu, mention @dosu.
@dosu How can I load my custom PDF files using SemanticScholarReader, since it is meant for querying academic papers, right?
I need to upload my own PDF files and generate a testset (question + answer pairs) from those files using the generator from Ragas.
While implementing exactly the above steps, it generates the above-mentioned error.
To load your custom PDF files using SemanticScholarReader, you can follow these steps:

- **Load Your Custom PDF Files**: If `SemanticScholarReader` is not suitable for non-academic PDFs, you might need to use a different loader that supports custom PDFs. However, if you still want to try using `SemanticScholarReader`, ensure your query is set to match your documents.

- **Generate a Test Set**: Once your documents are loaded, you can generate a test set using the generator. Here's a general approach:

  ```python
  from llama_index import download_loader
  from ragas.testset.generator import TestsetGenerator
  from ragas.testset.evolutions import simple, reasoning, multi_context
  from langchain_openai import ChatOpenAI, OpenAIEmbeddings

  # Load documents
  SemanticScholarReader = download_loader("SemanticScholarReader")
  loader = SemanticScholarReader()
  documents = loader.load_data(query="your_query", full_text=True, limit=10)

  # Initialize the generator
  generator_llm = ChatOpenAI(model="gpt-4o-mini")
  critic_llm = ChatOpenAI(model="gpt-4o")
  embeddings = OpenAIEmbeddings()

  generator = TestsetGenerator.from_langchain(
      generator_llm,
      critic_llm,
      embeddings
  )

  # Define question type distribution
  distributions = {
      simple: 0.5,
      multi_context: 0.4,
      reasoning: 0.1
  }

  # Generate the test set
  testset = generator.generate_with_langchain_docs(documents, 10, distributions)
  testset_df = testset.to_pandas()
  ```
If SemanticScholarReader doesn't work for your custom PDFs, consider using a different document loader that supports your file type. Ensure your documents are in the correct format before passing them to the generator [1].
To continue talking to Dosu, mention @dosu.
@dosu can you provide examples with other document loaders rather than SimpleDirectoryReader?
I couldn't find specific examples in the Ragas documentation for using document loaders other than SimpleDirectoryReader. However, you can consider using libraries like PyPDF2 or pdfplumber to load PDF files and then convert them into a format compatible with Ragas.
Here's a basic example using PyPDF2:
```python
import PyPDF2
from langchain_core.documents import Document

def load_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    return text

# Example usage: wrap the extracted text as a LangChain Document,
# which is the format the Ragas generator expects
pdf_text = load_pdf('your_file.pdf')
documents = [Document(page_content=pdf_text, metadata={'source': 'your_file.pdf'})]

# Now you can pass `documents` to your Ragas generator function
```
This approach extracts text from a PDF and structures it in a way that can be used with Ragas' document processing functions. Adjust the metadata as needed for your use case.
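As a follow-up, the hand-off to the generator would then be a single call (a sketch that reuses the `generator` and `distributions` defined in the earlier example):

```python
# Reuse the generator and question-type distribution from the previous example
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions=distributions)
testset_df = testset.to_pandas()
```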
To continue talking to Dosu, mention @dosu.
I was able to fix this issue by updating the embedding model API.
Ref - https://github.com/run-llama/llama_index/issues/16488
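For anyone landing here with the same error: the thread doesn't show the exact change, but one common form of "updating the embedding model API" is to construct the embedding client explicitly and pass it to the generator. The sketch below assumes an OpenAI-compatible endpoint where `check_embedding_ctx_length=False` (sending plain strings rather than pre-tokenized input) avoids the `Unsupported data type` rejection; the model names and the relevance of that flag to the linked issue are assumptions, not confirmed by this thread.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator

# Hypothetical embedding configuration: send raw strings to the embeddings
# endpoint instead of token arrays, which some OpenAI-compatible servers
# reject with "Unsupported data type".
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",     # assumed model name
    check_embedding_ctx_length=False,   # pass plain strings, not token IDs
)

generator = TestsetGenerator.from_langchain(
    ChatOpenAI(model="gpt-4o-mini"),    # generator LLM (assumed)
    ChatOpenAI(model="gpt-4o"),         # critic LLM (assumed)
    embeddings,
)
```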