BadRequestError: Unsupported data type when using the generator for testset generation
I have checked the https://github.com/explodinggradients/ragas/blob/main/docs/howtos/integrations/llamaindex.ipynb and related resources and couldn't resolve my bug.
When I try to generate the testset from the text file using generator, it gives the following error:
BadRequestError: Unsupported data type
Ragas version: 0.1.21
Python version: 3.12.2
Code to Reproduce

```python
testset = generator.generate_with_llamaindex_docs(
    documents,
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
```
Error trace

```
BadRequestError                           Traceback (most recent call last)
Cell In[17], line 2
      1 # generate testset
----> 2 testset = generator.generate_with_llamaindex_docs(
      3     documents,
      4     test_size=50,
      5     distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
      6 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\generator.py:179, in TestsetGenerator.generate_with_llamaindex_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    177 distributions = distributions or {}
    178 # chunk documents and add to docstore
--> 179 self.docstore.add_documents(
    180     [Document.from_llamaindex_document(doc) for doc in documents]
    181 )
    183 return self.generate(
    184     test_size=test_size,
    185     distributions=distributions,
   (...)
    189     raise_exceptions=raise_exceptions,
    190 )

File c:\Users\jothimalar.p.lv\Documents\LLM POCS\TestSet generation\testsetenv\Lib\site-packages\ragas\testset\docstore.py:214, in InMemoryDocumentStore.add_documents(self, docs, show_progress)
    209 # split documents with self.splitter into smaller nodes
    ...
   (...)
   1641     retries_taken=retries_taken,
   1642 )
```
Expected behavior
It should generate the testset: question and answer pairs.
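For context, the report does not show how `documents` was created; a minimal sketch of loading a text file with llama_index before calling `generate_with_llamaindex_docs` could look like this (the loader and file path are assumptions, not taken from the report):

```python
from llama_index.core import SimpleDirectoryReader

# Hypothetical loading step: read the source text file(s) into
# llama_index Document objects before handing them to Ragas.
documents = SimpleDirectoryReader(input_files=["my_notes.txt"]).load_data()
```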
Hey there, @Jothimalar1997! I'm here to help you with bugs, questions, and more. Let's tackle this issue together!
The `BadRequestError: Unsupported data type` error usually occurs when the documents are not in the expected format. For the `generate_with_llamaindex_docs` function, ensure that your documents are loaded correctly using a loader from the `llama_index` library, such as `SemanticScholarReader`.
Here's a quick guide to ensure your documents are in the correct format:

- **Load Documents Properly**: Use a loader like `SemanticScholarReader` to load your documents. Here's an example:

  ```python
  from llama_index.core import download_loader

  def get_documents():
      SemanticScholarReader = download_loader("SemanticScholarReader")
      loader = SemanticScholarReader()
      query_space = "large language models"
      documents = loader.load_data(query=query_space, limit=10)
      return documents

  documents = get_documents()
  ```

- **Pass Documents to the Function**: Ensure that the documents are passed correctly to the `generate_with_llamaindex_docs` function:

  ```python
  testset = generator.generate_with_llamaindex_docs(
      documents=documents,
      test_size=50,
      distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
  )
  ```

- **Check Document Transformation**: Ensure that the `Document.from_llamaindex_document` method is correctly transforming your documents. If there are specific requirements for this transformation, make sure they are met.
If you've followed these steps and the error persists, double-check the document structure and ensure all dependencies are correctly installed and compatible with your Ragas version [1].
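One way to double-check the document structure is a small sanity check along these lines (a sketch; the `ragas.testset.docstore` import path is assumed for ragas 0.1.x, and `documents` is the list loaded earlier):

```python
from llama_index.core import Document as LlamaDocument
from ragas.testset.docstore import Document as RagasDocument  # import path assumed for ragas 0.1.x

# Every item should be a llama_index Document with non-empty text
for doc in documents:
    assert isinstance(doc, LlamaDocument), f"unexpected type: {type(doc)}"
    assert doc.text and doc.text.strip(), "document has no extractable text"

# Convert one document by hand to see whether the transformation itself succeeds
sample = RagasDocument.from_llamaindex_document(documents[0])
print(type(sample), sample.page_content[:200])
```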
To continue talking to Dosu, mention @dosu.
@dosu How can I load my custom PDF files using SemanticScholarReader, since it is meant for querying academic papers, right?
I need to upload my own PDF files and generate a testset (question + answer pairs) from those files using the generator from Ragas.
While implementing exactly the above steps, it generates the above-mentioned error.
To load your custom PDF files using SemanticScholarReader, you can follow these steps:

- **Load Your Custom PDF Files**: If `SemanticScholarReader` is not suitable for non-academic PDFs, you might need to use a different loader that supports custom PDFs. However, if you still want to try using `SemanticScholarReader`, ensure your query is set to match your documents.

- **Generate a Test Set**: Once your documents are loaded, you can generate a test set using the generator. Here's a general approach:

  ```python
  from llama_index import download_loader
  from ragas.testset.generator import TestsetGenerator
  from ragas.testset.evolutions import simple, reasoning, multi_context
  from langchain_openai import ChatOpenAI, OpenAIEmbeddings

  # Load documents
  SemanticScholarReader = download_loader("SemanticScholarReader")
  loader = SemanticScholarReader()
  documents = loader.load_data(query="your_query", full_text=True, limit=10)

  # Initialize the generator
  generator_llm = ChatOpenAI(model="gpt-4o-mini")
  critic_llm = ChatOpenAI(model="gpt-4o")
  embeddings = OpenAIEmbeddings()

  generator = TestsetGenerator.from_langchain(
      generator_llm,
      critic_llm,
      embeddings
  )

  # Define question type distribution
  distributions = {
      simple: 0.5,
      multi_context: 0.4,
      reasoning: 0.1
  }

  # Generate the test set
  testset = generator.generate_with_langchain_docs(documents, 10, distributions)
  testset_df = testset.to_pandas()
  ```
If SemanticScholarReader doesn't work for your custom PDFs, consider using a different document loader that supports your file type. Ensure your documents are in the correct format before passing them to the generator [1].
To continue talking to Dosu, mention @dosu.
@dosu can you provide examples with other document loaders rather than SimpleDirectoryReader?
I couldn't find specific examples in the Ragas documentation for using document loaders other than SimpleDirectoryReader. However, you can consider using libraries like PyPDF2 or pdfplumber to load PDF files and then convert them into a format compatible with Ragas.
Here's a basic example using PyPDF2:
```python
import PyPDF2
from langchain_core.documents import Document

def load_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    return text

# Example usage: wrap the extracted text as a LangChain Document,
# which is the format the Ragas generator expects
pdf_text = load_pdf('your_file.pdf')
documents = [Document(page_content=pdf_text, metadata={'source': 'your_file.pdf'})]

# Now you can pass `documents` to your Ragas generator function
```
This approach extracts text from a PDF and structures it in a way that can be used with Ragas' document processing functions. Adjust the metadata as needed for your use case.
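As a follow-up, the hand-off to the generator would then be a single call (a sketch that reuses the `generator` and `distributions` defined in the earlier example):

```python
# Reuse the generator and question-type distribution from the previous example
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions=distributions)
testset_df = testset.to_pandas()
```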
To continue talking to Dosu, mention @dosu.
I was able to fix this issue by updating the embedding model API.
Ref - https://github.com/run-llama/llama_index/issues/16488
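For anyone landing here with the same error: the thread doesn't show the exact change, but one common form of "updating the embedding model API" is to construct the embedding client explicitly and pass it to the generator. The sketch below assumes an OpenAI-compatible endpoint where `check_embedding_ctx_length=False` (sending plain strings rather than pre-tokenized input) avoids the `Unsupported data type` rejection; the model names and the relevance of that flag to the linked issue are assumptions, not confirmed by this thread.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator

# Hypothetical embedding configuration: send raw strings to the embeddings
# endpoint instead of token arrays, which some OpenAI-compatible servers
# reject with "Unsupported data type".
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",     # assumed model name
    check_embedding_ctx_length=False,   # pass plain strings, not token IDs
)

generator = TestsetGenerator.from_langchain(
    ChatOpenAI(model="gpt-4o-mini"),    # generator LLM (assumed)
    ChatOpenAI(model="gpt-4o"),         # critic LLM (assumed)
    embeddings,
)
```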