
Do we need to chunk documents before testset generation?

Open hanfei1986 opened this issue 1 year ago • 3 comments

An embedding model is passed to TestsetGenerator:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embedding_model)
dataset = generator.generate_with_langchain_docs(documents, test_size=100, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

Does this mean we have to chunk the documents so that each one is shorter than the max_seq_length of the embedding model?

hanfei1986 avatar Jul 14 '24 16:07 hanfei1986

@hanfei1986 Hello there! I'm here to help you with any bugs, questions, or contributions while we wait for a human maintainer. Let's squash those bugs together! 🤖

Yes, you need to chunk documents before testset generation when using an embedding model with TestsetGenerator. This ensures that the documents do not exceed the max_seq_length of the embedding model. The TokenTextSplitter is typically used for this purpose:

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)

This splitter will chunk the documents into smaller pieces before they are processed and stored in the InMemoryDocumentStore [1][2][3].
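To make the chunk_size/chunk_overlap semantics concrete, here is a minimal sketch of a sliding-window splitter. This is illustrative only, not ragas or LangChain code: the real TokenTextSplitter counts model tokens, whereas this sketch uses whitespace-separated words as stand-in tokens, and the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of at most chunk_size tokens,
    with consecutive windows sharing chunk_overlap tokens."""
    tokens = text.split()  # stand-in for real model tokenization
    step = chunk_size - chunk_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks

# A 2500-"token" document yields chunks that all fit the size budget,
# and consecutive chunks share 100 tokens at their boundary.
doc = " ".join(f"w{i}" for i in range(2500))
chunks = split_with_overlap(doc, chunk_size=1000, chunk_overlap=100)
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.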

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Jul 14 '24 17:07 dosubot[bot]

@hanfei1986 @dosu was wrong here: don't chunk the documents. We do the chunking ourselves, in a way that makes sense for the testset generation module.

jjmachan avatar Jul 30 '24 06:07 jjmachan

You do not need to chunk the documents before testset generation. The TestsetGenerator module handles the chunking of documents internally in a way that makes sense for the testset generation process. This means you can pass your documents directly to the generate_with_langchain_docs method without worrying about their length relative to the max_seq_length of the embedding model.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Jul 30 '24 06:07 dosubot[bot]

It seems the issue was answered, closing this now.

github-actions[bot] avatar Oct 29 '25 08:10 github-actions[bot]