ragas icon indicating copy to clipboard operation
ragas copied to clipboard

fix(testset): Ensure each document is used only once for question gen…

Open princepride opened this issue 1 year ago • 7 comments

…eration

Previously, the code used a nested loop to iterate over the distributions and generate questions for each document. However, this approach had a potential issue where a single document could be used multiple times for generating questions, leading to redundancy and inefficient usage of the available documents.

To address this issue, the code has been modified to use a cumulative approach for determining the range of documents assigned to each evolution type based on their probability distribution. The key changes include:

  1. Introduced a start_index variable to keep track of the starting document index for each evolution type.
  2. Calculated the end_index for each evolution type by adding the rounded value of probability * test_size to the start_index.
  3. Used an inner loop to iterate from start_index to end_index and submit tasks to the executor for each document within that range.
  4. Updated the start_index to end_index after processing each evolution type to ensure the next evolution type starts from the correct position.
  5. If total_evolutions is less than test_size after processing all evolution types, randomly selected evolution types to fill the remaining documents using the choices function.

With these modifications, each document is guaranteed to be used only once for question generation, avoiding redundancy and ensuring efficient utilization of the available documents. The cumulative probability approach ensures that the document ranges for different evolution types do not overlap, maintaining the desired probability distribution.

This fix improves the quality and diversity of the generated questions by preventing the repeated use of documents and ensuring a more balanced distribution of questions across the available documents.

princepride avatar Apr 19 '24 10:04 princepride

It's quite strange that my code changes do not involve ragas/llms/base.py, but it seems that "ChatVertexAI" is not exported from the module "langchain_community.chat_models". The error message suggests importing it from "langchain_community.chat_models.vertexai" instead.

princepride avatar Apr 20 '24 18:04 princepride

each document is guaranteed to be used only once for question generation

Do you mean document as in files or the node/embeddings?

omkar-334 avatar May 15 '24 07:05 omkar-334

each document is guaranteed to be used only once for question generation

Do you mean document as in files or the node/embeddings?

I mean current_nodes, current_nodes was initialized from:

current_nodes = [
            CurrentNodes(root_node=n, nodes=[n])
            for n in self.docstore.get_random_nodes(k=test_size)
        ]

And each time it will use the index between 0 to (probability * test_size), so in every distributions, it will always use the the front part current_nodes

princepride avatar May 15 '24 07:05 princepride

The general idea seems viable, though I'm not sure if explicit requirement to use each document only once is really what we need. I can easily imagine that for longer documents you may need

  • create more than one question verifying various topics from the document to be possible retrieved properly (e.g. I had a case recently with my RAG that due to lossy semantic chunking - more like a extensive summaries - some data were missing in the actual app)
  • the multi-context questions obviously should be able to refer to previously used documents

ciekawy avatar May 15 '24 07:05 ciekawy

I can easily imagine that for longer documents you may need

By document, they mean the current_nodes. I think the length of the document/file is irrelevant as it has been embedded into nodes. Here the node is used only once, for question generation. This node can still be used in other questions for finding relevant question.

omkar-334 avatar May 15 '24 08:05 omkar-334

I'm not sure if explicit requirement to use each document only once is really what we need.

I think you're right. But this is a recurring issue in the generated datasets. A few questions are similar with only the phrasing/wording changed. I was looking into a method such that nodes used for generating seed questions are not used again for generating, although they can be used as context for other questions. This approach seems viable though

omkar-334 avatar May 15 '24 08:05 omkar-334

Hey guys, first of all - apologies for late reply @princepride @omkar-334 @ciekawy This is an interesting issue. I have noticed it before, that is when I implemented penalizing the selection of repeated chunks using this logic here

  • wins here refers to how many times the node has been used
  • An adjustment factor is used to weigh down nodes as they increasingly get selected. On top of that now, I just merged PR #937 which randomizes the selected docs for each evolution.

What do you guys think? I am working on improving test generation this week and would love to chat with any of you guys https://cal.com/shahul-ragas/30min

shahules786 avatar May 21 '24 05:05 shahules786

closing this for now but I'm really sorry we couldn't merge it 🙁 but at the same time thanks a million for taking the time to raise this, really grateful too and do checkout this form https://docs.google.com/forms/d/e/1FAIpQLSdM9FrrZrnpByG4XxuTbcAB-zn-Z7i_a7CsMkgBVOWQjRJckg/viewform - our way of saying thank you 🙂

jjmachan avatar Nov 03 '24 05:11 jjmachan