fix(testset): Ensure each document is used only once for question generation
Previously, the code used a nested loop to iterate over the distributions and generate questions for each document. However, this approach had a flaw: a single document could be used multiple times for question generation, leading to redundant questions and inefficient use of the available documents.
To address this, the code now uses a cumulative approach to determine the range of documents assigned to each evolution type based on its probability. The key changes (a short sketch follows the list) include:
- Introduced a `start_index` variable to keep track of the starting document index for each evolution type.
- Calculated the `end_index` for each evolution type by adding the rounded value of `probability * test_size` to the `start_index`.
- Used an inner loop to iterate from `start_index` to `end_index` and submit tasks to the executor for each document within that range.
- Updated `start_index` to `end_index` after processing each evolution type, so the next evolution type starts from the correct position.
- If `total_evolutions` is less than `test_size` after processing all evolution types, randomly selected evolution types to fill the remaining documents using the `choices` function.
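Here is a minimal sketch of that cumulative-index assignment. The function name, the `submit` callback, and the dict shape are illustrative stand-ins, not the actual PR code:

```python
import random

def assign_evolutions(distributions, current_nodes, test_size, submit):
    """Assign each node to exactly one evolution type via cumulative indexing.

    Sketch only: `distributions` maps evolution type -> probability, and
    `submit` stands in for the executor call used in the actual PR.
    """
    start_index = 0
    for evolution, probability in distributions.items():
        # Each evolution type gets a contiguous, non-overlapping slice.
        end_index = start_index + round(probability * test_size)
        for node in current_nodes[start_index:end_index]:
            submit(evolution, node)
        # The next evolution type starts where this one ended.
        start_index = end_index

    # Rounding can leave a few nodes unassigned; fill the remainder with
    # randomly chosen evolution types (mirrors the `choices` step above).
    if start_index < test_size:
        filler = random.choices(list(distributions), k=test_size - start_index)
        for evolution, node in zip(filler, current_nodes[start_index:test_size]):
            submit(evolution, node)
```

Because `start_index` only ever moves forward, no node index is visited twice across evolution types.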
With these modifications, each document is guaranteed to be used only once for question generation, avoiding redundancy and ensuring efficient utilization of the available documents. The cumulative probability approach ensures that the document ranges for different evolution types do not overlap, maintaining the desired probability distribution.
This fix improves the quality and diversity of the generated questions by preventing the repeated use of documents and ensuring a more balanced distribution of questions across the available documents.
It's quite strange that my code changes do not involve ragas/llms/base.py, but it seems that "ChatVertexAI" is not exported from the module "langchain_community.chat_models". The error message suggests importing it from "langchain_community.chat_models.vertexai" instead.
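For reference, this is the import path the error message suggests; module layouts have shifted between `langchain_community` releases, so treat it as a version-dependent fix:

```python
# Import path suggested by the error message; in other releases the
# top-level `langchain_community.chat_models` re-export may work instead.
from langchain_community.chat_models.vertexai import ChatVertexAI
```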
> each document is guaranteed to be used only once for question generation

Do you mean document as in files, or the node/embeddings?
I mean `current_nodes`, which was initialized from:

```python
current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]
```
And each time it will use indices between 0 and `(probability * test_size)`, so for every distribution it will always use the front part of `current_nodes`.
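To illustrate the overlap with hypothetical numbers (two evolution types, `test_size = 4`): under the old logic every type indexes from 0, so the same front nodes are reused and the tail is never touched:

```python
test_size = 4
distributions = {"simple": 0.5, "reasoning": 0.5}

# Old behaviour: every evolution type starts at index 0, so both types
# draw nodes 0 and 1, while nodes 2 and 3 are never used.
for name, probability in distributions.items():
    used = list(range(round(probability * test_size)))
    print(name, "uses node indices", used)
# simple uses node indices [0, 1]
# reasoning uses node indices [0, 1]
```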
The general idea seems viable, though I'm not sure an explicit requirement to use each document only once is really what we need. I can easily imagine that for longer documents you may need to:
- create more than one question, to verify that various topics from the document can be retrieved properly (e.g. I recently had a case with my RAG where, due to lossy semantic chunking - more like extensive summaries - some data was missing in the actual app)
- let multi-context questions refer to previously used documents
> I can easily imagine that for longer documents you may need
By document, they mean `current_nodes`. I think the length of the document/file is irrelevant, as it has been embedded into nodes. Here the node is used only once for question generation; it can still be used in other questions as relevant context.
> I'm not sure if explicit requirement to use each document only once is really what we need.
I think you're right. But this is a recurring issue in the generated datasets: a few questions are near-duplicates, with only the phrasing/wording changed. I was looking into a method where nodes used for generating seed questions are not used again for generation, although they can still be used as context for other questions. This approach seems viable, though.
Hey guys, first of all - apologies for the late reply @princepride @omkar-334 @ciekawy. This is an interesting issue. I noticed it before, which is when I implemented penalizing the selection of repeated chunks, using this logic here:
- `wins` here refers to how many times the node has been used
- an adjustment factor is used to weigh down nodes as they increasingly get selected (a rough sketch follows below)

On top of that, I just merged PR #937, which randomizes the selected docs for each evolution.
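Roughly, that penalization amounts to weighted sampling where a node's weight shrinks with its `wins` count. A sketch under assumed names - the exponential decay and the `adjustment_factor` default are illustrative, not the exact ragas logic:

```python
import random

def pick_node(nodes, wins, adjustment_factor=0.5):
    """Sample a node, down-weighting nodes by how often they were used.

    Sketch only: `wins[i]` counts prior selections of nodes[i]; the
    exponential decay is an assumed stand-in for the real adjustment.
    """
    weights = [adjustment_factor ** wins[i] for i in range(len(nodes))]
    idx = random.choices(range(len(nodes)), weights=weights, k=1)[0]
    wins[idx] += 1
    return nodes[idx]
```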
What do you guys think? I am working on improving test generation this week and would love to chat with any of you: https://cal.com/shahul-ragas/30min
Closing this for now. I'm really sorry we couldn't merge it 🙁 but at the same time, thanks a million for taking the time to raise this - really grateful. Do check out this form https://docs.google.com/forms/d/e/1FAIpQLSdM9FrrZrnpByG4XxuTbcAB-zn-Z7i_a7CsMkgBVOWQjRJckg/viewform - our way of saying thank you 🙂