
total bin coverage for default_transform() in Knowledge Graph transformations

Open · tolgaerdonmez opened this issue 9 months ago

Problem

default_transform() only covers token lengths up to 100k (the 0–100k interval), which it separates into three bins. For documents whose token length falls outside those bins (>100k, or 0), the function raises the following:

    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    )

That message covers the case of empty documents, but it is also raised for documents that exceed the 100k upper bound, where "too short" is misleading.
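For context, a minimal sketch of the binning logic as I understand it (not the exact Ragas source; `get_bin` and the bin values are reconstructed from the behavior described above):

    bin_ranges = [(0, 100), (101, 500), (501, 100_000)]

    def get_bin(token_count):
        """Return the index of the bin that token_count falls into."""
        for i, (low, high) in enumerate(bin_ranges):
            if low <= token_count <= high:
                return i
        # A 150k-token document falls through to here and receives the
        # misleading "too short" error, even though it is too long.
        raise ValueError(
            "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
        )

    get_bin(150_000)  # raises ValueError despite the document being very long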

Solution (Currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper bound to infinity. This solves the problem easily, but could be inefficient for very large documents.

    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]

Better Solution Proposal (Let's discuss this)

If the given document is larger than 100k tokens, split it in half and start the transformation again, repeating until each piece fits into the initial bin sizes.
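For concreteness, a rough sketch of that idea (splitting at the character midpoint for simplicity; a real implementation would split on token boundaries, and `count_tokens` here is a stand-in for whatever token counter Ragas uses):

    MAX_TOKENS = 100_000  # upper bound of the last bin

    def split_to_fit(text, count_tokens, max_tokens=MAX_TOKENS):
        """Recursively halve a document until every piece fits within the bins."""
        if count_tokens(text) <= max_tokens:
            return [text]
        mid = len(text) // 2
        return (split_to_fit(text[:mid], count_tokens, max_tokens)
                + split_to_fit(text[mid:], count_tokens, max_tokens))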

tolgaerdonmez avatar Mar 05 '25 10:03 tolgaerdonmez

I've found another solution: split the document into chunks of half the total token length using LangChain's text splitters with overlap, using the same token counting function that Ragas uses internally as the splitter's length function.
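Roughly like this (a hedged sketch: `tiktoken` stands in here for Ragas's own token counter, and the overlap value and placeholder document are arbitrary):

    from langchain_text_splitters import RecursiveCharacterTextSplitter
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def num_tokens(text):
        return len(enc.encode(text))

    document = "some very long document text " * 20_000  # placeholder input
    total = num_tokens(document)

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=total // 2,       # half of the total token length
        chunk_overlap=200,           # keep some overlap between chunks
        length_function=num_tokens,  # count length in tokens, not characters
    )
    chunks = splitter.split_text(document)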

tolgaerdonmez avatar Mar 09 '25 14:03 tolgaerdonmez

Hey @tolgaerdonmez! 👋

Hope you're doing well! I really loved your work on improving bin coverage for default_transform() in Knowledge Graph transformations - it's exactly the kind of thoughtful improvement that makes Ragas better for everyone.

Quick question for you - we're trying to figure out what to do with the Testset Generation module as we gear up for v0.4, and since you've been working in this space, I'd love to get your take on it.

Mind checking out this discussion when you have a moment? 🔗 https://github.com/explodinggradients/ragas/issues/2231

Basically we're wondering if we should keep it as part of the core library, spin it off into its own thing, or maybe even retire it if folks aren't really using it much. No pressure at all, but given your experience with knowledge graph transformations and document processing, your perspective would be super helpful!

Just drop a 👍 👎 or 🚀 on the issue, or feel free to share any thoughts you have.

Thanks for being awesome! 🙏

jjmachan avatar Aug 28 '25 13:08 jjmachan

> I've found another solution: split the document into chunks of half the total token length using LangChain's text splitters with overlap, using the same token counting function that Ragas uses internally as the splitter's length function.

This might be a bit of over-engineering, but we can perhaps take it up as a discussion in a new issue. This fix is good to go.

anistark avatar Sep 08 '25 08:09 anistark