Feature: Enhanced Document Splitting with Contextual Chunk Headers

Open mryraghi opened this issue 2 years ago • 4 comments

This PR enhances the user's selected splitting strategy by enabling the addition of a chunk header to each generated chunk, as well as a continuation header for subsequent chunks (e.g., (cont'd)). Because every chunk carries the same user-defined context, vector stores and large language models (LLMs) can more easily tell which chunks belong together.

Use case: Consider a scenario where you want to store a collection of large documents in a vector store and perform Q&A tasks on them. Simply splitting documents with overlapping text may not provide sufficient context for LLMs to determine if multiple chunks are referencing the same information. This PR addresses this issue by allowing users to include additional contextual information in chunk headers.

Example:

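import { CharacterTextSplitter } from "langchain/text_splitter";

// Placeholder values chosen to match the result shown below; use your own document fields.
const _id = "123";
const firstName = "file.tx";

const CHUNK_SIZE = 1536;
const chunkHeader = `DOC ID: ${_id}\n\nDOC NAME: ${firstName}\n\n---\n\n`;
const chunkBody = "large text...";

// `chunkHeader` as a constructor option is the API proposed in this PR.
const splitter = new CharacterTextSplitter({
  chunkSize: CHUNK_SIZE,
  chunkOverlap: 100,
  chunkHeader,
});

const result = await splitter.createDocuments([chunkBody]);
console.log(result);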

Result:

// console.log(result);

[
  {
    "pageContent": "DOC ID: 123\n\nDOC NAME: file.tx\n\n---\n\n<...CHUNK 1...>",
    "metadata": {
      "loc": {
        "lines": {
          "from": 1,
          "to": 49
        }
      }
    }
  },
  {
    "pageContent": "DOC ID: 123\n\nDOC NAME: file.tx\n\n---\n\n(cont'd) <...CHUNK 2...>",
    "metadata": {
      "loc": {
        "lines": {
          "from": 49,
          "to": 66
        }
      }
    }
  },
  ...
]
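
To make the shape of that output concrete, here is a minimal, library-free sketch of how a header and a continuation marker could be attached to chunks. It is not the PR's implementation: `splitWithHeader`, `SplitDoc`, and the naive fixed-size character splitting are hypothetical stand-ins, and as a simplification the header is not counted toward `chunkSize`.

```typescript
// Toy sketch under stated assumptions; names here are hypothetical, not langchainjs APIs.
interface SplitDoc {
  pageContent: string;
}

function splitWithHeader(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
  chunkHeader: string,
  continuationHeader = "(cont'd) " // assumed default, mirroring the example output
): SplitDoc[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const out: SplitDoc[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    const body = text.slice(start, start + chunkSize);
    // Every chunk gets the shared header; chunks after the first also get the marker.
    const prefix = start === 0 ? chunkHeader : chunkHeader + continuationHeader;
    out.push({ pageContent: prefix + body });
    if (start + chunkSize >= text.length) break; // the text is fully covered
  }
  return out;
}

const exampleDocs = splitWithHeader("abcdefghij", 4, 1, "DOC ID: 123\n\n---\n\n");
// exampleDocs[0].pageContent === "DOC ID: 123\n\n---\n\nabcd"
// exampleDocs[1].pageContent === "DOC ID: 123\n\n---\n\n(cont'd) defg"
```

Because the header text is part of `pageContent`, it is embedded along with the chunk body, which is what lets a similarity search pick up the shared context.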

mryraghi avatar Apr 23 '23 15:04 mryraghi

Would it be better to just add this information to the metadata field? Then, when querying the vector store, you could filter by the metadata before running the similarity search.

justindra avatar Apr 25 '23 18:04 justindra

Imagine you want to save a bunch of books from a library into a vector store and then ask questions (through LLMs) such as "Suggest books about topic X." In such a scenario, you wouldn't filter by metadata.bookId as you don't know it in advance. The vector store + LLM would need to understand that some chunks are contextually related while others are not.

By putting headers like these, which seem to be referenced frequently, I've been able to increase the quality of results drastically.

mryraghi avatar Apr 26 '23 13:04 mryraghi

Ah yep, gotcha. Sorry, I misunderstood as your example had the DOC ID and I thought that's what you meant. Topic makes more sense as an example.

justindra avatar Apr 27 '23 05:04 justindra

This looks really cool! Sorry for the wait.

I'd love to do a little experimenting and comparing as well - one thing I'm curious about is the benefit of the chunkOverlapHeader? How does adding/removing it affect results?

Would also love to have a more concrete example use-case (like the one you have above in this PR) added to the docs - it's a bit abstract right now.

jacoblee93 avatar May 19 '23 18:05 jacoblee93

Moved the chunk header into the createDocuments method itself @mryraghi as I think a common use case would be re-using a text splitter for multiple documents rather than needing to initialize a new one each time.

This is really cool though, thank you!
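
The design change described above, passing the header per call so that one splitter can serve many documents, can be sketched roughly as follows. This is a toy stand-in, not the library's code: `ToySplitter` and `HeaderOptions` are hypothetical, the option names `chunkHeader` and `chunkOverlapHeader` are taken from this thread rather than from a verified signature, and the splitting is naive fixed-size slicing.

```typescript
// Toy sketch under stated assumptions; not the langchainjs implementation.
interface HeaderDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical per-call options, named after the terms used in this thread.
interface HeaderOptions {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
}

class ToySplitter {
  private readonly chunkSize: number;
  private readonly chunkOverlap: number;

  constructor(chunkSize: number, chunkOverlap: number) {
    if (chunkOverlap >= chunkSize) {
      throw new Error("chunkOverlap must be smaller than chunkSize");
    }
    this.chunkSize = chunkSize;
    this.chunkOverlap = chunkOverlap;
  }

  private split(text: string): string[] {
    const chunks: string[] = [];
    const step = this.chunkSize - this.chunkOverlap;
    for (let start = 0; start < text.length; start += step) {
      chunks.push(text.slice(start, start + this.chunkSize));
      if (start + this.chunkSize >= text.length) break;
    }
    return chunks;
  }

  // Headers are supplied per call, so the splitter itself holds no document-specific state.
  createDocuments(
    texts: string[],
    metadatas: Record<string, unknown>[] = [],
    options: HeaderOptions = {}
  ): HeaderDoc[] {
    const chunkHeader = options.chunkHeader || "";
    const chunkOverlapHeader = options.chunkOverlapHeader || "";
    const docs: HeaderDoc[] = [];
    texts.forEach((text, i) => {
      this.split(text).forEach((chunk, j) => {
        docs.push({
          pageContent: chunkHeader + (j > 0 ? chunkOverlapHeader : "") + chunk,
          metadata: { ...(metadatas[i] || {}) },
        });
      });
    });
    return docs;
  }
}

// One splitter, two documents, each with its own header.
const toySplitter = new ToySplitter(4, 1);
const jimDocs = toySplitter.createDocuments(["abcdefg"], [], {
  chunkHeader: "DOC: Jim\n",
  chunkOverlapHeader: "(cont'd) ",
});
const pamDocs = toySplitter.createDocuments(["hijklmn"], [], {
  chunkHeader: "DOC: Pam\n", // no overlap header, for comparison
});
```

Omitting `chunkOverlapHeader` in the second call gives a simple way to compare results with and without the continuation marker, which is the experiment raised earlier in the thread.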

jacoblee93 avatar May 23 '23 00:05 jacoblee93

Thanks for looking into this and further improving it!

mryraghi avatar May 23 '23 07:05 mryraghi