langchainjs keepSeparator doesn't work with RecursiveCharacterTextSplitter

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain.js documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain.js rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

        
const text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.\nNunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.\nDonec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.\nPellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.";

const document = new Document({pageContent: text});

const textSplitter = new RecursiveCharacterTextSplitter({chunkSize: 200, chunkOverlap: 40, keepSeparator: true});

const splitDocs = await textSplitter.splitDocuments([document], {
   chunkHeader: "CHUNK HEADER",
   appendChunkOverlapHeader: true,
});

console.log(JSON.stringify(splitDocs, null, 2));

Error Message and Stack Trace (if applicable)

No response

Description

I get these chunks:

You can see that second chunks starts with CHUNK HEADER(cont'd) Nunc, but I would expect it to start with CHUNK HEADER(cont'd) Nunc OR the first chunk should end with \n.

The \n between vulputate nec.\nNunc euismod is lost.

[
   {
      "pageContent":"CHUNK HEADERLorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.",
      "metadata":{
         "loc":{
            "lines":{
               "from":1,
               "to":2
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Nunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.",
      "metadata":{
         "loc":{
            "lines":{
               "from":3,
               "to":3
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Donec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":4,
               "to":5
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Pellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":6,
               "to":6
            }
         }
      }
   }
]

System Info

[email protected] macOS node 18.20.2

Apr 19 '24 08:04 guidev

It looks like the issue you're encountering with the keepSeparator option not working as expected in the RecursiveCharacterTextSplitter is due to how separators are handled during the text splitting process. Specifically, when keepSeparator is set to true, the implementation should ensure that the separator (\n in your case) is preserved between chunks. However, it seems like the current logic mistakenly omits the separator from the final output.

To address this, you can modify the part of the code where the separator is managed. The correct approach would be to include the separator in the merged text only when keepSeparator is true. This involves adjusting the logic in the mergeSplits method or similar, where the decision to append the separator to the text is made.

Here's a brief outline of what the corrected code might look like:

mergeSplits(splits: string[], separator: string): string[] {
    // Adjust the logic here to ensure separator is included based on the keepSeparator flag
    const doc = this.joinDocs(currentDoc, this.keepSeparator ? separator : "");
    // Further logic to handle the splits
}

This adjustment ensures that the separator is correctly preserved in the output when keepSeparator is true, addressing the issue you've observed.

Given this insight, I recommend applying this fix to your local copy of the LangChain.js library and testing to see if it resolves the problem. If successful, consider contributing this fix back to the LangChain.js repository to help others who might face the same issue.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Apr 19 '24 08:04 dosubot[bot]

In simple terms, the work of the Splitter can be summarized in the following steps:

Perform text segmentation, retaining the separating symbols if the keepSeperator flag is set, such as \n.
Iterate through the segmented strings, and if the length is less than the set chunkSize, merge them sequentially.
After merging, iterate through each merged string to trim it, removing any leading or trailing whitespace characters.
Generate each Chunk based on the merged result, attaching a Chunk Header to each.

https://github.com/langchain-ai/langchainjs/blob/a8f7f5c4bd3f09fc5e3dd63b3e81426f4472125d/libs/langchain-textsplitters/src/text_splitter.ts#L179-L182

The cause of this issue is that when merging to generate pageContent, the method String.prototype.trim is used to filter out spaces at the beginning and end of the string. However, \n is also filtered out by trim. Therefore, when there is a split with a \n at the beginning, the \n at the beginning and end will be removed.

@guidev

Apr 25 '24 08:04 jeasonnow

However, based on the following reasons, I believe this issue may not require modification:

Leading and trailing whitespace characters do not affect the semantics of the text split out.
If excessive special whitespace characters (\n) appear at the beginning and end without filtering, it may generate more chunks, leading to additional token consumption.

So I closed my PR. If you have any better suggestions, we can discuss them.

Apr 25 '24 10:04 jeasonnow

I found this to be a problem, this is my workaround:

// Fixed version of RecursiveCharacterTextSplitter which respects keepSeparator
class FixedRecursiveCharacterTextSplitter extends RecursiveCharacterTextSplitter {
  async mergeSplits(splits: string[], separator: string): Promise<string[]> {
    const docs: string[] = [];
    const currentDoc: string[] = [];
    let total = 0;
    for (const d of splits) {
      const _len = await this.lengthFunction(d);
      if (
        total + _len + currentDoc.length * separator.length >
        this.chunkSize
      ) {
        if (total > this.chunkSize) {
          console.warn(
            `Created a chunk of size ${total}, +
which is longer than the specified ${this.chunkSize}`,
          );
        }
        if (currentDoc.length > 0) {
          const doc = this._joinDocs(currentDoc, separator);
          if (doc !== null) {
            docs.push(doc);
          }
          // Keep on popping if:
          // - we have a larger chunk than in the chunk overlap
          // - or if we still have any chunks and the length is long
          while (
            total > this.chunkOverlap ||
            (total + _len + currentDoc.length * separator.length >
              this.chunkSize &&
              total > 0)
          ) {
            total -= await this.lengthFunction(currentDoc[0]);
            currentDoc.shift();
          }
        }
      }
      currentDoc.push(d);
      total += _len;
    }
    const doc = this._joinDocs(currentDoc, separator);
    if (doc !== null) {
      docs.push(doc);
    }
    return docs;
  }

  // Do not trim if keepSeparator is enabled, or whitespace separators may be erased
  private _joinDocs(docs: string[], separator: string): string | null {
    const text = this.keepSeparator
      ? docs.join(separator)
      : docs.join(separator).trim();
    return text === "" ? null : text;
  }
}

If this is an acceptable solution, I'll make a PR.

May 02 '24 23:05 wcummings

langchainjs langchainjs copied to clipboard

keepSeparator doesn't work with RecursiveCharacterTextSplitter

Checked other resources

Example Code

Error Message and Stack Trace (if applicable)

Description

System Info

Sources

About Dosu

langchainjs
langchainjs copied to clipboard