langchain MarkdownTextSplitter removes formatting and line breaks

I was trying to use MarkdownTextSplitter to translate a document and maintain formatting, but I noticed that the splitter removed formatting from the markdown when splitting it.

As an example, splitting the following markdown with chunk_size=200 removes the "## " from the Features line, as well as the line breaks before and after that line.

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨

## Features

- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

When split using this code:

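from langchain.text_splitter import MarkdownTextSplitter

# markdown_document is a string holding the markdown shown above
markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_document])

for doc in docs:
    print(doc.page_content)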

The output becomes:

# Dillinger

- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--

The formatting and line breaks around the "Features" line are removed. Expected behavior would be that each split doc, when combined, would be the original text.

Solution would be to never have formatting and line breaks removed, or, add the removed prefix/suffix in metadata or other keys so they could be used to re-construct the document with intact formatting.

Full code example

~: pip show langchain
Name: langchain
Version: 0.0.138

Apr 13 '23 15:04 vbelius

These separators are removed from the text using the split() function here

Some separators should be re-added at the beginning of the text, like "## ", and others at the end, like "."

This would mean that for each separator we should specify how they should be re-added

Additionally, line breaks are removed here
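To illustrate the first point with the standard library only (this is a sketch, not the LangChain source): str.split() throws the separator away, while re.split() with a capturing group hands it back so it can be re-attached to the start of the following piece. The sample text and variable names are made up for the example.

```python
import re

text = "# Dillinger\n\n- Magic\n\n## Features\n\n- Import a HTML file"
separator = "\n## "

# str.split() discards the separator, so the heading marker is lost.
plain_parts = text.split(separator)

# With a capturing group, re.split() returns the separators as list items
# at the odd indices, so each one can be glued onto the piece after it.
pieces = re.split(f"({re.escape(separator)})", text)
kept_parts = [pieces[0]] + [
    pieces[i] + pieces[i + 1] for i in range(1, len(pieces), 2)
]

print(repr(plain_parts[1]))
print(repr(kept_parts[1]))
```

With the separators kept, joining the parts with an empty string gives back the original text, which is the behavior the issue asks for.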

Apr 13 '23 18:04 kroggen

It is even worse for the PythonCodeTextSplitter. The class and def keywords are removed:

class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Python syntax."""

    def __init__(self, **kwargs: Any):
        """Initialize a MarkdownTextSplitter."""
        separators = [
            # First, try to split along class definitions
            "\nclass ",
            "\ndef ",
            "\n\tdef ",
            # Now split by the normal type of lines
            "\n\n",
            "\n",
            " ",
            "",
        ]
        super().__init__(separators=separators, **kwargs)
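One possible way around this, sketched with the standard library and not taken from LangChain, is to split on a zero-width lookahead so that nothing is consumed and the class/def keywords stay with the chunk they introduce. The sample source string is invented for the example, and splitting on empty matches assumes Python 3.7 or later.

```python
import re

source = (
    "import os\n"
    "\nclass Greeter:\n"
    "    pass\n"
    "\ndef main():\n"
    "    print('hi')\n"
)

# Splitting on the literal separator drops the keyword itself.
lossy = source.split("\ndef ")

# A zero-width lookahead splits *before* the separator and consumes nothing,
# so "class" and "def" remain part of the chunk that follows the split point.
lossless = re.split(r"(?=\nclass |\ndef )", source)

for chunk in lossless:
    print(repr(chunk))
```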

Apr 14 '23 00:04 kroggen

I wouldn't mind trying to fix this, but I would like some feedback or a go-ahead from a maintainer on my proposed solution first:

I'm thinking split_text could be refactored to return the separator used as well, and this could be stored in the list of documents returned (maybe as a new class, SplitDocument, with a separator/prefix key).
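A rough sketch of what that could look like, for a single separator only. SplitDocument, split_with_prefix and reconstruct are hypothetical names used for illustration; none of them exist in LangChain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitDocument:
    """Hypothetical document type that remembers the separator it followed."""
    page_content: str
    prefix: str = ""


def split_with_prefix(text: str, separator: str) -> List[SplitDocument]:
    """Split on one separator and record it as the prefix of each later part."""
    parts = text.split(separator)
    docs = [SplitDocument(parts[0])]
    docs += [SplitDocument(part, prefix=separator) for part in parts[1:]]
    return docs


def reconstruct(docs: List[SplitDocument]) -> str:
    """Rebuild the original text from the stored prefixes and contents."""
    return "".join(doc.prefix + doc.page_content for doc in docs)


text = "# T\n\n- a\n\n## F\n\n- b"
docs = split_with_prefix(text, "\n## ")
```

The page_content stays as it is today, so existing callers would see no change, while anyone who needs the original formatting can call something like reconstruct.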

Does any maintainer think this sounds like a good or bad idea?

Apr 14 '23 06:04 vbelius

In my opinion each separator should specify the position where it is re-added

It could use one of these formats:

a list
a dictionary
an additional character at the beginning of the separator string

List

  separators = [
      ["\n## ", START],
      ["\n```", START],
      ["\n\n", END],
      [".", END],
      ["\n", END],
      [" ", END],
      ["", None],
  ]

Dictionary

  separators = {
      "\n## ": START,
      "\n```": START,
      "\n\n": END,
      ".": END,
      "\n": END,
      " ": END,
      "": END,
  }

First Character

S = Start || B = Before
E = End || A = After
X = Exclude || O = Omit

  separators = [
      "S\n## ",
      "S\n```",
      "E\n\n",
      "E.",
      "E\n",
      "E ",
      "",
  ]

Apr 14 '23 08:04 kroggen

Not sure I understand - so in your opinion the separators should be returned separately from the docs?

Apr 14 '23 08:04 vbelius

No, the above would be in the configuration of separators for the MarkdownTextSplitter

Check here

Then the RecursiveCharacterTextSplitter should deal with the new format for the separators list

Apr 14 '23 08:04 kroggen

If the solution should be backwards compatible, then only the option with the list above would fit

The separators list could contain both strings and lists of string and a constant

In this example we have both:

  separators = [
      ["\n## ", TextSplitter.position.START],
      ["\n### ", TextSplitter.position.START],
      ["\n```", TextSplitter.position.BOTH],
      "\n\n",
      [".", TextSplitter.position.END],
      "\n",
      " ",
      ""
  ]

So the code must identify the type of each element and act accordingly
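A minimal sketch of that type check, under some assumptions: Position and normalize_separators are hypothetical names (the thread writes TextSplitter.position.START, which does not exist in LangChain), and a plain string is mapped to None here to mean "keep the current behavior and drop the separator".

```python
from enum import Enum
from typing import List, Optional, Sequence, Tuple, Union


class Position(Enum):
    """Hypothetical constants saying where a separator is re-added."""
    START = "start"
    END = "end"
    BOTH = "both"


SeparatorSpec = Union[str, Sequence]


def normalize_separators(
    separators: List[SeparatorSpec],
) -> List[Tuple[str, Optional[Position]]]:
    """Turn a mixed list into uniform (separator, position) pairs."""
    normalized = []
    for item in separators:
        if isinstance(item, str):
            # Plain string: assumed to mean the old behavior (separator dropped).
            normalized.append((item, None))
        else:
            separator, position = item
            normalized.append((separator, position))
    return normalized


specs = normalize_separators(
    [["\n## ", Position.START], "\n\n", [".", Position.END], ""]
)
```

The splitter would then only have to handle one shape of entry, and existing separator lists made of plain strings would keep working unchanged.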

Apr 15 '23 19:04 kroggen

In the case of markdown, code blocks should be returned with the "```" both at the start and at the end

So maybe we will need another constant (BOTH) to deal with it

Apr 15 '23 19:04 kroggen

definitely in favor of this

would be great to add a bunch of test cases for this when doing so

Apr 17 '23 18:04 hwchase17

+1. Besides, would be great to have something like splitting by headings only mode, which only splits by headings, not by code blocks etc. This will be very helpful to build an assistant that looks up code documentation.

Apr 26 '23 08:04 ifsheldon

The recursive text splitter will only use the next separator to further split the text if the current chunk size is bigger than the maximum size.

So, in the case of Markdown, if your document has a small amount of text + code between headers, the content will not be further split and will be sent as a whole to the model
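A simplified sketch of that recursion, written from the description above and not copied from LangChain. recursive_split is a hypothetical helper; unlike the real splitter it does not merge small pieces back together up to chunk_size, and it drops the separators just like the behavior reported in this issue.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Only descend to the next separator when a piece is still too large."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into single characters".
    pieces = text.split(separator) if separator else list(text)
    chunks: List[str] = []
    for piece in pieces:
        if piece:  # skip empty strings produced by adjacent separators
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


text = "intro\n## A\nshort body\n## B\n" + "long " * 8
chunks = recursive_split(text, ["\n## ", "\n", " "], chunk_size=20)
print(chunks)
```

In this example the section under "A" fits in chunk_size, so it is returned whole with its inner line break, while the longer section under "B" is split further by "\n" and then by " ".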

Apr 26 '23 17:04 kroggen

I am also interested in a fix for this. I am using LangChain to translate md files, and right now it's not possible due to this.

May 02 '23 16:05 tcapelle

@tcapelle You can try another splitter, like the SpacyTextSplitter or the NLTKTextSplitter

Please report your findings

May 02 '23 17:05 kroggen

It is not easy to deal with Markdown.

Code should be split by functions or code blocks or at least entire lines. In python the function decorators and comments should be together with the functions. And other languages have their own peculiarities. Code blocks should be properly split for the model to be able to understand the code.

Tables should be split keeping entire rows of content.

The same applies to Jupyter notebooks, and even normal books, papers, blogs...

It would be better to have a model to do this job, either a general purpose model (using proper prompt) or one trained/fine-tuned specifically for this task

May 02 '23 18:05 kroggen

I actually just want the same behavior from the MarkdownTextSplitter, but putting back the separators. My test is that if I create an identity chain:

class IdentityChain:
    def __init__(self): pass
    def run(self, text=None, **kwargs): return text

and I stack the documents back together, I should get "almost" the same input.
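That check can be written as a small round-trip helper. This is only a sketch using the standard library: round_trip and identity are hypothetical names, and str.splitlines(keepends=True) stands in for a splitter that keeps its separators.

```python
from typing import Callable, List


def identity(text: str) -> str:
    """Stand-in for a chain that returns its input unchanged."""
    return text


def round_trip(
    text: str,
    splitter: Callable[[str], List[str]],
    transform: Callable[[str], str] = identity,
) -> str:
    """Split the text, run every chunk through transform, and stack them back."""
    return "".join(transform(chunk) for chunk in splitter(text))


markdown = "# Dillinger\n\n- Magic\n\n## Features\n\n- Import a HTML file\n"

# A splitter that keeps its separators reproduces the input exactly.
lossless = round_trip(markdown, lambda t: t.splitlines(keepends=True))

# A splitter that drops them (here: plain str.split) does not.
lossy = round_trip(markdown, lambda t: t.split("\n"))
```

Swapping identity for a real translation chain would then keep the line breaks and heading markers in place around the translated text.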

May 02 '23 18:05 tcapelle

What is the best and safest way to split markdown? Is this the best implementation online?

Jun 08 '23 18:06 sergenti

@kroggen I'm confused by this issue. This has nothing to do with markdown, it seems like keep_separator is never respected for any case.

Jun 10 '23 07:06 pseudotensor

@pseudotensor yep universal for all recursive text splitters i think

Jun 10 '23 18:06 cktang88

Hi, @vbelius! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised is about the MarkdownTextSplitter in the langchain library removing formatting and line breaks when splitting a markdown document. There has been a discussion among users and maintainers about different approaches to solve this issue, including suggestions for the format of separators and the need for a general-purpose model to handle markdown splitting. However, the issue remains unresolved at this time.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LangChain project!

Oct 11 '23 16:10 dosubot[bot]