MarkdownTextSplitter removes formatting and line breaks
I was trying to use MarkdownTextSplitter to translate a document and maintain formatting, but I noticed that the splitter removed formatting from the markdown when splitting it.
As an example, the following markdown, when split with chunk_size=200, loses the "## " from the Features line, as well as the line breaks preceding and following that line.
# Dillinger
- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
## Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--
When split using this code:
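from langchain.text_splitter import MarkdownTextSplitter
# markdown_document holds the markdown text shown above
markdown_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_document])
for doc in docs:
    print(doc.page_content)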
The output becomes:
# Dillinger
- Type some Markdown on the left
- See HTML in the right
- ✨Magic ✨
Features
- Import a HTML file and watch it magically convert to Markdown
- Drag and drop images (requires your Dropbox account be linked)
- Import and save files from GitHub, Dropbox, Google Drive and One Drive
--
The formatting and line breaks around the "Features" line are removed. Expected behavior would be that each split doc, when combined, would be the original text.
A solution would be to never remove formatting and line breaks, or to add the removed prefix/suffix to metadata or other keys so they can be used to reconstruct the document with its formatting intact.
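The metadata idea above can be sketched as follows. This is a minimal illustration, not LangChain code: `split_with_prefix`, `reconstruct`, and the `prefix` metadata key are hypothetical names, and plain dicts stand in for the library's document objects.

```python
import re
from typing import Dict, List


def split_with_prefix(text: str, separator: str) -> List[Dict]:
    """Split on `separator`, storing the removed separator as a prefix in metadata."""
    # A capturing group makes re.split return the separators between the chunks:
    # [chunk0, sep, chunk1, sep, chunk2, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    docs = [{"page_content": parts[0], "metadata": {"prefix": ""}}]
    for i in range(1, len(parts), 2):
        docs.append({"page_content": parts[i + 1], "metadata": {"prefix": parts[i]}})
    return docs


def reconstruct(docs: List[Dict]) -> str:
    """Rebuild the original text from chunks and their stored prefixes."""
    return "".join(d["metadata"]["prefix"] + d["page_content"] for d in docs)


text = "# Dillinger\n- Magic\n\n## Features\n- Import"
docs = split_with_prefix(text, "\n## ")
print(reconstruct(docs) == text)  # True
```

Each chunk stays clean for processing, while the prefix is enough to put the document back together afterwards.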
~: pip show langchain
Name: langchain
Version: 0.0.138
These separators are removed from the text using the split() function here
Some separators should be re-added at the beginning of the text, like "##", and others at the end, like "."
This would mean that for each separator we should specify how they should be re-added
Additionally, line breaks are removed here
It is even worse for the PythonCodeTextSplitter. The class and def keywords are removed:
class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Python syntax."""

    def __init__(self, **kwargs: Any):
        """Initialize a MarkdownTextSplitter."""
        separators = [
            # First, try to split along class definitions
            "\nclass ",
            "\ndef ",
            "\n\tdef ",
            # Now split by the normal type of lines
            "\n\n",
            "\n",
            " ",
            "",
        ]
        super().__init__(separators=separators, **kwargs)
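To illustrate the effect described above, here is a small sketch using only `str.split`, independent of LangChain. The sample source string is made up for the demonstration; it shows the `def` keyword disappearing and one way to re-attach it at the start of each piece.

```python
source = "import os\ndef first():\n    return 1\ndef second():\n    return 2"
separator = "\ndef "

# str.split discards the separator, so the "def" keyword is lost.
pieces = source.split(separator)
print(pieces)
# ['import os', 'first():\n    return 1', 'second():\n    return 2']

# Re-adding the separator at the start of every piece but the first restores the text.
restored = [pieces[0]] + [separator + piece for piece in pieces[1:]]
print("".join(restored) == source)  # True
```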
I wouldn't mind trying to fix this, but I would like some feedback or a go-ahead from a maintainer on my proposed solution first:
I'm thinking split_text could be refactored to also return the separator used, which could be stored in the list of documents returned (maybe as a new class, SplitDocument, with a separator/prefix key?).
Any maintainer who thinks this sounds like a good or bad idea?
In my opinion each separator should specify the position where it is re-added
It could use one of these formats:
- a list
- a dictionary
- an additional character at the beginning of the separator string
List
separators = [
    ["\n## ", START],
    ["\n```", START],
    ["\n\n", END],
    [".", END],
    ["\n", END],
    [" ", END],
    ["", None],
]
Dictionary
separators = {
    "\n## ": START,
    "\n```": START,
    "\n\n": END,
    ".": END,
    "\n": END,
    " ": END,
    "": END,
}
First Character
S = Start || B = Before
E = End || A = After
X = Exclude || O = Omit
separators = [
    "S\n## ",
    "S\n```",
    "E\n\n",
    "E.",
    "E\n",
    "E ",
    "",
]
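A rough sketch of how the first-character format might be decoded. The function name `parse_separator` and the position strings are hypothetical, and the flag table assumes the legend means S/B re-add at the start, E/A re-add at the end, and X/O drop the separator.

```python
from typing import Tuple

# Assumed reading of the legend above; not confirmed by the thread.
FLAGS = {
    "S": "start", "B": "start",
    "E": "end", "A": "end",
    "X": "exclude", "O": "exclude",
}


def parse_separator(encoded: str) -> Tuple[str, str]:
    """Split a flag-prefixed separator into (separator, position)."""
    if encoded == "":
        # The empty separator in the proposal carries no flag.
        return "", "exclude"
    flag, separator = encoded[0], encoded[1:]
    if flag not in FLAGS:
        raise ValueError(f"unknown position flag: {flag!r}")
    return separator, FLAGS[flag]


print(parse_separator("S\n## "))  # ('\n## ', 'start')
print(parse_separator("E."))      # ('.', 'end')
```

One drawback of this encoding is that an existing, unflagged separator cannot be told apart from a flagged one, so old configurations would not keep working unchanged.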
Not sure I understand - so in your opinion the separators should be returned separately from the docs?
No, the above would be in the configuration of separators for the MarkdownTextSplitter
Check here
Then the RecursiveCharacterTextSplitter should deal with the new format for the separators list
If the solution should be backwards compatible, then only the option with the list above would fit
The separators list could contain both plain strings and lists of a string and a position constant
In this example we have both:
separators = [
    ["\n## ", TextSplitter.position.START],
    ["\n### ", TextSplitter.position.START],
    ["\n```", TextSplitter.position.BOTH],
    "\n\n",
    [".", TextSplitter.position.END],
    "\n",
    " ",
    ""
]
So the code must identify the type of each element and act accordingly
In the case of markdown, code blocks should be returned with the "```" both at the start and at the end
So maybe we will need another constant (BOTH) to deal with it
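A minimal sketch of the mixed-format idea, assuming a hypothetical `Position` enum in place of `TextSplitter.position` and a hypothetical helper `split_keep`. Plain strings keep the current behavior (separator dropped); START, END and BOTH re-attach it. BOTH is interpreted here as wrapping every second piece, which assumes the fences are paired.

```python
from enum import Enum
from typing import List, Optional, Sequence, Tuple, Union


class Position(Enum):
    START = "start"
    END = "end"
    BOTH = "both"


def normalize(spec: Union[str, Sequence]) -> Tuple[str, Optional[Position]]:
    """Accept either a plain string or a [separator, position] pair."""
    if isinstance(spec, str):
        return spec, None  # legacy form: the separator is dropped
    separator, position = spec
    return separator, position


def split_keep(text: str, spec: Union[str, Sequence]) -> List[str]:
    """Split `text` on one separator, re-attaching it according to its position."""
    separator, position = normalize(spec)
    if separator == "":
        return list(text)
    pieces = text.split(separator)
    last = len(pieces) - 1
    chunks = []
    for i, piece in enumerate(pieces):
        if position is Position.START and i > 0:
            piece = separator + piece
        elif position is Position.END and i < last:
            piece = piece + separator
        elif position is Position.BOTH and i % 2 == 1:
            # Odd pieces sit between an opening and a closing separator.
            piece = separator + piece + (separator if i < last else "")
        chunks.append(piece)
    return [chunk for chunk in chunks if chunk]


print(split_keep("intro\n## A\nbody\n## B\nmore", ["\n## ", Position.START]))
# ['intro', '\n## A\nbody', '\n## B\nmore']
print(split_keep("intro\n```\ncode\n```\nafter", ["\n```", Position.BOTH]))
# ['intro', '\n```\ncode\n```', '\nafter']
```

With any of the three positions, joining the chunks gives back the input, which makes the behavior easy to cover with test cases.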
definitely in favor of this
would be great to add a bunch of test cases for this when doing so
+1. Besides, would be great to have something like splitting by headings only mode, which only splits by headings, not by code blocks etc. This will be very helpful to build an assistant that looks up code documentation.
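A headings-only mode could look roughly like the sketch below. It is not an existing LangChain option; `split_by_headings` is a hypothetical helper that starts a new section at each ATX heading line and ignores `#` lines inside fenced code blocks.

```python
import re
from typing import List


def split_by_headings(text: str) -> List[str]:
    """Split markdown into sections that each start at a heading line."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Only lines outside code fences count as headings.
        is_heading = not in_fence and re.match(r"#{1,6} ", line) is not None
        if is_heading and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


doc = "# Title\nintro\n## Features\n```\n# comment\n```\n"
print(split_by_headings(doc))
# ['# Title\nintro\n', '## Features\n```\n# comment\n```\n']
```

Because every line is kept, joining the sections reproduces the input exactly.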
The recursive text splitter will only use the next separator to further split the text if the current chunk size is bigger than the maximum size.
So, in the case of Markdown, if your document has a small amount of text + code between headers, the content will not be split further and will be sent as a whole to the model
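The fallback behavior described above can be sketched as a simplified recursion. This is an illustration only, not the library's implementation: `recursive_split` is a hypothetical function, it drops separators, and it skips the step where small pieces are merged back up to the chunk size.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split only as far as needed to get every chunk under `chunk_size`."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    pieces = list(text) if separator == "" else text.split(separator)
    chunks: List[str] = []
    for piece in pieces:
        if len(piece) > chunk_size:
            # Only oversized pieces fall through to the next separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        elif piece:
            chunks.append(piece)
    return chunks


separators = ["\n\n", "\n", " ", ""]
print(recursive_split("aaaa bbbb\n\ncccc", separators, 9))  # ['aaaa bbbb', 'cccc']
print(recursive_split("aaaa bbbb\n\ncccc", separators, 4))  # ['aaaa', 'bbbb', 'cccc']
```

With a large chunk size the whole text comes back as a single chunk, which matches the point that short sections between headers are never split further.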
I am also interested in a fix for this. I am using LangChain to translate md files, and right now it's not possible due to this.
@tcapelle You can try another splitter, like the SpacyTextSplitter or the NLTKTextSplitter
Please report your findings
It is not easy to deal with Markdown.
Code should be split by functions or code blocks, or at least by entire lines. In Python, function decorators and comments should stay together with their functions. Other languages have their own peculiarities. Code blocks should be split properly for the model to be able to understand the code.
Tables should be split keeping entire rows of content.
The same applies to Jupyter notebooks, and even normal books, papers, blogs...
It would be better to have a model to do this job, either a general purpose model (using proper prompt) or one trained/fine-tuned specifically for this task
I actually just want the same behavior from the MarkdownTextSplitter, but putting back the separators. My test is that if I create an identity chain:
class IdentityChain:
    def __init__(self): pass
    def run(self, text=None, **kwargs): return text
and I stack the documents back together, I should get almost the same input.
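That round-trip test could be written along these lines. The `round_trip`, `lossy_split` and `lossless_split` helpers are hypothetical and do not use LangChain; the lossless variant relies on `re.split` accepting zero-width matches, which needs Python 3.7 or later.

```python
import re
from typing import Callable, List


class IdentityChain:
    """Stand-in chain from the comment above: returns its input unchanged."""

    def run(self, text=None, **kwargs):
        return text


def round_trip(text: str, split: Callable[[str], List[str]], chain: IdentityChain) -> str:
    """Split, run every chunk through the chain, and stack the results back."""
    return "".join(chain.run(text=chunk) for chunk in split(text))


def lossy_split(text: str) -> List[str]:
    # Mirrors the reported behavior: the separator is discarded.
    return text.split("\n## ")


def lossless_split(text: str) -> List[str]:
    # A lookahead splits before the separator without consuming it.
    return [piece for piece in re.split(r"(?=\n## )", text) if piece]


document = "# Dillinger\n- Magic\n\n## Features\n- Import\n"
print(round_trip(document, lossless_split, IdentityChain()) == document)  # True
print(round_trip(document, lossy_split, IdentityChain()) == document)     # False
```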
What is the best and safest way to split markdown? Is this the best implementation online?
@kroggen I'm confused by this issue. This has nothing to do with markdown, it seems like keep_separator is never respected for any case.
@pseudotensor yep universal for all recursive text splitters i think
Hi, @vbelius! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you raised is about the MarkdownTextSplitter in the langchain library removing formatting and line breaks when splitting a markdown document. There has been a discussion among users and maintainers about different approaches to solve this issue, including suggestions for the format of separators and the need for a general-purpose model to handle markdown splitting. However, the issue remains unresolved at this time.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contributions to the LangChain project!