community: fix MarkdownHeaderTextSplitter failing to parse headers with non-printable characters
Description: MarkdownHeaderTextSplitter fails to parse headers that are preceded by non-printable characters. See #20643 for more details.
The following is the official test case. Replacing `# Foo\n\n` with `\ufeff# Foo\n\n` is enough to make it fail: the chunk metadata comes back empty.
# Imports assumed so this excerpt runs on its own.
from langchain_core.documents import Document
from langchain_text_splitters import MarkdownHeaderTextSplitter


def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""
    # The leading "\ufeff" (byte order mark) is the only change from the original test.
    markdown_document = (
        "\ufeff# Foo\n\n"
        " ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
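
The failure mode can also be shown without the splitter. The sketch below is only an illustration, not the splitter's actual code: `strip_non_printable` and `find_header` are hypothetical helpers that mimic prefix-based header matching. It relies on two facts about Python strings: `str.strip()` does not remove U+FEFF because it is not whitespace, and `str.isprintable()` returns False for it.

```python
from typing import List, Optional, Tuple


def strip_non_printable(line: str) -> str:
    """Drop every character for which str.isprintable() is False (e.g. U+FEFF).

    Note: this also drops tabs and newlines, so apply it to one stripped line at a time.
    """
    return "".join(ch for ch in line if ch.isprintable())


def find_header(
    line: str, headers: List[Tuple[str, str]]
) -> Optional[Tuple[str, str]]:
    """Hypothetical matcher: return (header name, header text), or None if no marker matches."""
    stripped = line.strip()
    # Try longer markers first so "##" is not mistaken for "#".
    for marker, name in sorted(headers, key=lambda h: len(h[0]), reverse=True):
        if stripped.startswith(marker) and (
            len(stripped) == len(marker) or stripped[len(marker)] == " "
        ):
            return name, stripped[len(marker):].strip()
    return None


headers = [("#", "Header 1"), ("##", "Header 2")]
raw_line = "\ufeff# Foo"

# strip() leaves the byte order mark in place, so the "#" prefix check fails.
before = find_header(raw_line, headers)

# Filtering non-printable characters first lets the header be recognized.
after = find_header(strip_non_printable(raw_line), headers)
```

Here `before` is `None`, which corresponds to the empty metadata described above, while `after` carries the header name and text. Filtering on `str.isprintable` is one possible approach; stripping only a leading byte order mark would be a narrower alternative.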
twitter: @coolbeevip
Could you please take a look when you have a moment? @hwchase17 @baskaryan