langchain community: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters

Open coolbeevip opened this issue 2 months ago • 2 comments

Description: MarkdownHeaderTextSplitter fails to parse headers that contain non-printable characters. More details in #20643.

The following is the official test case. Replacing "# Foo\n\n" with "\ufeff# Foo\n\n" is enough to make it fail: the chunk metadata is empty.

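from langchain_core.documents import Document
from langchain_text_splitters import MarkdownHeaderTextSplitter


def test_md_header_text_splitter_1() -> None: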
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
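
A minimal sketch of why a leading "\ufeff" (byte order mark) can hide a header. It assumes header detection amounts to stripping whitespace and checking for the "#" prefix. The helpers looks_like_header and drop_non_printable are hypothetical names used only for illustration; they are not the splitter's actual code.

```python
# Hypothetical helpers for illustration; not the library's implementation.
BOM = "\ufeff"


def looks_like_header(line: str, marker: str = "#") -> bool:
    """Naive check: strip whitespace, then test for the marker plus a space."""
    return line.strip().startswith(marker + " ")


def drop_non_printable(line: str) -> str:
    """Drop every character for which str.isprintable() is False.

    Note: this also removes tabs and newlines, so it is meant to be
    applied to a single line, not to a whole document.
    """
    return "".join(ch for ch in line if ch.isprintable())


raw = BOM + "# Foo"

# str.strip() only removes whitespace, and the BOM is not whitespace,
# so the line still starts with "\ufeff" rather than "#".
naive = looks_like_header(raw)

# After filtering non-printable characters the marker is visible again.
cleaned = looks_like_header(drop_non_printable(raw))

print(naive, cleaned)
```

Filtering per line keeps the change local to header detection. Whether the same filter should also be applied to the page content is a separate design choice that this sketch does not address.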

twitter: @coolbeevip

Apr 19 '24 06:04 coolbeevip

Could you please take a look when you have a moment? @hwchase17 @baskaryan

Apr 20 '24 16:04 coolbeevip
