langchain
community: fix MarkdownHeaderTextSplitter fails to parse headers with non-printable characters
Description: MarkdownHeaderTextSplitter fails to parse headers with non-printable characters (for example, a leading \ufeff). More in #20643.
The following is the official test case with # Foo\n\n replaced by \ufeff# Foo\n\n. Without this fix, that single change makes the test fail because the chunk metadata is empty.
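from langchain_core.documents import Document
from langchain_text_splitters import MarkdownHeaderTextSplitter


def test_md_header_text_splitter_1() -> None: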
    """Test markdown splitter by header: Case 1."""
    markdown_document = (
        "\ufeff# Foo\n\n"
        "    ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output
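To illustrate why the header is missed, here is a minimal standalone sketch. It does not use the splitter itself; it assumes header detection boils down to a `startswith` check on a stripped line, and `strip_non_printable` is a hypothetical helper name used only for this example. The point is that `str.strip()` removes whitespace only, and U+FEFF is not whitespace, so the line no longer starts with `# `.

```python
def strip_non_printable(line: str) -> str:
    """Drop every character that str.isprintable() rejects (e.g. U+FEFF).

    Hypothetical helper for illustration only, not the splitter's API.
    """
    return "".join(ch for ch in line if ch.isprintable())


raw_line = "\ufeff# Foo"

# str.strip() leaves U+FEFF in place because it is not whitespace,
# so a prefix check for the header marker fails.
header_detected_before = raw_line.strip().startswith("# ")

# After filtering non-printable characters the header marker is first again.
cleaned_line = strip_non_printable(raw_line.strip())
header_detected_after = cleaned_line.startswith("# ")

print(header_detected_before, header_detected_after)
```

Note that `str.isprintable()` also rejects tabs and newlines, so a filter like this is best applied to a single, already stripped line rather than to the whole document.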
twitter: @coolbeevip
Could you please take a look when you have a moment? @hwchase17 @baskaryan