[Bug]: MarkdownReader removes hyperlink URLs enclosed in angular brackets

Open enrico-stauss opened this issue 8 months ago • 3 comments

Bug Description

The postprocessing in MarkdownReader.markdown_to_tups removes HTML by replacing everything within angle brackets with an empty string. However, AFAIK it is valid Markdown to enclose the URL of a hyperlink in angle brackets. This postprocessing step therefore removes the URL altogether.

Enclosing the URL in angle brackets is one way to work with URLs that contain spaces. The other way would be to use percent-encoding, but that's just a side note.

I know it's a rather rare case but I'd consider it a bug nonetheless.
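
To illustrate the effect in isolation, here is a minimal sketch that does not call llama_index at all. It assumes the HTML stripping behaves like a non-greedy `<.*?>` substitution; the pattern and the `strip_html` helper are stand-ins for illustration, not the library's code. The sample link uses the usual `[text](<url> "title")` form.

```python
import re
from urllib.parse import quote

# Assumption: the HTML stripping behaves like this non-greedy pattern.
HTML_TAG = re.compile(r"<.*?>")


def strip_html(text: str) -> str:
    """Hypothetical helper: drop everything from '<' to the next '>'."""
    return HTML_TAG.sub("", text)


# Link destination wrapped in angle brackets because the path has a space.
bracketed = '[link](<my/dir/my file2.md> "Referenced File")'

# Same destination, percent-encoded instead of wrapped in angle brackets.
encoded = f'[link]({quote("my/dir/my file2.md")} "Referenced File")'

print(strip_html(bracketed))  # [link]( "Referenced File")
print(strip_html(encoded))    # [link](my/dir/my%20file2.md "Referenced File")
```

Under that assumption, the angle-bracket form loses its destination while the percent-encoded form passes through unchanged.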

Version

0.10.37

Steps to Reproduce

Create a Markdown file (no need to do it manually; the script below does it automatically):

# Heading
This is just an arbitrary text contaning a (link to a local file)[<my/dir/my file2.md> "Referenced File"] that has spaces in the name.

Save it to my/dir/file.md and read it using:

import os
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import MarkdownReader


os.makedirs("my/dir", exist_ok=True)
content = """
# Heading
This is just an arbitrary text contaning a (link to a local file)[<my/dir/my file2.md> "Referenced File"] that has spaces in the name.
"""
if not os.path.exists("my/dir/file.md"):
    with open("my/dir/file.md", "w", encoding="utf-8") as f:
        f.write(content)


documents = SimpleDirectoryReader(
    "my/dir",
    filename_as_id=False,
    recursive=True,
    required_exts=[".md"],
    file_extractor={".md": MarkdownReader(remove_hyperlinks=False, remove_images=False)},
).load_data()
print(documents[1].text)

This prints:

Heading
This is just an arbitrary text contaning a (link to a local file)[ "Referenced File"] that has spaces in the name.
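
One possible direction for a fix, sketched under assumptions: skip any `<...>` span that directly follows `](`, so that link destinations in the usual `[text](<url>)` form survive while other tags are still stripped. The `strip_html_keep_links` name and the pattern are hypothetical and not part of llama_index. The sketch does not cover reference-style definitions such as `[ref]: <url>`, or any link syntax other than `[text](<url>)`.

```python
import re

# Hypothetical pattern: match '<...>' unless it directly follows '](',
# i.e. unless it is the destination of an inline Markdown link.
TAG_OUTSIDE_LINK = re.compile(r"(?<!\]\()<[^<>]*>")


def strip_html_keep_links(text: str) -> str:
    """Remove tag-like spans but keep angle-bracket link destinations."""
    return TAG_OUTSIDE_LINK.sub("", text)


sample = 'A <b>bold</b> [link](<my/dir/my file2.md> "Referenced File").'
print(strip_html_keep_links(sample))
# A bold [link](<my/dir/my file2.md> "Referenced File").
```

A lookbehind keeps the change small, but a proper Markdown parser would be the more robust way to tell HTML apart from link destinations.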

Relevant Logs/Tracebacks

No response

enrico-stauss · May 29 '24 09:05