llama_index
[Bug]: MarkdownReader removes hyperlink URLs enclosed in angular brackets
Bug Description
The postprocessing in MarkdownReader.markdown_to_tups removes HTML by replacing content within angular brackets with an empty string. However, AFAIK it is valid in markdown to enclose the URL of a hyperlink inside angular brackets. This postprocessing step therefore removes the URL altogether.
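The effect can be reproduced without the reader. This is a minimal sketch, assuming the postprocessing behaves like a non-greedy `<.*?>` substitution; the exact pattern used inside markdown_to_tups may differ, and the variable names here are only illustrative.

```python
import re

# The sample line from the reproduction file, copied verbatim.
line = (
    "This is just an arbitrary text contaning a (link to a local file)"
    '[<my/dir/my file2.md> "Referenced File"] that has spaces in the name.'
)

# Assumption: the reader's HTML cleanup is equivalent to this substitution,
# which deletes everything between "<" and the next ">".
stripped = re.sub(r"<.*?>", "", line)

print(stripped)
```

Under that assumption the bracketed URL is treated like an HTML tag and deleted, leaving only the link title behind.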
Enclosing the URL within angular brackets is one way to work with URLs that contain spaces. The other way would be to use percent-encoding but that's just a side note.
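For reference, the percent-encoding alternative could look like the sketch below. It uses `urllib.parse.quote` from the standard library; the `<.*?>` pattern is again only an assumed stand-in for the reader's cleanup, used to show that an encoded URL contains nothing for it to match.

```python
import re
from urllib.parse import quote

path = "my/dir/my file2.md"

# quote() keeps "/" by default and encodes the space as %20.
encoded = quote(path)
print(encoded)  # my/dir/my%20file2.md

# Without angular brackets, an assumed "<.*?>" cleanup leaves the URL alone.
link = f'(link to a local file)[{encoded} "Referenced File"]'
print(re.sub(r"<.*?>", "", link) == link)  # True
```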
I know it's a rather rare case but I'd consider it a bug nonetheless.
Version
0.10.37
Steps to Reproduce
Create a markdown file (no need to do it manually, the script below does it automatically)
# Heading
This is just an arbitrary text contaning a (link to a local file)[<my/dir/my file2.md> "Referenced File"] that has spaces in the name.
save it to my/dir/file.md
and read it using
import os
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import MarkdownReader
os.makedirs("my/dir", exist_ok=True)
content = """
# Heading
This is just an arbitrary text contaning a (link to a local file)[<my/dir/my file2.md> "Referenced File"] that has spaces in the name.
"""
# Write the sample file only if it does not exist yet.
if not os.path.exists("my/dir/file.md"):
    with open("my/dir/file.md", "w", encoding="utf-8") as f:
        f.write(content)
documents = SimpleDirectoryReader(
"my/dir",
filename_as_id=False,
recursive=True,
required_exts=[".md"],
file_extractor={".md": MarkdownReader(remove_hyperlinks=False, remove_images=False)},
).load_data()
print(documents[1].text)
This prints
Heading
This is just an arbitrary text contaning a (link to a local file)[ "Referenced File"] that has spaces in the name.
Relevant Logs/Tracebacks
No response