Inline link HREF gets confused by closing parenthesis in URL
A valid URL that contains both an opening and a closing parenthesis, https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection, gets cut off at https://en.wikipedia.org/wiki/Selection_(genetic_algorithm because the first closing parenthesis is interpreted as the end of the inline link syntax [alt text](url).
I think the culprit is the \) here: https://github.com/lepture/mistune/blob/main/src/mistune/helpers.py#L15
in the parse_link_href() function.
Here's a simple way to reproduce:
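import re
# Patterns reproduced from the linked helpers.py
PREVENT_BACKSLASH = r"(?<!\\)(?:\\\\)*"
LINK_HREF_INLINE_RE = re.compile(r"[ \t]*\n?[ \t]*([^ \t\n]*?)(?:[ \t\n]|(?:" + PREVENT_BACKSLASH + r"\)))")
# The text that follows the opening "(" of an inline link
src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"
m = LINK_HREF_INLINE_RE.match(src, 0)
end_pos = m.end()
href = m.group(1)
# Prints the truncated URL: https://en.wikipedia.org/wiki/Selection_(genetic_algorithm
print(href)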
My attempt to fix:
First, a very simplified version of the regex is: r"([^ \t\n]*?)\)"
- Making the main capture group greedy fixes it:
r"([^ \t\n]*)\)"
- Adding a balanced parenthesis matcher:
r"((?:[^ \t\n()]*?)(?:\([^ \t\n()]*\))*?(?:[^ \t\n()]*?))\)"
I think URLs allow any number of parentheses though, so they don't need to be balanced.
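To compare the simplified patterns side by side, here is a small sketch that runs each one against the sample from the reproduction. The `PATTERNS` dict, the `capture()` helper, and the second input `adjacent` (with made-up example domains) are only for illustration and are not part of mistune.

```python
import re

# The three simplified patterns discussed above.
PATTERNS = {
    "non-greedy": r"([^ \t\n]*?)\)",
    "greedy": r"([^ \t\n]*)\)",
    "balanced": r"((?:[^ \t\n()]*?)(?:\([^ \t\n()]*\))*?(?:[^ \t\n()]*?))\)",
}


def capture(name, text):
    """Return what the named pattern captures as the href, or None."""
    m = re.match(PATTERNS[name], text)
    return m.group(1) if m else None


# Sample from the reproduction: the text after the opening "(" of the link.
src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"

# Hypothetical input: a second link follows with no whitespace in between.
adjacent = "https://a.example/x)[next](https://b.example/y)"

for name in PATTERNS:
    print(f"{name:>10} | src:      {capture(name, src)}")
    print(f"{name:>10} | adjacent: {capture(name, adjacent)}")
```

On `src`, the non-greedy pattern stops at the first `)`, while the greedy and balanced patterns both capture the full URL. On `adjacent`, the greedy pattern runs on to the last `)` of the second link, whereas the balanced pattern still stops after the first URL, so the greedy variant may not be safe when links sit next to each other.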
Just some portability info for reference. On Babelmark, most Markdown parsers get this right, but a sizeable minority doesn't:
https://babelmark.github.io/?text=%5BTest%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelection_(genetic_algorithm)%23Rank_selection)
Conversely, %-escaping ( and ) in URLs works universally:
https://babelmark.github.io/?text=%5BTest%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelection_%2528genetic_algorithm%2529%23Rank_selection)
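As a workaround on the authoring side, the parentheses can be percent-encoded before the URL is placed in a link. A minimal sketch, assuming a hypothetical helper named `escape_parens()`; it only touches `(` and `)` and leaves the rest of the URL as-is:

```python
def escape_parens(url: str) -> str:
    """Percent-encode literal parentheses so the URL contains no ( or )."""
    return url.replace("(", "%28").replace(")", "%29")


url = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection"
markdown_link = f"[Test]({escape_parens(url)})"
print(markdown_link)
```

The resulting link destination contains `%28` and `%29` instead of raw parentheses, so the only `)` left in the Markdown is the one that closes the link.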
To parse this correctly, we need to scan the link character by character. For now, mistune uses a regex.
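For illustration, a character-by-character scan could look roughly like the sketch below. It is not mistune's implementation: `scan_link_href()` is a hypothetical function, it assumes `pos` points at the first character of the destination, and it treats a backslash as an escape only before `(`, `)`, or `\`. It tracks parenthesis depth, so a `)` closes the link only when no `(` is still open.

```python
from typing import Optional, Tuple


def scan_link_href(src: str, pos: int = 0) -> Optional[Tuple[str, int]]:
    """Scan an inline link destination starting at pos.

    Returns (href, end), where end is the index of the terminating
    whitespace or closing parenthesis, or None if no valid end is found.
    """
    depth = 0  # number of currently open, unescaped "("
    i = pos
    n = len(src)
    while i < n:
        ch = src[i]
        if ch == "\\" and i + 1 < n and src[i + 1] in "()\\":
            i += 2  # skip the backslash and the escaped character
            continue
        if ch in " \t\n":
            # Whitespace ends the destination; reject unbalanced "(".
            return (src[pos:i], i) if depth == 0 else None
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return src[pos:i], i  # this ")" closes the link itself
            depth -= 1
        i += 1
    return None  # reached end of input without a terminator


src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"
result = scan_link_href(src)
print(result)
```

With the sample from the reproduction, the scan returns the full URL together with the index of the final `)`. Requiring balanced parentheses is a design choice here; it matches the CommonMark rule that a link destination may contain parentheses only if they are balanced or backslash-escaped.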