mistune

Inline link HREF gets confused by closing parenthesis in URL

Open calvincramer opened this issue 8 months ago • 1 comments

A valid URL containing opening and closing parentheses, https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection, gets cut off at https://en.wikipedia.org/wiki/Selection_(genetic_algorithm because the first closing parenthesis is interpreted as the end of the inline link syntax [alt text](url).

I think the culprit is the \) here: https://github.com/lepture/mistune/blob/main/src/mistune/helpers.py#L15 in the parse_link_href() function.

Here's a simple way to reproduce:

import re

PREVENT_BACKSLASH = r"(?<!\\)(?:\\\\)*"
LINK_HREF_INLINE_RE = re.compile(r"[ \t]*\n?[ \t]*([^ \t\n]*?)(?:[ \t\n]|(?:" + PREVENT_BACKSLASH + r"\)))")

src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"
m = LINK_HREF_INLINE_RE.match(src, 0)
end_pos = m.end()
href = m.group(1)
print(href)  # prints: https://en.wikipedia.org/wiki/Selection_(genetic_algorithm

My attempt to fix:

First, a very simplified version of the regex is: r"([^ \t\n]*?)\)"

  1. Making the main capture group greedy (dropping the ?) fixes it: r"([^ \t\n]*)\)"
  2. Adding a balanced parenthesis matcher: r"((?:[^ \t\n()]*?)(?:\([^ \t\n()]*\))*?(?:[^ \t\n()]*?))\)"

I think URLs allow any number of parentheses though, so they don't need to be balanced.
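For comparison, here is a small sketch that runs the simplified lazy regex and the two candidate fixes against the same input. The names LAZY, GREEDY, and BALANCED are labels chosen for this illustration only; they are not identifiers from mistune.

```python
import re

src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"

# Simplified form of the current behaviour: the lazy group stops at the first ")".
LAZY = re.compile(r"([^ \t\n]*?)\)")
# Candidate 1: greedy group, backtracks to the last ")" in the whitespace-free run.
GREEDY = re.compile(r"([^ \t\n]*)\)")
# Candidate 2: allows consecutive, non-nested "(...)" groups inside the href.
BALANCED = re.compile(r"((?:[^ \t\n()]*?)(?:\([^ \t\n()]*\))*?(?:[^ \t\n()]*?))\)")

for name, pattern in [("lazy", LAZY), ("greedy", GREEDY), ("balanced", BALANCED)]:
    print(f"{name:9}", pattern.match(src).group(1))
```

On this input the lazy pattern truncates the URL, while the other two capture it in full. One caveat with the greedy variant: it consumes everything up to the last ")" before the next whitespace, so an input such as "url)(more)" would yield "url)(more" rather than "url".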

calvincramer avatar Apr 23 '25 20:04 calvincramer

Just some portability info for reference. On Babelmark, most Markdown parsers get this right, but a sizeable minority doesn't:

https://babelmark.github.io/?text=%5BTest%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelection_(genetic_algorithm)%23Rank_selection)

Conversely, %-escaping ( and ) in URLs works universally:

https://babelmark.github.io/?text=%5BTest%5D(https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelection_%2528genetic_algorithm%2529%23Rank_selection)
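As a workaround on the authoring side, the parentheses can be percent-escaped before the URL is placed in the Markdown source. A minimal sketch follows; escape_parens is a hypothetical helper name, not part of any library. Note that the %2528 and %2529 in the Babelmark link above are just %28 and %29 encoded once more for the query string.

```python
def escape_parens(url: str) -> str:
    """Percent-escape literal parentheses so they cannot end an inline link early."""
    return url.replace("(", "%28").replace(")", "%29")


url = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection"
print(f"[Test]({escape_parens(url)})")
```

Only the two parenthesis characters are touched, so anything already percent-encoded in the URL is left alone.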

mentalisttraceur avatar Apr 27 '25 02:04 mentalisttraceur

To parse this correctly, we would need to parse the link character by character. For now, mistune uses regex.
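A rough sketch of what a character-by-character scan could look like is below. This is not mistune's implementation; scan_link_href is a hypothetical name, and the rules are a simplified reading of CommonMark's requirement that parentheses in a link destination be either balanced or backslash-escaped. It ignores angle-bracket destinations and link titles.

```python
def scan_link_href(src, pos=0):
    """Scan an inline link destination starting at pos.

    Returns (href, end), where end is the index at which scanning stopped
    (the closing ")" or a whitespace character), or None if the
    parentheses are unbalanced.
    """
    depth = 0  # current nesting level of unescaped "("
    i = pos
    while i < len(src):
        ch = src[i]
        # A backslash-escaped paren or backslash is taken literally.
        if ch == "\\" and i + 1 < len(src) and src[i + 1] in "()\\":
            i += 2
            continue
        if ch in " \t\n":
            break  # whitespace ends the destination
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                break  # this ")" closes the inline link itself
            depth -= 1
        i += 1
    if depth != 0:
        return None  # an opened "(" was never closed
    return src[pos:i], i


src = "https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_selection)\nblah blah further text"
print(scan_link_href(src))
```

Tracking a nesting depth is what a single regex from Python's re module cannot express for arbitrary nesting, which is why a manual scan is the usual approach here.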

lepture avatar Dec 21 '25 04:12 lepture