Too much escaping breaks URLs containing parentheses
- Use case: output from `html2text` is emailed
- Issue: URLs containing parentheses, brackets or other "markdown escapables" no longer link to the original resource because backslashes are inserted into them
Steps to reproduce:
- Start with an HTML `a` tag whose `href` contains a parenthesis, e.g. `https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]`
- Pass it through `html2text` (see code below)
- Email the resulting string to yourself, which in our example will be: `[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])`
- Open the email in a modern system (Gmail in my case)
- Click the URL in the email; it now points to a different resource than the original one, in our example `https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\]` (notice the extra backslashes)
Potential solutions:
- In `__init__.py`, modify line 459 from `self.o("]({url}{title})".format(url=escape_md(url), title=title))` to `self.o("]({url}{title})".format(url=url, title=title))`; I don't know the Markdown spec well enough, but after trying a few Markdown readers, the lack of escaping inside a URL doesn't seem to break anything, even with the stray extra ")"
- Add a switch to suppress Markdown escaping in URLs (new use case, slower code)
- Others?
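To make the first two options concrete, here is a minimal sketch of what an opt-out switch could look like. It is not html2text's actual code: `format_link_target`, its `escape_url` parameter and the `escape_md_chars` stand-in are hypothetical names, and the stand-in assumes that the escaping simply puts a backslash in front of `\`, `[`, `]`, `(` and `)`, which is what the output in this report suggests.

```python
import re

# Stand-in for the library's escaping helper (assumption: it backslash-escapes
# the characters \ [ ] ( ) and nothing else).
_MD_CHARS = re.compile(r"([\\\[\]\(\)])")


def escape_md_chars(text):
    """Put a backslash in front of every Markdown-sensitive character."""
    return _MD_CHARS.sub(r"\\\1", text)


def format_link_target(url, title="", escape_url=True):
    """Build the '](url title)' tail of a Markdown link.

    escape_url is the hypothetical switch: True reproduces the behaviour
    reported here, False leaves the URL exactly as it was in the href.
    """
    if escape_url:
        url = escape_md_chars(url)
    return "]({url}{title})".format(url=url, title=title)


url = "https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"
print(format_link_target(url))                    # escaped, as reported
print(format_link_target(url, escape_url=False))  # URL left untouched
```

Keeping the default at `True` would preserve the existing output for anyone who relies on it, which is the main argument for a switch over simply dropping the escaping.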
Any tips/feedback?
Code:
```python
import html2text
import sys

print(f'{sys.version=}')
print(f'{html2text.__version__=}\n')

html = ('<html>\n<head>\n</head>\n<body>\n'
        '<a href="https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"></a>\n'
        '</body></html>')

parser = html2text.HTML2Text()
parser.body_width = 0  # disable line wrapping so the link stays on one line
print(parser.handle(html))
```

Output:

```
sys.version='3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]'
html2text.__version__=(2020, 1, 16)
[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])
```
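As a stopgap that needs no change to the library, the converted text could be post-processed to strip the backslashes from link targets before it is emailed. The sketch below uses only the standard `re` module; `unescape_link_targets` is a hypothetical helper name, and it assumes each link sits on a single line (as with `body_width = 0`) and that the target ends at the first unescaped `)` or whitespace. It does not try to handle a literal, escaped `\](` in ordinary text.

```python
import re

# "](" followed by the link target: a run of backslash-escaped characters or
# characters that are not a backslash, a closing parenthesis or whitespace.
_LINK_TARGET = re.compile(r"(\]\()((?:\\.|[^\\)\s])*)")
_ESCAPED_CHAR = re.compile(r"\\(.)")


def unescape_link_targets(markdown_text):
    """Remove backslash escapes inside Markdown link targets only."""
    def _fix(match):
        # Keep the "](" prefix, drop the backslash in front of each escaped char.
        return match.group(1) + _ESCAPED_CHAR.sub(r"\1", match.group(2))
    return _LINK_TARGET.sub(_fix, markdown_text)


escaped = r"[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])"
print(unescape_link_targets(escaped))
```

The result is no longer strictly escaped Markdown, so this only suits cases like the one described here, where the text is read as plain text or by a lenient renderer.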