Too much escaping breaks URLs containing parentheses

Open mborsetti opened this issue 4 years ago • 0 comments

  • Use case: output from html2text is emailed
  • Issue: URLs containing parentheses, brackets or other "markdown escapables" no longer link to the original resource due to insertion of backslashes

Steps to reproduce:

  1. Start with an HTML <a> tag whose href contains parentheses or brackets, e.g. https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]
  2. Pass it through html2text (see code below)
  3. Email the resulting string to yourself, which in our example will be: [](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])
  4. Open the email in a modern system (Gmail in my case)
  5. Click the URL in the email; it now points to a different resource than the original one, in our example https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\] (notice the extra backslashes)

Potential solutions:

  1. In __init__.py, change line 459 from self.o("]({url}{title})".format(url=escape_md(url), title=title)) to self.o("]({url}{title})".format(url=url, title=title)). I don't know the Markdown spec well enough, but after trying a few Markdown readers, the lack of escaping inside a URL doesn't seem to break anything, even with the stray extra ")"
  2. Add a switch to suppress Markdown escaping in URLs (new use case, slower code)
  3. Others?
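
To make options 1 and 2 concrete, here is a minimal sketch of how the link-emitting line might behave with and without escaping. It does not import html2text: escape_md_standin is a hypothetical stand-in that only escapes the characters seen escaped in this issue (brackets and parentheses), and format_link_tail and its escape_urls parameter are illustrative names, not part of the library.

```python
import re


def escape_md_standin(text: str) -> str:
    # Hypothetical stand-in for the library's escape_md: puts a backslash
    # in front of brackets and parentheses, as seen in the output reported
    # in this issue. The real function may escape more characters.
    return re.sub(r'([\[\]()])', r'\\\1', text)


def format_link_tail(url: str, title: str = "", escape_urls: bool = True) -> str:
    # Builds the same "](url title)" string as the line quoted in option 1.
    # escape_urls is the hypothetical switch from option 2.
    if escape_urls:
        url = escape_md_standin(url)
    return "]({url}{title})".format(url=url, title=title)


url = "https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"
print("[" + format_link_tail(url))                     # escaped, as today
print("[" + format_link_tail(url, escape_urls=False))  # URL left untouched
```

With the switch defaulting to True, existing callers would see no change, which is the main argument for option 2 over option 1.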

Any tips/feedback?


Code:

import html2text
import sys

print(f'{sys.version=}')
print(f'{html2text.__version__=}\n')
      
html = ('<html>\n<head>\n</head>\n<body>\n'
        '<a href="https://www.sample.com/?url-with-(parenthesized-text)-)-[and-brackets]"></a>\n'
        '</body></html>')
parser = html2text.HTML2Text()
parser.body_width = 0
print(parser.handle(html))

Output:

sys.version='3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]'
html2text.__version__=(2020, 1, 16)

[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])
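
In the meantime, a possible workaround on the caller's side is to post-process the string returned by html2text and strip the backslashes inside inline link destinations only. This is a rough sketch under stated assumptions: unescape_link_urls is a hypothetical helper, it only handles links of the form ](url) without a title, and it only removes escapes in front of brackets and parentheses.

```python
import re

# Matches "](destination)" where the destination is made of backslash-escaped
# characters or characters that are not a backslash, parenthesis or whitespace.
_LINK_DEST = re.compile(r'\]\(((?:\\.|[^\\()\s])*)\)')
# Matches a backslash followed by a bracket or parenthesis.
_ESCAPED = re.compile(r'\\([()\[\]])')


def unescape_link_urls(markdown: str) -> str:
    # Hypothetical helper: removes backslash escapes inside inline link
    # destinations and leaves escapes elsewhere in the text untouched.
    def fix(match):
        url = _ESCAPED.sub(r'\1', match.group(1))
        return '](' + url + ')'
    return _LINK_DEST.sub(fix, markdown)


escaped = r'[](https://www.sample.com/?url-with-\(parenthesized-text\)-\)-\[and-brackets\])'
print(unescape_link_urls(escaped))
```

Escapes outside a link destination, such as \(text\) in running prose, are not touched. Whether a destination with an unbalanced ")" still parses as a link depends on the Markdown reader.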

mborsetti, May 07 '20 00:05