Semicolon in Text with &#.

Open. radze90 opened this issue 4 years ago • 1 comment

html2text version: 2020.1.16
Python version: 3.9.5

import html2text

test = html2text.HTML2Text()
text = test.handle("<p>Sample text K&N. Sample text.</p>")
print(text)

output: Sample text K&N.; Sample text.

Hi,

I noticed that the module inserts an unwanted semicolon into the text when converting a string that contains a bare &. It doesn't matter which character comes after the &. Is this intentional, or is it a bug? Is there a way to work around it?

radze90, Jul 22 '21 16:07

I confirm this issue.

My sample:

ZZZ
ZZ&Z
ZZ#Z
https://some.site.com/index.php?r=billMail/confirmNewBillMail&code=pYgJeYbpnSsaGdSRoKgfa9bd0fb4248dbb437c745afbb6d1b29tvPsONXEQApNxxswCSZ

Output after html2text:

ZZZ
ZZ&Z;
ZZ#Z
https://some.site.com/index.php?r=billMail/confirmNewBillMail&code;=pYgJeYbpnSsaGdSRoKgfa9bd0fb4248dbb437c745afbb6d1b29tvPsONXEQApNxxswCSZ

@Alir3z4 please fix this. We can't use html2text to process text containing URLs, because it inserts a semicolon into the URL.

MonkzCode, Feb 20 '24 09:02
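
A possible workaround for the behaviour reported in this thread, as a minimal sketch rather than a fix in html2text itself: escape every bare & as &amp; before handing the HTML to the converter, so that nothing after the & can be taken for an entity name. The helper name escape_bare_ampersands and the regular expression are assumptions made for this illustration; they are not part of html2text. The sketch assumes that anything of the form &name; or &#123; or &#x1F; is an intended entity and leaves it alone.

```python
import re

# Hypothetical helper, not part of html2text.
# Matches an "&" that does NOT start something shaped like an entity:
#   &name;   &#123;   &#x1F;
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)"
)


def escape_bare_ampersands(html: str) -> str:
    """Replace every bare '&' with '&amp;', leaving entity-shaped text untouched."""
    return _BARE_AMPERSAND.sub("&amp;", html)


# The sample from the original report: the "&" in "K&N." is escaped.
print(escape_bare_ampersands("<p>Sample text K&N. Sample text.</p>"))
# <p>Sample text K&amp;N. Sample text.</p>

# The URL case from the second comment: "&code=" becomes "&amp;code=".
print(escape_bare_ampersands("index.php?r=billMail/confirmNewBillMail&code=abc"))
# index.php?r=billMail/confirmNewBillMail&amp;code=abc

# Existing entities are not double-escaped.
print(escape_bare_ampersands("a &amp; b &#38; c &#x26; d"))
# a &amp; b &#38; c &#x26; d
```

The escaped string can then be passed to test.handle(...) as in the original report. I would expect html2text to turn &amp; back into a plain &, but I have not checked this against every html2text version. One limitation: text that happens to look like an entity, such as a URL containing &code;, is left unchanged by this sketch.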