html2text
Abusive removal of br nodes leads to incorrect output
Hello!
There is some code that intentionally removes <br> nodes when they are the last child of a node that also contains text. Here is a very simple example of how this can lead to incorrect results (this is the kind of markup I receive in badly formed HTML emails):
<font size="+1">Vikings: Wolves of Midgard<br></font><font size="+1">Valkyria Chronicles<br>
<br>
World Of Warcraft Battlechest</font>
The expected output would be:
Vikings: Wolves of Midgard
Valkyria Chronicles
World Of Warcraft Battlechest
The actual output is:
Vikings: Wolves of MidgardValkyria Chronicles
World Of Warcraft Battlechest
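To make the expected behaviour concrete, here is a minimal Python sketch in which every <br> becomes a line break regardless of its position inside the parent node. It is only an illustration, not html2text's actual code (the library is written in PHP); the names BrAwareExtractor, to_text and SAMPLE are hypothetical. Note that in this sketch the two consecutive <br> elements produce an empty line between the second and third titles.

```python
import re
from html.parser import HTMLParser

# The markup from the report above.
SAMPLE = (
    '<font size="+1">Vikings: Wolves of Midgard<br></font>'
    '<font size="+1">Valkyria Chronicles<br>\n'
    '<br>\n'
    'World Of Warcraft Battlechest</font>'
)


class BrAwareExtractor(HTMLParser):
    """Collects text and turns every <br> into a newline (illustration only)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A <br> always yields a line break, even as the last child of its parent.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Newlines in the HTML source are not significant; fold whitespace runs.
        self.parts.append(re.sub(r"\s+", " ", data))


def to_text(html):
    parser = BrAwareExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Trim the stray spaces left around the inserted line breaks.
    return "\n".join(line.strip() for line in text.split("\n")).strip()


if __name__ == "__main__":
    print(to_text(SAMPLE))
```

The point of the sketch is that the first two titles end up on separate lines, because the trailing <br> inside the first <font> element is kept rather than dropped.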
I agree this is a bug. If anyone has the chance to make a PR (with tests) that fixes this, that would be amazing!
I put in a PR a few days ago to address this:
https://github.com/soundasleep/html2text/pull/75
Hey, I want to try to fix this problem.
@soundasleep, can you help with PR #75? Thanks!