html2text

Abusive removal of br nodes leads to incorrect output

Open ahfeel opened this issue 7 years ago • 4 comments

Hello !

There is some code that intentionally removes <br> nodes when they are the last child of a node that also contains text. Here is a very simple example of how this can lead to incorrect results (this is the kind of markup I receive in badly formed HTML emails):

<font size="+1">Vikings: Wolves of Midgard<br></font><font size="+1">Valkyria Chronicles<br>
<br>
World Of Warcraft Battlechest</font>

The expected output would be

Vikings: Wolves of Midgard
Valkyria Chronicles

World Of Warcraft Battlechest

The actual output is:

Vikings: Wolves of MidgardValkyria Chronicles

World Of Warcraft Battlechest
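
To make the expected behaviour concrete, here is a minimal sketch in Python. It is not html2text's code, and the names `BrPreservingText` and `to_text` are made up for illustration. It assumes the simplest possible rule: every <br> becomes a newline, wherever it sits in its parent, and newlines in the HTML source itself are discarded.

```python
from html.parser import HTMLParser


class BrPreservingText(HTMLParser):
    """Hypothetical converter: emits a newline for every <br>, even a trailing one."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A <br> always produces a line break, including when it is the
        # last child of its parent element.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Simplification: drop newlines that come from the HTML source,
        # since only <br> should create line breaks in this sketch.
        self.parts.append(data.replace("\n", ""))

    def text(self):
        return "".join(self.parts)


def to_text(html):
    parser = BrPreservingText()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    '<font size="+1">Vikings: Wolves of Midgard<br></font>'
    '<font size="+1">Valkyria Chronicles<br>\n'
    '<br>\n'
    'World Of Warcraft Battlechest</font>'
)
result = to_text(sample)
print(result)
```

With the sample above, the trailing <br> inside the first font element is kept, so the first two titles end up on separate lines as in the expected output. A real fix would still have to handle whitespace collapsing and block-level elements, which this sketch ignores.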

ahfeel, Oct 17 '17 15:10

I agree this is a bug - if anyone would have the chance to make a PR (with tests) that fixes this, that would be amazing!

soundasleep, Oct 24 '17 21:10

I put in a PR a few days ago to address this:

https://github.com/soundasleep/html2text/pull/75

NirvashPrime, Oct 13 '19 02:10

Hey, I want to try to fix this problem.

Deepakchawde, Oct 10 '20 05:10

@soundasleep can you help with PR #75? Thanks!

bilogic, Apr 28 '22 10:04