node-html-to-text

String after < is completely removed, if it is not followed by a space

Open ajayRaghav37 opened this issue 7 years ago • 7 comments

SNIPPET TO REPRODUCE

const htmlToText = require('html-to-text');

let textResponse = htmlToText.fromString('<p>there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.</p>', {
    wordwrap: false
});

console.log(textResponse);

EXPECTED

there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.

ACTUAL OUTPUT

there are definitely

ajayRaghav37 Jun 14 '18 14:06

The problem is that < is interpreted as the start of an opening tag, so you need to replace it with &lt;. This is not a problem with this module; the problem seems to be in the HTML parser it uses.

mlegenhausen Jun 15 '18 07:06

I cannot replace every < with &lt; because the input is dynamic and I need to preserve both the HTML markup and the plain text. It wouldn't be a problem with correct use of punctuation by the user 😄

Anyway, I will try to work around this and will post once I am done.
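
A minimal sketch of one such workaround, under a stated assumption: a < only starts markup when the next character is an ASCII letter, /, ! or ? (this mirrors the tag open state of the HTML tokenizer), so any other < can be escaped before the string reaches html-to-text. escapeStrayLessThan is a hypothetical helper, not part of html-to-text, and it does nothing for input such as <ten thousand, where a letter follows the <.

```javascript
// Hypothetical pre-processing helper; not part of html-to-text.
// Escapes every "<" that is NOT followed by an ASCII letter, "/", "!" or "?",
// i.e. every "<" that cannot begin a tag, end tag, comment or doctype.
function escapeStrayLessThan(html) {
  return html.replace(/<(?![A-Za-z\/!?])/g, '&lt;');
}

const input = '<p>there are definitely <10,000 terrestrial planets in the universe.</p>';
console.log(escapeStrayLessThan(input));
// <p>there are definitely &lt;10,000 terrestrial planets in the universe.</p>
```

The escaped string can then be passed to the converter exactly as in the snippet above.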

ajayRaghav37 Jul 03 '18 15:07

This particular example doesn't reproduce in version 7. htmlparser2 got smarter recently and doesn't consider <10,000 ... as a tag anymore. Still, it's not perfect and can be confused in other situations, such as <ten thousand ....
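
For input like <ten thousand, one possible heuristic is to escape every < that is not followed by a tag name from an allowlist. This is a rough sketch, not something html-to-text provides: escapeUnknownTags and KNOWN_TAGS are hypothetical names, the list is only illustrative, and text that happens to start with a listed name (for example <a lot of ...) would still be taken for a tag.

```javascript
// Illustrative allowlist; a real one would cover every tag the input may use.
const KNOWN_TAGS = ['a', 'b', 'br', 'div', 'em', 'i', 'li', 'ol', 'p', 'span', 'strong', 'ul'];

// Hypothetical helper: escapes "<" unless it starts "<name" or "</name" for an
// allowlisted name (followed by whitespace, "/" or ">"), or starts "<!".
// Assumes the names are plain alphanumerics, so no regex escaping is done.
function escapeUnknownTags(html, knownTags = KNOWN_TAGS) {
  const names = knownTags.join('|');
  const strayLessThan = new RegExp(`<(?!\\/?(?:${names})(?=[\\s/>])|!)`, 'gi');
  return html.replace(strayLessThan, '&lt;');
}

console.log(escapeUnknownTags('<p>there are definitely <ten thousand planets.</p>'));
// <p>there are definitely &lt;ten thousand planets.</p>
```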

KillyMXI Feb 16 '21 23:02

You know what? Even Blink (Chrome's engine) is confused by <ten thousand .... I suppose it might be a performance optimization: being ready to roll back the parser state when something doesn't make sense might be costly and not worth it at scale.

The HTML spec also doesn't seem to help: it is really permissive about tag attributes and, as far as I can see, doesn't even ban the < character from occurring inside a tag.
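
To illustrate that permissiveness, here is a deliberately simplified sketch of how a fragment that begins with < plus a letter gets consumed as a start tag up to the next >. It is not the spec's tokenizer (attribute values and quoting are ignored), and readAsStartTag is a hypothetical name.

```javascript
// Simplified illustration only: treats everything between "<letter..." and the
// next ">" as a tag name followed by whitespace- or "/"-separated attribute names.
// Returns null when the fragment does not start with "<" + ASCII letter,
// or when no ">" follows.
function readAsStartTag(fragment) {
  const match = /^<([A-Za-z][^\s\/>]*)([^>]*)>/.exec(fragment);
  if (!match) return null;
  const [, tagName, rest] = match;
  const attributeNames = rest.split(/[\s\/]+/).filter(Boolean);
  return { tagName: tagName.toLowerCase(), attributeNames };
}

console.log(JSON.stringify(readAsStartTag('<ten thousand planets</p> and more text')));
// {"tagName":"ten","attributeNames":["thousand","planets<","p"]}
console.log(JSON.stringify(readAsStartTag('<10,000 planets</p>')));
// null
```

In this reading the words after <ten become attribute names, one of them containing a literal <, which is why the visible text disappears from the output.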

Comparing the behavior of the many JS HTML parsers takes some effort. So far I know that Angular has a particularly smart parser, but that's probably not a great dependency for a project like html-to-text. The majority seem to allow out-of-spec stuff such as non-alphanumeric tag names, much like Blink.

KillyMXI Feb 17 '21 16:02

Ok, now I'm pretty confident there is no parser to switch to in order to address this issue. https://astexplorer.net/ contains most of the ones worth looking at, and I made a PR there for the only one missing. There are more projects, but those are either unhealthy or reuse one of the existing parsers such as parse5.

@angular/compiler contains a nice parser but in itself it doesn't look like a good dependency. Forking it might be a way to go but I'm not convinced it is the right way to go. I would prefer not to maintain a parser too...

If there is a good example of how a certain HTML fragment should be interpreted according to the spec and how the result differs in AST explorer, it would be better to file that upstream (in the parser repo, htmlparser2).

I'll keep this issue open as a reference, but there is nothing more I can do about it, for now at least.

KillyMXI Feb 21 '21 18:02

I am facing the same issue even though the HTML I pass in has &lt; instead of <. My HTML:

<div>
    <ul>
        <li><i>Point 1 - this is point 1</i></li>
        <li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
    </ul>
</div>

The output completely skips "this is point 2".

sairupesh Apr 13 '21 14:04

@sairupesh I can't reproduce this. Sounds like you're unescaping html somewhere in your pipeline before calling html-to-text.

const { htmlToText } = require('html-to-text');

const text = htmlToText(
  `<div>
  <ul>
      <li><i>Point 1 - this is point 1</i></li>
      <li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
  </ul>
</div>`
);
console.log(text);

Output:

 * Point 1 - this is point 1
 * Point 2 - <this is point 2>
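
A sketch of how that could happen, assuming some earlier step in the pipeline decodes entities. The decodeBasicEntities helper is hypothetical and only stands in for whatever that step might be. Once &lt; has been turned back into <, the text looks like a tag by the time html-to-text sees it.

```javascript
// Hypothetical stand-in for an upstream step that unescapes HTML.
function decodeBasicEntities(html) {
  return html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // "&amp;" is decoded last to avoid double-decoding
}

const escaped = '<span>Point 2 - &lt;this is point 2&gt;</span>';
const unescaped = decodeBasicEntities(escaped);

console.log(escaped);   // entities intact: a parser sees "<this is point 2>" as text
console.log(unescaped); // <span>Point 2 - <this is point 2></span>
```

Logging the exact string right before the conversion call is a quick way to check which of the two forms is actually being passed.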

KillyMXI Apr 13 '21 14:04