node-html-to-text
String after < is completely removed, if it is not followed by a space
SNIPPET TO REPRODUCE
const htmlToText = require('html-to-text');
let textResponse = htmlToText.fromString('<p>there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.</p>', {
  wordwrap: false
});
console.log(textResponse);
EXPECTED
there are definitely <10,000 terrestrial planets in the universe. Only few of them would be habitable for future human.
ACTUAL OUTPUT
there are definitely
The problem is that < is interpreted as the start of an opening tag; you need to replace it with &lt;. This is not a problem in this module; the problem seems to be in the HTML parser it uses.
I cannot replace every < with &lt; because the input is dynamic and I need to preserve both the HTML markup and the plain text. It wouldn't be a problem if users punctuated correctly 😄
Anyway, I will try to work around this and will post once I am done.
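One possible shape for such a workaround is to escape only those < characters that cannot start a tag, before the string reaches html-to-text. The sketch below assumes that a < begins markup only when it is followed by an ASCII letter, "/", "!" or "?", which roughly mirrors the tag-open rule in the HTML tokenizer. The helper name escapeStrayLessThan is hypothetical and not part of html-to-text.

```javascript
// Hypothetical pre-processing helper; not part of html-to-text.
// Escapes every "<" that is NOT followed by an ASCII letter, "/", "!" or "?",
// on the assumption that such a "<" is plain text rather than markup.
function escapeStrayLessThan(html) {
  return html.replace(/<(?![A-Za-z\/!?])/g, '&lt;');
}

const input = '<p>there are definitely <10,000 terrestrial planets</p>';
console.log(escapeStrayLessThan(input));
```

This only covers cases like <10,000. A < followed by a letter is left alone, and the helper does not know about script or style content, so it is a heuristic rather than a complete fix.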
This particular example doesn't reproduce in version 7.
htmlparser2 got smarter recently and doesn't consider <10,000 ... as a tag anymore.
Still, it's not perfect and can be confused in other situations, such as <ten thousand ....
You know what? Even Blink (Chrome's engine) is confused by <ten thousand ....
I suppose it might be a performance optimization: being ready to roll back the parser state when something doesn't make sense might be costly and not worth it at scale.
The HTML spec doesn't seem to help either: it is really permissive about tag attributes and, as far as I can see, doesn't even ban the < character from occurring inside a tag.
It takes some effort to compare behavior across the numerous JS HTML parsers. So far I know that Angular has a particularly smart parser, but that's probably not a great dependency for a project like html-to-text. The majority seem to allow out-of-spec stuff such as non-alphanumeric tag names, much like Blink.
Ok, now I'm pretty confident there is no parser to switch to in order to address this issue.
https://astexplorer.net/ contains most of the ones worth looking at, and I made a PR there for the only one missing.
There are more projects but those are either unhealthy or reusing one of the parsers such as parse5.
@angular/compiler contains a nice parser but in itself it doesn't look like a good dependency. Forking it might be a way to go but I'm not convinced it is the right way to go. I would prefer not to maintain a parser too...
If there is a clear example of how a certain HTML fragment should be interpreted according to the spec and how that differs from what AST explorer shows, it is better filed upstream (in the parser repo, htmlparser2).
I'll keep this issue open as a reference, but there is nothing more I can do about it, for now at least.
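For callers who control their input, one option that does not depend on the parser is to escape anything that looks like a tag but is not on an allowlist. This is only a sketch under assumptions: the KNOWN_TAGS set and the escapeUnknownTags name are hypothetical, and the list would have to match the markup you actually expect.

```javascript
// Hypothetical allowlist of tag names expected in the input; adjust as needed.
const KNOWN_TAGS = new Set([
  'a', 'b', 'br', 'div', 'em', 'i', 'li', 'ol', 'p', 'span', 'strong', 'ul',
]);

// Hypothetical helper; not part of html-to-text or htmlparser2.
// Keeps "<name" and "</name" when name is on the allowlist and
// escapes the "<" of everything else, including "<" followed by a non-letter.
function escapeUnknownTags(html) {
  return html.replace(/<(\/?)([A-Za-z][A-Za-z0-9]*)?/g, (match, slash, name) => {
    if (name && KNOWN_TAGS.has(name.toLowerCase())) {
      return match; // looks like a real tag, leave it alone
    }
    return '&lt;' + match.slice(1); // treat as literal text
  });
}

console.log(escapeUnknownTags('<p>fewer than <ten thousand planets</p>'));
```

The trade-off is that comments, doctype declarations and any tag missing from the list are turned into text, and prose such as "<a lot of planets" is still read as a tag because "a" is a real tag name.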
I am facing the same issue even when the HTML I pass has &lt; instead of <.
My html:
<div>
<ul>
<li><i>Point 1 - this is point 1</i></li>
<li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
</ul>
</div>
The output completely skips "this is point 2".
@sairupesh I can't reproduce this. Sounds like you're unescaping html somewhere in your pipeline before calling html-to-text.
const { htmlToText } = require('html-to-text');
const text = htmlToText(
`<div>
<ul>
<li><i>Point 1 - this is point 1</i></li>
<li><span style="font-weight: 700;">Point 2 - &lt;this is point 2&gt;</span></li>
</ul>
</div>`
);
console.log(text);
* Point 1 - this is point 1
* Point 2 - <this is point 2>
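To check whether something upstream is unescaping the markup, it may help to count escaped and raw angle brackets in the string at each stage of the pipeline, right up to the call into html-to-text. The helper below is a hypothetical debugging aid, not part of the library.

```javascript
// Hypothetical debugging helper; not part of html-to-text.
// Reports how many "&lt;" entities and how many literal "<" characters
// a string contains, so two pipeline stages can be compared.
function countAngleBrackets(html) {
  return {
    escaped: (html.match(/&lt;/g) || []).length, // entities still intact
    raw: (html.match(/</g) || []).length,        // literal "<" characters
  };
}

const stored = '<span>Point 2 - &lt;this is point 2&gt;</span>';
console.log(countAngleBrackets(stored));
```

If the escaped count drops and the raw count rises between two stages, that stage is decoding entities, and the text will reach the parser as a tag.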