Jsoup - Not able to identify escaped/unescaped HTML entity in the text nodes
I am not able to tell whether the input document had & or &amp; in a text node, since Jsoup escapes the character in text nodes. The same goes for other entities, such as < and &lt;.
This gives Jsoup users no way to act on what the input actually contained. For example, we may want to remove a literal < character in a text node but preserve it when it is given as the entity &lt;.
Note: Please let me know if there is already a way to differentiate this.
An option to tell Jsoup not to modify text nodes would be very helpful. It would give users more flexibility and control.
@jhy
I tried different methods in TextNode to get the original input text content, but none of them worked.
Example:
Input: <p> actual_lt: < || escaped_lt: &lt; </p>
Document doc = Jsoup.parse("<p> actual_lt: < || escaped_lt: &lt; </p>");
for (TextNode textNode : doc.selectFirst("p").textNodes()) {
    System.out.println("textNode.toString():-" + textNode.toString());
    System.out.println("textNode.text():-" + textNode.text());
    System.out.println("textNode.getWholeText():-" + textNode.getWholeText());
    System.out.println("textNode.outerHtml():-" + textNode.outerHtml());
}
Expected (from any one of these methods): actual_lt: < || escaped_lt: &lt;
Output:
textNode.toString():- actual_lt: &lt; || escaped_lt: &lt;
textNode.text():- actual_lt: < || escaped_lt: <
textNode.getWholeText():- actual_lt: < || escaped_lt: <
textNode.outerHtml():- actual_lt: &lt; || escaped_lt: &lt;
@jhy
Can you explain the value here of this suggestion? What's a real example where this would be helpful?
With this bug, we cannot tell whether &nbsp;hello or &nbsphello was present in the input. For any other bugs, a method like textNode.getRawText(), which returns the text node exactly as it appeared in the input, would let users apply a temporary fix.
Similarly, a character reference for CR in the input is changed to the CR character, and one for = is changed to the equals character, once they are parsed by Jsoup.
However, I would prefer that HTML entities not be unescaped when they are parsed. I even tried the xhtml escape mode, which is expected to leave HTML entities unchanged (except lt, gt, amp, and quot), but the result is always the same as with the base escape mode. I tried different versions and none of them worked. I am not sure whether this is a bug.
Sure, but I am not clear on what you are actually trying to achieve. What feature are you trying to build that would utilize functionality like this?
jsoup parses HTML. That's fundamentally what it does. If we didn't decode, we wouldn't be parsing.
Escape modes control the output, not the input. Input HTML always decodes using the full set.
(Closing; not planned)
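Since input is always decoded at parse time, one possible workaround (not something proposed in this thread, and not a jsoup feature) is to pre-process the raw input before the parser sees it. The sketch below escapes every & so that an entity such as &lt; reaches the parser as &amp;lt; and is decoded back to the literal text &lt;, while a raw < is left alone. The class name EntityPreserver and the method protect are hypothetical helpers, and the approach assumes you only care about text nodes: attribute values would be double-escaped in the same way, and script or style content would be altered.

```java
// Hypothetical helper, not part of jsoup: keeps the original entity spelling
// visible in decoded text by escaping '&' before the input is parsed.
class EntityPreserver {

    /**
     * Replaces every '&' in the raw input with "&amp;".
     * An HTML parser then decodes "&amp;lt;" to the literal text "&lt;"
     * instead of to '<', so escaped and unescaped forms stay distinguishable.
     */
    public static String protect(String rawHtml) {
        return rawHtml.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String input = "<p> actual_lt: < || escaped_lt: &lt; </p>";
        // prints: <p> actual_lt: < || escaped_lt: &amp;lt; </p>
        System.out.println(protect(input));
        // Assumption: the protected string is what gets passed to the parser,
        // e.g. Jsoup.parse(protect(input)).
    }
}
```

With the protected input, the decoded text of the paragraph should read "actual_lt: < || escaped_lt: &lt;", which is the distinction asked for above. Serialized output would escape the & again, so any HTML written back out would need the reverse replacement applied.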