Distinction between inserted empty nodes and spaces impossible

Open leomayer opened this issue 2 years ago • 4 comments

I have the following HTML code

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<p><span>Test</span> 1.10. ext</p>
<p><span>Test</span> <span>1.11. ext</span></p>
<p>Test <span>1.12. ext</span></p>
</body>
</html>

The three lines render identically in the browser. During parsing, jsoup currently adds empty TextNodes. I ignore them since they might introduce a different layout. The issue I'm facing is that the TextNode in the line with 1.11. is empty as well. I cannot distinguish the nodes artificially inserted by jsoup from those that are required, and there is no specific property to check for this case.

leomayer avatar Jun 17 '22 18:06 leomayer

Hi,

It's not clear to me what TextNodes that you are referring to. Do you mean the TextNodes in the children of the body element, between the p elements?

Those are not artificial / inserted / virtual. Those are TextNodes that hold the \n character (and any other spaces e.g. if the HTML was indented). So there is nothing to distinguish; they are just regular textnodes that the tokeniser & parser saw while parsing the input.

I ignore them since they might introduce a different layout.

What is the underlying issue that you are trying to solve? There should be no cases of using jsoup generated HTML (via the html() methods, with pretty-printing on or off) that causes a layout issue. If there is, can you please post the specific HTML that causes that? (Other than CSS stylesheets redefining a block to be whitespace sensitive, which if pretty-printing, may normalize whitespace).

I did make some improvements in e714ef12fab4fd00cf7133a22fba4a71ccf7af8e recently which improved the newline normalization on HTML serialization.

jhy avatar Jun 19 '22 01:06 jhy

My main problem is distinguishing between a TextNode that holds a regular space and a TextNode that is just a linefeed. The latter I ignore, while the former I cannot. The linefeed is also system dependent, which means it differs from OS to OS.

At the moment I check the TextNode with isBlank() and ignore it if that returns true.

What I take from your response is that I cannot rely on isBlank() alone, since I need to figure out whether the TextNode isLineSeparator(). That is doable, but honestly I don't need the TextNodes from the input that are just line separators. My preferred solution would be to ignore them anyway. Now I understand that this was my MAJOR point in #1361: to ignore them.

My other proposal would be a property that indicates whether the TextNode was created from a (system) linefeed or not.

My last resort would be to check on my own whether it is a linefeed or not.

leomayer avatar Jun 19 '22 07:06 leomayer

I would still be interested to know the answer to my Q:

What is the underlying issue that you are trying to solve? There should be no cases of using jsoup generated HTML (via the html() methods, with pretty-printing on or off) that causes a layout issue. If there is, can you please post the specific HTML that causes that? (Other than CSS stylesheets redefining a block to be whitespace sensitive, which if pretty-printing, may normalize whitespace).

I.e., why do you need this?

If you just strip newlines, depending on if those are from a block or an inline element, you are going to get issues with text running together.

But yes, you could certainly test the TextNode value (or getWholeText) and handle newline characters.
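
As a rough illustration of that kind of check, here is a minimal sketch that classifies the raw text of a node. It works on a plain String so that it stays independent of jsoup. The class name BlankText, the enum Kind and the method classify are hypothetical names made up for this example, not part of the jsoup API. Both \n and \r are treated as line breaks, so the result should not depend on the OS the HTML was written on.

```java
final class BlankText {
    /** Hypothetical categories for the raw text of a text node. */
    enum Kind { NOT_BLANK, EMPTY, SPACES_ONLY, CONTAINS_LINE_BREAK }

    /**
     * Classifies raw, un-normalized text.
     * Assumption: the caller passes the original text of the node,
     * e.g. the value returned by TextNode.getWholeText().
     */
    static Kind classify(String wholeText) {
        if (wholeText.isEmpty()) {
            return Kind.EMPTY;
        }
        boolean lineBreak = false;
        for (int i = 0; i < wholeText.length(); i++) {
            char c = wholeText.charAt(i);
            if (c == '\n' || c == '\r') {
                lineBreak = true;            // covers \n, \r and \r\n
            } else if (c != ' ' && c != '\t' && c != '\f') {
                return Kind.NOT_BLANK;       // any other character is real content
            }
        }
        return lineBreak ? Kind.CONTAINS_LINE_BREAK : Kind.SPACES_ONLY;
    }

    public static void main(String[] args) {
        System.out.println(classify(" "));     // SPACES_ONLY
        System.out.println(classify("\r\n"));  // CONTAINS_LINE_BREAK
        System.out.println(classify("Test ")); // NOT_BLANK
    }
}
```

With jsoup, the argument would presumably be textNode.getWholeText(). Note that isBlank() alone returns true for both SPACES_ONLY and CONTAINS_LINE_BREAK, which is why it cannot tell the two apart.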

I do have a change I will finish soon that improves endline newline normalization in the html() output so that when a textnode is indented, it doesn't end up with a trailing space. Not sure if that is relevant here.

jhy avatar Jun 22 '22 06:06 jhy

First of all, I'm glad to have found the root cause of my misery - for coding ;-) I was NOT aware that the line feed was the root cause of the differences.

I.e., why do you need this?

The main task is to convert an HTML file to a PDF document. For that I need to handle some special cases. One of them is that I don't want extra spaces where the text should be stitched together or where they are not necessary (and cause other problems in the long run).

<p><span>Tester</span>\n
<span>1.11. ext</span></p>\n

and

<p><span>Tester</span> \n
<span>1.11. ext</span></p>\n

and

<p><span>Tester</span> <span>1.11. ext</span></p>\n
<p><span>Test</span> <span>1.12. ext</span></p>\n

are all displayed the same way in the browser, i.e. with a regular space between Tester and 1.11. Interestingly, jsoup parses the second one as a single regular space, while the first is a regular line feed. But in the first case a space is required, while in the second case jsoup already did its work perfectly. In the third scenario I don't need the white space that is induced by the linefeed. This causes real trouble for transpiling.
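
For reference, the reason the browser shows these the same way is that, under the default CSS white-space: normal, each run of spaces, tabs and line breaks inside inline content is laid out as roughly one space. A minimal sketch of that rule on plain strings follows. The class InlineWhitespace and the method collapse are hypothetical names for this example, and the input strings stand for the text of the two spans joined by the whitespace between them.

```java
final class InlineWhitespace {
    /**
     * Replaces every run of ASCII whitespace with a single space.
     * Simplification: this ignores white-space: pre and similar CSS,
     * and does not trim whitespace at the start or end of a block.
     */
    static String collapse(String text) {
        return text.replaceAll("[ \\t\\n\\f\\r]+", " ");
    }

    public static void main(String[] args) {
        String[] cases = {
            "Tester\n1.11. ext",   // line feed only
            "Tester \n1.11. ext",  // space followed by line feed
            "Tester 1.11. ext"     // single space
        };
        for (String c : cases) {
            System.out.println(collapse(c)); // each line prints: Tester 1.11. ext
        }
    }
}
```

Deleting the line feed instead of collapsing it would turn the first case into "Tester1.11. ext", which is the running-together problem mentioned earlier in the thread.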

For the coding logic, I have trouble deciding: when is an empty TextNode OK and when is it not OK?

Does that make my problem clearer?

leomayer avatar Jun 22 '22 06:06 leomayer

In the parse to DOM (which creates TextNodes and all other Nodes), jsoup does not really normalize any spaces. It holds the original spaces and newlines from the original source. So, you can inspect those in code and do as you need.

On output serialization to HTML, if pretty-printing is on, jsoup does normalize spaces, and will elide non-significant whitespace in some cases, and insert newlines & padding in others.

jhy avatar Jan 06 '23 21:01 jhy