jsoup 1.15.2 appears to insert new spaces

Open henricook opened this issue 3 years ago • 3 comments

Hi all,

When upgrading from 1.15.1 to 1.15.2 I appear to have encountered unexpected insertion of spaces - is this a bug or desired behaviour?

In this test with a multiline string:

<h1>This is my comment</h1>

<p>Lorem ipsum</p>

<span>Thanks</span>

1.15.1 output of val parsed = Jsoup.parse(textWithHtml):

<h1>This is my comment</h1>
<p>Lorem ipsum</p><span>Thanks</span>

1.15.2 output of val parsed = Jsoup.parse(textWithHtml):

<h1>This is my comment</h1>
<p>Lorem ipsum</p> <span>Thanks</span>

(space inserted between </p> and <span>)

EDITED: To remove references to using .text() on the output of Jsoup.parse

henricook avatar Jul 05 '22 06:07 henricook

Thanks for sharing the examples. I don't have Kotlin installed on my machine, so I'm trying to reproduce this with jsoup 1.15.2 and Java 17.0.4. The following code snippet also produces a space between ipsum</p> and <span> when I use the Jsoup.parse() method without the text() method.

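import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parses the multiline HTML and returns the resulting Document.
// Jsoup.parse(String) declares no checked exceptions, so no throws clause is needed.
Document jsoupHtml() {
    String multiLineHtml = """
        <h1>This is my comment</h1>

        <p>Lorem ipsum</p>

        <span>Thanks</span> """;
    Document resultingHtml = Jsoup.parse(multiLineHtml);

    return resultingHtml;
}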

The above code produces this HTML for me:

<html>
 <head></head>
 <body>
  <h1>This is my comment</h1>
  <p>Lorem ipsum</p> <span>Thanks</span>
 </body>
</html>

When I try using the text() method like you do in your example, I don't see an extra space in the final String result.

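import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parses the multiline HTML and returns only its text content.
// Jsoup.parse(String) declares no checked exceptions, so no throws clause is needed.
String jsoupHtml() {
    String multiLineHtml = """
        <h1>This is my comment</h1>

        <p>Lorem ipsum</p>

        <span>Thanks</span> """;
    Document resultingHtml = Jsoup.parse(multiLineHtml);
    String textOfHtml = resultingHtml.text();

    return textOfHtml;
}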

The above code snippet produces the following String for me, without extra spaces, when using text() and System.out.println() to print the result to an Ubuntu Linux terminal:

This is my comment Lorem ipsum Thanks

Thanks for sharing your example but maybe I'm missing something when trying to reproduce this issue?

jeffthomasweb avatar Aug 05 '22 15:08 jeffthomasweb

Hi, I checked as well and got the same result as @jeffthomasweb. My Java version is: IBM Semeru Runtime Open Edition 17.0.2.0 (build 17.0.2+8)

dcremonini avatar Aug 21 '22 21:08 dcremonini

Thanks so much for your time both, I've edited my post to remove references to using .text() - it didn't line up with the HTML outputs I pasted... I'm not sure what I was smoking there.

I've created a tiny Scala repro case, located here and also attached as two jars in a zip (each one with a different jsoup version):

https://github.com/henricook/jsoup-1802-repro

jsoup-bug-1802-jsoup-1.15.x.zip

I'm on Ubuntu / Java 11

henricook avatar Aug 22 '22 07:08 henricook

I know it looks like it, but jsoup is not inserting a space here. It is actually collapsing a newline into a single space - and would collapse multiples of those if present.

I have improved the pretty-printer to now also collapse this space, similar to the earlier behavior.
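To illustrate what "collapsing" means here, a minimal sketch in plain Java follows. It is not jsoup's actual implementation; the class name WhitespaceCollapseSketch and the collapse helper are hypothetical and exist only for this example. It shows how the blank lines between </p> and <span> in the input end up as one space in the output.

```java
import java.util.regex.Pattern;

public class WhitespaceCollapseSketch {
    // Hypothetical helper, not part of jsoup: matches any run of
    // whitespace characters (spaces, tabs, newlines).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every run of whitespace with a single space.
    public static String collapse(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The newlines between the two elements in the original input.
        String between = "</p>\n\n<span>";
        System.out.println(collapse(between)); // prints: </p> <span>
    }
}
```

Under this reading, the space in the 1.15.2 output is what remains of the whitespace already present in the source, not something newly added.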

Thanks for the report!

jhy avatar Jan 06 '23 00:01 jhy

Thanks @jhy !

henricook avatar Feb 20 '23 10:02 henricook