
Jsoup.connect(url.toString()).get() returns incomplete documents?

Open: alamar opened this issue 2 weeks ago

Hello!

I'm debugging a mysterious issue where Jsoup.connect(url.toString()).get() returns a document that terminates abruptly in the middle. It closes all open tags, but the body is otherwise incomplete. I've written a simple reproducer:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPInputStream;
import javax.net.ssl.HttpsURLConnection;
import org.jsoup.Jsoup;

public class UrlRider {
    public static void main(String[] args) throws IOException {
        for (String uri : Files.readAllLines(Path.of("all-paths.txt"))) {
            System.err.println(uri);
            URL url = new URL("https://repo1.maven.org/maven2/" + uri);

            // Reference fetch: plain URLConnection with manual gzip decoding
            HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
            conn.addRequestProperty("Accept-Encoding", "gzip");
            try (InputStream gis = new GZIPInputStream(conn.getInputStream());
                 BufferedReader br = new BufferedReader(new InputStreamReader(gis, StandardCharsets.UTF_8))) {
                List<String> lines = br.lines().toList();
                System.err.println(lines.subList(Math.max(0, lines.size() - 4), lines.size()));
            }

            // Same URL fetched through jsoup; print the last four lines for comparison
            List<String> list = new BufferedReader(new StringReader(Jsoup.connect(url.toString()).get().toString()))
                    .lines().toList();
            System.err.println(list.subList(Math.max(0, list.size() - 4), list.size()));
        }
    }
}

Example input:

org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/
org/apache/cxf/cxf-tools-wsdlto-frontend-javascript/4.0.6/
org/apache/cxf/systests/cxf-systests-rs-sse-base/4.0.6/
# may add more versions for better reproducibility

The output is like this:

org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/
[	<hr/>, </body>, , </html>]
[<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1">cxf-rt-rs-service-description-common-openapi-...</a>  20</pre>,   </main>,  </body>, </html>]
org/apache/cxf/cxf-tools-wsdlto-frontend-javascript/4.0.6/
[	<hr/>, </body>, , </html>]
[  </main>,   <hr>,  </body>, </html>]

Note how the second one is correct, ending in <hr>, but the first one is incorrect.

This happens both with an HTTP proxy and without, and across a wide range of JVM versions. I'm genuinely confused. As a reference, I'm using the same URLConnection, which gets the content right every time.

alamar avatar Dec 11 '25 11:12 alamar

Seen in both 1.18.3 and 1.21.2

alamar avatar Dec 11 '25 11:12 alamar

I've just reproduced that issue on a completely different box with different network connection and JVM setup.

I wonder if I'm missing something stupid like Document.readAll()?

alamar avatar Dec 11 '25 19:12 alamar

Hi there,

I can't repro it. Here's a simple example:

import org.jsoup.*;
import org.jsoup.nodes.Document;

class MavenDirCheck {
    public static void main(String[] args) throws Exception {
        String url = "https://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/";
        Document doc = Jsoup.connect(url).get();
        String html = doc.html();
        System.out.println("chars=" + html.length());
        System.out.println(html);
    }
}

Which produces the correct:

chars=6087
<!doctype html>
<html>
 <head>
  <title>Central Repository: org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6</title>

(snip)

<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.pom.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.pom.sha1">cxf-rt-rs-service-description-common-openapi-...</a>  2024-12-03 16:46        40      
		</pre>
  </main>
  <hr>
 </body>
</html>

I was a bit confused by the rest of your example with the manual gzip handling etc.; you could simplify all of that by using jsoup's core Connection implementation, which already handles gzip and appropriate buffered reads.

If you have a URL example that the above snippet does truncate, please provide it. You can also directly inspect the Response objects for the pre- and post-parse data to debug further.

There is a maxBodySize setting that defaults to 2 MB; you can set it to 0 for no limit. But from the code and URL example you showed, that doesn't seem to be the issue.
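A sketch of that debugging approach, assuming jsoup is on the classpath: execute() gives you a Response whose raw bytes and detected charset can be compared against the parsed document. The URL is one from the report; maxBodySize(0) lifts the default 2 MB cap.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ResponseInspect {
    public static void main(String[] args) throws Exception {
        String url = "https://repo1.maven.org/maven2/org/apache/cxf/cxf-tools-wsdlto-frontend-javascript/4.0.6/";
        Connection.Response res = Jsoup.connect(url)
                .maxBodySize(0)          // 0 = unlimited; default is 2 MB
                .execute();
        byte[] raw = res.bodyAsBytes(); // pre-parse body as received
        Document doc = res.parse();     // post-parse document
        System.out.println("raw bytes=" + raw.length
                + ", charset=" + res.charset()
                + ", parsed chars=" + doc.html().length());
    }
}
```

If the raw byte count is right but the parsed output is short, the loss is happening inside the parse path rather than on the wire.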

jhy avatar Dec 11 '25 23:12 jhy

On the first run of your program I am getting no <hr>:

<!doctype html>
<html>
 <head>
  <title>Central Repository: org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6</title>

(snip)

<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.md5" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.md5">cxf-rt-rs-service-description-common-openapi-...</a>  2024-12-03 16:46        32      
<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1">cxf-rt-rs-service-description-common-openapi-...</a>  20</pre>
  </main>
 </body>
</html>

Though it is not guaranteed to happen every time; you may need to re-run it a few times.

alamar avatar Dec 12 '25 08:12 alamar

The URL + GZIP path is a reference implementation that returns the correct result every time, which makes me think this is not a network-stack issue or a Java URLConnection issue, but jsoup.

alamar avatar Dec 12 '25 08:12 alamar

OK, I think I have a repro based on how the server doesn't return a charset in the header and we try to fast-path the detection parse. May be an issue in how we are treating isAvailable(). Will dig in a bit more.

jhy avatar Dec 12 '25 12:12 jhy

Ok, I believe I have found and fixed the root cause.

These Maven directory pages do not send a charset header, so jsoup tries to be efficient about issuing a small read to detect the charset. It caps the response stream to the first ~5KB, does a quick UTF-8 parse, and reuses that parse if the underlying stream appears fully read. Our buffering layer could read past the 5KB cap into its 8KB buffer, and a timing-dependent available() (0 vs >0) would sometimes trigger an extra read that hit EOF and flipped baseReadFully() true. We then reused the capped pre-parse and silently dropped the tail of the document (e.g. the <hr>).
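To illustrate the general pitfall (a stdlib-only sketch, not jsoup's actual internals; the class and names are hypothetical): per the InputStream contract, available() returning 0 does not mean end-of-stream, which is exactly what a network stream looks like when the next packet hasn't arrived yet. Only read() returning -1 signals EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AvailableDemo {
    // Mimics a network stream whose next chunk hasn't arrived yet:
    // bytes remain, but available() reports 0 (legal per the contract).
    static class SlowStream extends InputStream {
        private final InputStream inner;
        SlowStream(byte[] data) { inner = new ByteArrayInputStream(data); }
        @Override public int read() throws IOException { return inner.read(); }
        @Override public int available() { return 0; }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new SlowStream("tail of the document".getBytes());
        // Wrong: treating available()==0 as "fully read" would drop the tail.
        boolean looksDone = in.available() == 0;
        // Right: the stream still yields data when actually read.
        int first = in.read();
        System.out.println("available()==0? " + looksDone
                + ", but read() still returned: " + first); // 't' == 116
    }
}
```

A reader that uses available() as a liveness hint must still treat EOF as the only authoritative "fully read" signal, which is the distinction the fix restores.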

Corrected the fill() code and refactored how we track marks and remaining capacity, and added better test coverage that would fail without the fix.

I also reran the above code in a loop 100 times and confirmed that I received the full data each time.

@alamar can you go ahead and test this and confirm it's corrected?

jhy avatar Dec 13 '25 05:12 jhy

I can no longer see the issue with the fix - much appreciated!

alamar avatar Dec 13 '25 09:12 alamar

Great, thank you for the report and the confirmation. This one was pernicious, so I'm glad we've found it.

jhy avatar Dec 13 '25 11:12 jhy