Jsoup.connect(url.toString()).get() returns incomplete documents?
Hello!
I'm debugging a mysterious issue where Jsoup.connect(url.toString()).get() returns a document that terminates abruptly in the middle. jsoup closes all open tags, but otherwise the body is incomplete. I've written a simple reproducer:
import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;
import java.util.zip.GZIPInputStream;
import javax.net.ssl.HttpsURLConnection;
import org.jsoup.Jsoup;

public class UrlRider {
    public static void main(String[] args) throws IOException {
        for (String uri : Files.readAllLines(Path.of("all-paths.txt"))) {
            System.err.println(uri);
            URL url = new URL("https://repo1.maven.org/maven2/" + uri);
            // Reference fetch: plain HttpsURLConnection with manual gzip decoding
            HttpsURLConnection conns = (HttpsURLConnection) url.openConnection();
            conns.addRequestProperty("Accept-Encoding", "gzip");
            try (InputStream gis = new GZIPInputStream(conns.getInputStream());
                 BufferedReader br = new BufferedReader(new InputStreamReader(gis, StandardCharsets.UTF_8))) {
                List<String> lines = br.lines().toList();
                System.err.println(lines.subList(Math.max(0, lines.size() - 4), lines.size()));
            }
            // Same URL through jsoup; print the last four lines for comparison
            List<String> list = new BufferedReader(new StringReader(Jsoup.connect(url.toString()).get().toString()))
                    .lines().toList();
            System.err.println(list.subList(Math.max(0, list.size() - 4), list.size()));
        }
    }
}
Example input:
org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/
org/apache/cxf/cxf-tools-wsdlto-frontend-javascript/4.0.6/
org/apache/cxf/systests/cxf-systests-rs-sse-base/4.0.6/
# may add more versions for better reproducibility
The output is like this:
org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/
[ <hr/>, </body>, , </html>]
[<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1">cxf-rt-rs-service-description-common-openapi-...</a> 20</pre>, </main>, </body>, </html>]
org/apache/cxf/cxf-tools-wsdlto-frontend-javascript/4.0.6/
[ <hr/>, </body>, , </html>]
[ </main>, <hr>, </body>, </html>]
Note how jsoup's output (the second list of each pair) is correct for the second URL, ending in <hr>, but is truncated for the first URL.
This happens both with and without an HTTP proxy, and on a wide range of JVM versions. I'm genuinely confused. As a reference I'm using a plain URLConnection, which gets the content right every time.
Seen in both 1.18.3 and 1.21.2.
I've just reproduced that issue on a completely different box with different network connection and JVM setup.
I wonder if I'm missing something stupid like Document.readAll()?
Hi there,
I can't repro it. Here's a simple example:
import org.jsoup.*;
import org.jsoup.nodes.Document;

class MavenDirCheck {
    public static void main(String[] args) throws Exception {
        String url = "https://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/";
        Document doc = Jsoup.connect(url).get();
        String html = doc.html();
        System.out.println("chars=" + html.length());
        System.out.println(html);
    }
}
Which produces the correct output:
chars=6087
<!doctype html>
<html>
<head>
<title>Central Repository: org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6</title>
(snip)
<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.pom.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.pom.sha1">cxf-rt-rs-service-description-common-openapi-...</a> 2024-12-03 16:46 40
</pre>
</main>
<hr>
</body>
</html>
I was a bit confused by the rest of your example with the manual gzip handling etc.; you could simplify all of that by using jsoup's core Connection implementation, which already handles gzip and appropriately buffered reads.
If you have a URL example that the above snippet does truncate, please provide that. You can also directly inspect the Response objects for the pre- and post-parse data to further debug.
There is a maxBodySize setting that defaults to 2MB; you can set it to 0 to allow an unlimited body size. But from the code and URL example you showed, that doesn't seem to be the issue.
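For anyone following along, here is a minimal sketch of inspecting the pre- and post-parse data on a Response, with the body-size cap lifted. It assumes the Maven URL from earlier in this thread; the class name ResponseInspector and the configure helper are made up for illustration. body() is called before parse() so that the bytes are buffered and both views come from the same response.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ResponseInspector {
    /** Hypothetical helper: builds a connection with the body-size cap removed (0 = unlimited). */
    static Connection configure(String url) {
        return Jsoup.connect(url).maxBodySize(0);
    }

    public static void main(String[] args) throws IOException {
        String url = "https://repo1.maven.org/maven2/org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6/";
        Connection.Response res = configure(url).execute();

        String raw = res.body();     // pre-parse: the decoded body as jsoup received it
        Document doc = res.parse();  // post-parse: the DOM built from the buffered body

        System.out.println("charset from header: " + res.charset()); // may be null if the server sent none
        System.out.println("raw chars: " + raw.length());
        System.out.println("raw has <hr>: " + raw.contains("<hr"));
        System.out.println("parsed has <hr>: " + !doc.select("hr").isEmpty());
    }
}
```

If the raw body contains the tail but the parsed document does not, the loss happened in parsing; if both are missing it, the loss happened while reading the response.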
On the first run of your program I am getting no <hr>:
<!doctype html>
<html>
<head>
<title>Central Repository: org/apache/cxf/cxf-rt-rs-service-description-common-openapi/4.0.6</title>
(snip)
<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.md5" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.md5">cxf-rt-rs-service-description-common-openapi-...</a> 2024-12-03 16:46 32
<a href="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1" title="cxf-rt-rs-service-description-common-openapi-4.0.6.jar.sha1">cxf-rt-rs-service-description-common-openapi-...</a> 20</pre>
</main>
</body>
</html>
It is not guaranteed to happen every time, though; you may need to re-run it a few times.
URL + GZIP is a reference implementation that returns the correct result every time, which makes me think this is neither a network stack issue nor a Java URLConnection issue, but a jsoup one.
OK, I think I have a repro, based on how the server doesn't return a charset in the header and we try to fast-path the detection parse. It may be an issue in how we are treating isAvailable(). Will dig in a bit more.
Ok, I believe I have found and fixed the root cause.
These Maven directory pages do not send a charset header, so jsoup tries to be efficient about issuing a small read to detect the charset. It caps the response stream to the first ~5KB, does a quick UTF-8 parse, and reuses that parse if the underlying stream appears fully read. Our buffering layer could read past the 5KB cap into its 8KB buffer, and a timing-dependent available() (0 vs >0) would sometimes trigger an extra read that hit EOF and flipped baseReadFully() true. We then reused the capped pre-parse and silently dropped the tail of the document (e.g. the <hr>).
I corrected the fill() code, refactored how we track marks and remaining capacity, and added better test coverage that fails without the fix.
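The failure mode described above can be illustrated with a small standalone model. This is not jsoup's code: the class and method names are hypothetical, the 5KB cap and 8KB buffer are taken from the description, and the real bug depended on the timing of available(), whereas this sketch always performs the extra read so the effect is deterministic. The "strict" check is only one plausible invariant, not necessarily what the actual fix does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CappedReadModel {
    static final int CAP = 5 * 1024;         // capped charset-detection read (assumed 5KB)
    static final int BUFFER_SIZE = 8 * 1024; // buffering layer size (8KB, per the thread)

    /** Bookkeeping for one capped read through the buffering layer. */
    static final class CappedRead {
        final int delivered;       // bytes the pre-parse actually saw (at most CAP)
        final int buffered;        // bytes pulled from the source into the buffer
        final boolean sourceAtEof; // true once the source returned -1

        CappedRead(int delivered, int buffered, boolean sourceAtEof) {
            this.delivered = delivered;
            this.buffered = buffered;
            this.sourceAtEof = sourceAtEof;
        }

        /** Naive check: trusts source EOF alone, ignoring buffered bytes beyond the cap. */
        boolean naiveFullyRead() {
            return sourceAtEof;
        }

        /** Stricter check: also requires that everything buffered was delivered. */
        boolean strictFullyRead() {
            return sourceAtEof && delivered == buffered;
        }
    }

    static CappedRead read(byte[] body) throws IOException {
        InputStream source = new ByteArrayInputStream(body);
        byte[] buffer = new byte[BUFFER_SIZE];
        int buffered = 0;
        boolean sourceAtEof = false;
        // Greedy fill: keep reading until the buffer is full or the source reports EOF.
        while (buffered < BUFFER_SIZE) {
            int n = source.read(buffer, buffered, BUFFER_SIZE - buffered);
            if (n == -1) {
                sourceAtEof = true; // the "extra read" that hits EOF
                break;
            }
            buffered += n;
        }
        int delivered = Math.min(buffered, CAP); // the parser only sees up to the cap
        return new CappedRead(delivered, buffered, sourceAtEof);
    }

    public static void main(String[] args) throws IOException {
        CappedRead r = read(new byte[6000]); // roughly the size of the Maven directory page
        System.out.println("delivered=" + r.delivered + " buffered=" + r.buffered);
        System.out.println("naive fullyRead=" + r.naiveFullyRead());
        System.out.println("strict fullyRead=" + r.strictFullyRead());
    }
}
```

With a 6000-byte body, the buffer holds the whole document and the source is at EOF, yet only 5120 bytes reached the pre-parse. The naive check would reuse that capped parse and drop the remaining 880 bytes; the stricter check would fall back to a full parse.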
I also reran the above code in a loop 100 times and confirmed that I received the full data each time.
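A loop check along these lines can be sketched as follows. The class TruncationCheck and its countTruncated helper are hypothetical names, and the stub fetcher in main only stands in for a flaky fetch so the sketch runs without network access; using "<hr>" as the tail marker follows the symptom reported in this thread.

```java
import java.util.concurrent.Callable;

public class TruncationCheck {
    /** Calls fetch `runs` times and returns how many results lack the expected tail marker. */
    static int countTruncated(Callable<String> fetch, int runs, String tailMarker) throws Exception {
        int truncated = 0;
        for (int i = 0; i < runs; i++) {
            String html = fetch.call();
            if (!html.contains(tailMarker)) {
                truncated++;
            }
        }
        return truncated;
    }

    public static void main(String[] args) throws Exception {
        // Stub fetcher that drops the tail on every third call.
        int[] calls = {0};
        Callable<String> flaky =
                () -> (++calls[0] % 3 == 0) ? "<main></main>" : "<main></main><hr>";
        System.out.println(countTruncated(flaky, 100, "<hr>")); // prints 33
    }
}
```

To run it against the real page, the fetcher would be something like `() -> Jsoup.connect(url).get().html()`, and a result of 0 over 100 runs would match the confirmation above.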
@alamar can you go ahead and test this and confirm it's corrected?
I can no longer see the issue with the fix - much appreciated!
Great, thank you for the report and the confirmation. This one was pernicious, so I'm glad we've found it.