jsoup [Fix Issue #1070] Jsoup parse string not skipping BOM character correctly

Issue #1070 shows a case where Jsoup not parsing an HTML page correctly and putting all the information in the body instead of head. The problem can be reproducible only if use Jsoup.parse(response.body()). Using another method Jsoup.connect(url).get() won't reproduce the same problem.

After digging around for a while, it turns out that the HTML page has the BOM character 65279 in its first character. The reason that the second method, Jsoup.connect(url).get(), internally uses DataUtil.parseInputStream(), which handles the BOM character correctly. However, Jsoup.parse(string) constructs a HTMLTreeBuilder to parse the document directly, therefore the BOM handling process is not there.

This PR fixes this issue by introducing the same handling process in TreeBuilder constructor. Before it passes the input reader to the tokenizer, it reuses the same helper function DataUtil.detectCharsetFromBom to detect the BOM character, and skip the BOM character if needed.

This PR also adds a new test case in DataUtilTest to test if Jsoup.parse(string) can work correctly with BOM. After adding the fix, the problem in issue #1070 can produce the correct result.

There are already several efforts to fix the problem caused by the BOM character: Issue #348 is the first issue mentioning the problem, and it is fixed in commit https://github.com/jhy/jsoup/commit/3f9f33d88355f22aefc7ea402da09fd1950289ce , this commit only fixes the issue in parseByteData (which later becomes parseInputStream).

Later there are several commits to refactor the BOM handling code, such as https://github.com/jhy/jsoup/commit/c3cbe1b64e7f66ff9f9b53f1388eb135e6693187 , https://github.com/jhy/jsoup/commit/4eb4f2b2e88a2f9e6c5c1e8d0477060954f24218

The latest issue mentioning the problem is Issue #1003, and the corresponding fix is in https://github.com/jhy/jsoup/commit/0f7e0cc373aced32629f5321c9521f81d8248647 , however, this commit still only fixes the function parseInputStream, which is not used by Jsoup.parse(string).

@jhy I'm not sure if TreeBuilder class is the best place to put in the fix code. I am willing to devote more efforts to improve it if you could give some more advice. Thanks :)

Jun 12 '18 21:06 zuozhiw

Any idea when this could be merged ?

Jul 25 '18 08:07 panthony

Using this fix I eventually encountered a nasty issue that made me lost a fair amount of time.

Long story short, I had a piece of code that sometimes where running on a machine that had US-ASCII as default charset instead of UTF-8.

Because of this, the BOM was badly handled was it relies on Charset.defaultCharset().

This is often bad practice to rely on default Locale/Charset, maybe JSoup could take a look at https://github.com/policeman-tools/forbidden-apis (story behind this project: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html)

Dec 13 '18 10:12 panthony

Hi, thanks for this. I am worried that this only handles the UTF8 BOM, and probably only works if the string was actually decoded as a UTF8 string. It doesn't help (afaict) for UTF16 and 32 BOMs.

My thought is that we should implement so that we expect any input String to have been correctly decoded.

The methods in Connection and Jsoup.load (for file and input stream) correctly handle the different BOMs. The user is being impacted for getting the request body string and that's where the BOM is not handles (and the string is invalid effectively). How about we add the BOM routine to the body() method?

Dec 27 '18 03:12 jhy

Not sure which method we are talking about but you cannot convert back the given string to bytes without taking which charset it was since you cannot rely on defaultCharset.

In my case, if JSoup.parse were to take an InputStream instead of a String I could just wrap it inside a DOMInputStream (Apache commons-io).

Maybe this could just be documented:

This is the connection layer that handles the BOM implicitly
The String input must be cleared from the BOM markup (only the user can do this since he's the only one to know how it is encoded)
The JSoup could provide an helper that takes a the necessary information (string, charset) to cleanup the string if necessary

Dec 27 '18 07:12 panthony