
Amount of bytes to sniff for encoding detection

Open JinsukKim opened this issue 7 years ago • 20 comments

This issue is kind of spun off of https://github.com/whatwg/encoding/issues/68 to have a bug dedicated to the specific question raised in that thread: the data size to be used for encoding detection.

Currently, Blink feeds the encoding detector the first chunk of data it gets from the network to guess the text encoding used in a given document. The size of that chunk varies depending on external conditions like network configuration, so the detected encoding can also vary. I'm experimenting with the idea discussed in https://bugs.chromium.org/p/chromium/issues/detail?id=691985#c16 so that the same amount of data is always fed to the encoding detector for a document, in order to get a consistent result regardless of other conditions.

The size I'm thinking of is 4K, though 1K was initially suggested in the Chromium bug entry. The entire document will be used if it is smaller than that. Are there any values used in other browsers for reference? I prefer more than 1K for encoding detection since, in some anecdotal observations I made, documents have ASCII-only tags and scripts in the first 1K, which don't give the detector enough clues to make a correct guess.
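For illustration, a minimal sketch of this buffering idea in Python rather than Blink's actual C++; the class, names, and the 4096-byte constant are assumptions for the example, not Chromium code:

```python
# Hypothetical sketch: buffer a fixed-size prefix before running the encoding
# detector, so the guess does not depend on how the network chunked the response.

DETECTION_PREFIX_SIZE = 4096  # the constant under discussion (1K vs. 4K)

class BufferedDetector:
    def __init__(self, detect_fn, prefix_size=DETECTION_PREFIX_SIZE):
        self.detect_fn = detect_fn    # stand-in for the real detector (e.g. CED)
        self.prefix_size = prefix_size
        self.buffer = bytearray()
        self.encoding = None

    def feed(self, chunk: bytes, at_eof: bool = False):
        """Accumulate network chunks; run detection exactly once, either when
        the prefix is full or when a shorter document has ended."""
        if self.encoding is None:
            self.buffer.extend(chunk)
            if len(self.buffer) >= self.prefix_size or at_eof:
                self.encoding = self.detect_fn(bytes(self.buffer[:self.prefix_size]))
        return self.encoding
```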

FWIW, Blink looks at the entire <head> for a charset meta tag, or scans up to the first 1024 bytes to find one even past <head>. There must be a historical reason behind that which I'm not aware of.

JinsukKim avatar Apr 25 '17 01:04 JinsukKim

Expecting inputs from @hsivonen @annevk, etc.

JinsukKim avatar Apr 25 '17 01:04 JinsukKim

I think Firefox uses 1024 bytes, and I think that is what we should aim for in any case where we need to resort to sniffing (it seems like that discussion isn't over yet either: when we need to sniff, and for which encodings).

The more bytes you wait for, the longer it takes for the user to see anything, especially on slower connections. (Although I guess you could still start decoding the ASCII bytes with some generic decoder, as long as you don't get the special ISO-2022-JP bytes wrong, since we should not sniff for UTF-16. (Or should we sniff for UTF-16? I believe WebKit does, but only for the first couple of bytes, so that would not matter here.))

annevk avatar Apr 25 '17 08:04 annevk

For what it's worth, the spec kind of documents this: https://html.spec.whatwg.org/multipage/syntax.html#encoding-sniffing-algorithm

Optionally prescan the byte stream to determine its encoding. The end condition is that the user agent decides that scanning further bytes would not be efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.
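As a rough illustration of the prescan idea, here is a greatly simplified Python sketch; the real algorithm in the HTML Standard also handles comments, attribute tokenization, http-equiv declarations, and XML declarations, so this regex version is only a sketch:

```python
import re

# Only look for a charset declaration within the first 1024 bytes; a real
# prescan implements the full algorithm from the HTML Standard.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

def prescan_for_charset(document_bytes: bytes, limit: int = 1024):
    match = META_CHARSET_RE.search(document_bytes[:limit])
    return match.group(1).decode('ascii') if match else None
```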

domenic avatar Apr 25 '17 14:04 domenic

I think we should do 1024 bytes for consistency with meta even though it gives relatively little opportunity for sniffing when there's no non-ASCII in title and a lot of scripts/styles. The number 1024 comes from WebKit legacy, IIRC.

Firefox had the 1024-byte limit for content-based sniffing during the Firefox 4 development cycle. However, that was known to break one site: the Japanese Planet Debian (EUC-JP, but these days properly declared). Decision-makers postulated that the one site would be only the tip of an iceberg, so I had to change the detector to run on more bytes.

So today, if the Japanese detector is enabled, it runs on the first 1024 bytes before the parse starts. Then it continues to run, and if it revises its guess during the parse, the page is re-navigated to with the newly-guessed encoding.

I guess we should gather telemetry to see how often this happens.

hsivonen avatar May 05 '17 09:05 hsivonen

Why would consistency with <meta> be a goal here? I can't think of an argument for a concrete benefit of such consistency.

<meta> is a header thing whereas the bulk of bytes revealing the encoding are going to be part of <body>. The only guaranteed-useful part of the header is <title>, but it may well end up near the end of the header, or be too short for reliable detection. So, even in theory, it makes sense to choose a different constant for encoding detection. I think the principled target for the constant should be "larger than typical header length".

Then it continues to run, and if it revises its guess during the parse, the page is re-navigated to with the newly-guessed encoding.

This sounds complex and bug-prone, so I don't think we would be willing to introduce similar behavior in Chromium. I would much rather increase the constant than resort to this.

I guess we should gather telemetry to see how often this happens.

Yes, I would be very interested in hearing the results of telemetry.

alexelias avatar May 08 '17 19:05 alexelias

Why would consistency with be <meta> be a goal here? I can't think of an argument for a concrete benefit of such consistency.

To rely on the Firefox experience that waiting for 1024 bytes without a timeout doesn't break the Web. (I.e. since Firefox 4 there have been only a couple of complaints about scripts that intentionally send content piecemeal but unintentionally lack an early meta, and therefore don't get their initial output shown.)

Also, if the buffer size for byte-pattern detection were greater than the buffer size for meta sniffing, there'd be a temptation to change meta sniffing to operate on the larger buffer too, because that would then make perfect sense from a Chromium-only perspective. But that would break the browser consensus for meta, and it would be hard to walk back if Chromium later changed its mind and decided that byte-pattern detection isn't as necessary after all as Chromium devs now suggest.

(On the topic of assessing necessity, it would be good to hear from Edge devs what exactly, if anything, they do about byte pattern sniffing considering that they don't have a manual override menu as a recourse, either.)

I would much rather increase the constant than resort to this.

Increasing the constant makes the probability of breakage go up for pages that intentionally stall the TCP stream. Adding a timeout to address that concern has the problem of the relationship between the input bytes to DOM tree becoming timing-dependent.

hsivonen avatar May 09 '17 08:05 hsivonen

Increasing the constant makes the probability of breakage go up for pages that intentionally stall the TCP stream

OK, this is the argument that makes the most sense to me for holding to 1024. Do you know why some pages do this and what signal they are waiting for to unstall the stream?

alexelias avatar May 09 '17 18:05 alexelias

I'm planning to get some stats to understand the impact of switching the size to 1024 (i.e. how many documents give different encoding guesses depending on the input size: 1024 vs. 2048, 4096, etc.).

JinsukKim avatar May 10 '17 01:05 JinsukKim

I ran the detector over a substantial amount of unlabelled documents and got the following stats:

| input size | coverage (%) |
| ---------- | ------------ |
| 1K         | 84.36        |
| 2K         | 92.86        |
| 3K         | 96.28        |
| 4K         | 98.60        |

(The remaining 1.40% are exceptions.)

84.36% of the unlabelled HTML documents return the correct text encoding when their first 1K is fed to the detector (meaning the guessed encoding remained the same even when given a bigger chunk like 2K–4K). That's quite high, but it doesn't look like a decisive number either.

JinsukKim avatar May 19 '17 04:05 JinsukKim

One sixth of detections failing seems like a high error rate, so I think this data indicates we should standardize the number to 4096.

As for the slippery-slope argument that Chromium might eventually increase the meta tag scanning limit to 4096 as well, we'd be happy to add a unit test (perhaps a web platform test?) verifying that the meta tag scan fails if the tag appears past 1024 bytes.

alexelias avatar May 19 '17 06:05 alexelias

I'd like some more detail on this detector. I assume the goal is to avoid reloading, so would we essentially stop the parser if we hit non-ASCII in the first 4k and then wait for the 4k or end-of-file to determine the encoding for the remainder of the bytes?

Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?

Does it only run as a last resort?

I also know that Henri has been exploring alternative strategies for locale-specific fallback. Rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.

annevk avatar May 19 '17 07:05 annevk

You can see the API to the CED detector in https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.h .

we essentially stop the parser if we hit non-ASCII in the first 4k and then wait for the 4k or end-of-file to determine the encoding for the remainder of the bytes? [...] Does it only run as a last resort?

My thinking (and perhaps we can spec this) is that the algorithm would go as follows (see the sketch after the list):

  • If the charset is specified in the HTTP header or in the first 1024 bytes, then we take that as gospel and not run the detector at all.
  • Failing that, we would run the content detector on the first 1024 bytes. If the detector returns is_reliable as true, we stop there and reparse just as if the meta charset had been set.
  • Failing that, we wait for 4096 bytes and run the content detector on them (without feeding them into the parser, and without populating meta_charset_hint even if it's present in these 4096 bytes). Whether or not is_reliable is true, we treat this as the true final encoding. We reparse if needed with the detected encoding for this and never run the detector on any further bytes.
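A rough Python sketch of the three steps above; `detect` stands in for a content detector such as CED and is assumed to return an (encoding, is_reliable) pair. The names and structure are illustrative, not Chromium's implementation:

```python
def choose_encoding(first_1024: bytes, first_4096: bytes,
                    http_charset, early_meta_charset, detect):
    # 1. An explicit declaration (HTTP header or <meta> in the first 1024
    #    bytes) is taken as gospel; the detector never runs.
    if http_charset or early_meta_charset:
        return http_charset or early_meta_charset
    # 2. Run the detector on the first 1024 bytes and accept a confident guess.
    encoding, is_reliable = detect(first_1024)
    if is_reliable:
        return encoding
    # 3. Otherwise wait for 4096 bytes and treat that guess as final, reliable
    #    or not; the detector never runs on any further bytes.
    encoding, _ = detect(first_4096)
    return encoding
```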

Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?

I think you meant to say UTF-7? We have the choice; we can set the ignore_7bit_mail_encodings parameter to exclude them (and I think we should, because of security concerns).

Rather than using the user's locale, use the locale inferred from the domain's TLD. It would be interesting to know how that contrasts with a detector.

I agree TLD is an important signal. The CED detector already takes in a "url" parameter of which it primarily looks at the TLD. It uses it as a tiebreaker in cases where the content encoding is ambiguous. The percentages in the data above already take the URL into consideration, so the 84% success rate represents the best we can do given the first 1024 bytes plus the TLD.
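To illustrate the tiebreaker role in general terms (not CED's actual scoring), here is a hypothetical sketch where the TLD's customary legacy encoding is preferred only when the top candidates score too close to call; the mapping and threshold are made up for the example:

```python
TLD_DEFAULTS = {'jp': 'Shift_JIS', 'ru': 'windows-1251', 'gr': 'ISO-8859-7'}

def break_tie(candidates, url_tld, margin=0.05):
    """candidates: list of (encoding, score) pairs from a content detector."""
    candidates = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(candidates) < 2:
        return candidates[0][0]
    (best, best_score), (second, second_score) = candidates[0], candidates[1]
    if best_score - second_score < margin:      # content alone is ambiguous
        preferred = TLD_DEFAULTS.get(url_tld)
        if preferred in (best, second):
            return preferred
    return best
```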

alexelias avatar May 19 '17 19:05 alexelias

We reparse if needed with the detected encoding for this and never run the detector on any further bytes.

Earlier in the sentence you said you wouldn't feed things into the parser so this wouldn't apply, right?

I think you meant to say UTF-7?

No, UTF-7 must obviously not be detected. It's not an encoding to begin with per the Encoding Standard.

It's still unclear to me what motivated the CED detector as other browsers don't have a similar complex thing. Also looking through the code of the CED detector it doesn't seem aligned with the Encoding Standard at all. https://github.com/google/compact_enc_det/blob/master/util/encodings/encodings.cc lists very different encodings and labels.

annevk avatar May 20 '17 05:05 annevk

Earlier in the sentence you said you wouldn't feed things into the parser so this wouldn't apply, right?

That's correct - there won't be reparsing. Feeding will be delayed until the encoding is detected.

Also looking through the code of the CED detector it doesn't seem aligned with the Encoding Standard at all.

CED can be (and is) configured to conform to the standard (HTML5 mode).

Users have to switch the encoding manually via a menu or plugin when viewing legacy web pages without an encoding label. Many of them have no idea what the concept of encoding is, how to get the page right, or which encoding to choose. CED tries to address that by making the best guess for those pages.

JinsukKim avatar May 22 '17 00:05 JinsukKim

There is no standard for encoding detection though. And the description of encoding labels at https://github.com/google/compact_enc_det/blob/master/util/encodings/encodings.cc#L835 goes against https://encoding.spec.whatwg.org/ and therefore also against HTML. It's not clear to me how much of that ends up being exposed to Chrome, though; maybe none of it is? But we'd really need a much more detailed description or set of test cases to know what exactly Chrome has unilaterally added to the web platform.

annevk avatar May 22 '17 11:05 annevk

This will give you an idea of how CED works in a web-standard-compliant way: https://github.com/google/compact_enc_det/blob/master/compact_enc_det/compact_enc_det.cc#L5649

It's still unclear to me what motivated the CED detector as other browsers don't have a similar complex thing.

I think the better thread for this question is https://github.com/whatwg/encoding/issues/68

JinsukKim avatar May 23 '17 00:05 JinsukKim

encoding labels goes against https://encoding.spec.whatwg.org/ and therefore also against HTML. It's not clear to me what of that ends up being exposed to Chrome though, maybe it's not?

The encoding name strings in CED aren't web-exposed in Chrome. They didn't become valid meta tag encoding names or anything.

Does that also mean we'd never detect ISO-2022-JP or UTF-16, which are ASCII-incompatible?

We are currently detecting ISO-2022-JP and UTF-16. We'll make another triage pass and come up with a minimal whitelist of necessary ones (which almost certainly shouldn't contain UTF-16 at least).

It's still unclear to me what motivated the CED detector as other browsers don't have a similar complex thing.

What motivated it is that the encoding selection menu other browsers have isn't a workable solution on phones where UI space is limited. Also, Firefox, IE and older versions of Chrome do have autodetectors, they're just not enabled by default.

alexelias avatar May 23 '17 01:05 alexelias

What motivated it is that the encoding selection menu other browsers have isn't a workable solution on phones where UI space is limited.

AFAICT, Edge has no such menu. What detection, if any, does Edge have? Does Safari on iOS have a menu? (I don't have an iDevice at hand to test with.) What detection does Safari (Mac and iOS) have?

hsivonen avatar May 23 '17 07:05 hsivonen

Safari on iOS doesn't have a menu as far as I know. It appears they use the ICU detector, though I haven't written tests to verify: https://github.com/WebKit/webkit/blob/master/Source/WebCore/platform/text/TextEncodingDetectorICU.cpp. (It seems that's a contribution from Google...)

annevk avatar May 23 '17 09:05 annevk

In the non-file: URL case for unlabeled content, since Firefox 73, when the meta prescan fails at 1024 bytes, Firefox feeds those 1024 bytes to chardetng, asks chardetng to make a guess, and then starts parsing HTML based on that guess. All the rest of the bytes are still fed to chardetng in addition to being fed to the HTML parser. When EOF is reached, chardetng guesses again. If the guess differs from the previous guess, the page is reloaded with the second guess applied.

chardetng does not run on the .in, .lk, and .jp TLDs. The former two don't trigger a detector at all, and .jp uses a special-purpose detector discussed in another issue.
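A Python sketch of the flow described in the two paragraphs above, not Firefox's actual code; `Detector` stands in for chardetng, and all names are illustrative:

```python
NO_DETECTION_TLDS = {'in', 'lk'}   # no detector at all
SPECIAL_CASE_TLDS = {'jp'}         # dedicated Japanese detector instead

def load_unlabeled(stream, tld, parse_with, reload_with, Detector):
    if tld in NO_DETECTION_TLDS or tld in SPECIAL_CASE_TLDS:
        return  # handled elsewhere
    detector = Detector()
    head = stream.read(1024)
    detector.feed(head)
    first_guess = detector.guess()        # guess after the failed meta prescan
    parse_with(first_guess, head)         # start the HTML parse on that guess
    while chunk := stream.read(4096):
        detector.feed(chunk)              # the rest of the bytes still feed the detector
        parse_with(first_guess, chunk)    # ... and the HTML parser
    final_guess = detector.guess()        # guess again at EOF
    if final_guess != first_guess:
        reload_with(final_guess)          # reload with the revised guess applied
```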

As for Safari, I didn't find any evidence of the ICU-based detector mentioned in the previous comment being in use in Safari on Mac. However, the Japanese-specific detector in WebKit appears to be in effect when Safari's UI language is set to Japanese.

hsivonen avatar Jun 08 '20 12:06 hsivonen