html5ever
Implement character encoding detection / conversion
See the HTML spec and the WHATWG Encoding spec. This also entails noticing <meta charset=...> and <meta http-equiv="Content-Type"> as we parse.
rust-encoding implements the Encoding spec. We already use it in Servo for CSS.
HTML's "prescan the byte stream to determine its encoding" algorithm (the one that looks for <meta charset=...> and <meta http-equiv="Content-Type">) works on the byte stream (as the name says), before tokenization starts.
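To make the shape of that prescan concrete, here is a loose toy approximation in Python (the `prescan` function is illustrative, not the spec algorithm: the real one handles quoting, `http-equiv`/`content` parsing, and encoding-label normalization much more carefully):

```python
import re

def prescan(head_bytes):
    """Toy version of "prescan a byte stream to determine its encoding".

    Looks for a charset declaration in the first 1024 bytes only,
    returning an encoding label or None. A loose approximation; the
    spec's algorithm is considerably more involved.
    """
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([a-zA-Z0-9_-]+)',
                  head_bytes[:1024], re.IGNORECASE)
    return m.group(1).decode('ascii').lower() if m else None
```

Note the hard 1024-byte cap: a declaration past that boundary is invisible to the prescan, which is exactly the situation the discussion below is about.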
I believe html5ever should implement the encoding sniffing algorithm (of which the prescan is part) independently of the tokenizer and parser (although they might share code internally), so that the overall parsing is, conceptually:
- Determine (sniff) the encoding from bytes and metadata
- Decode bytes to Unicode text (using rust-encoding) with the encoding chosen at step 1
- Tokenize text to tokens
- Parse tokens into a tree
(It would also be nice to have some kind of Unicode stream: https://github.com/rust-lang/rfcs/pull/57)
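The four steps could hang together roughly as follows (a conceptual sketch in Python rather than Rust for brevity; in the real thing rust-encoding would do the decoding, and `tokenize` and `build_tree` here are hypothetical stand-ins for the tokenizer and tree builder):

```python
import codecs

def sniff_encoding(head_bytes, transport_charset=None):
    """Step 1: pick an encoding from metadata, a BOM, or a default.

    Hypothetical sketch: the real step would also run the <meta>
    prescan over head_bytes before falling back to the default.
    """
    if transport_charset:          # e.g. from a Content-Type header
        return transport_charset
    if head_bytes.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    if head_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return 'windows-1252'          # the spec's fallback for many locales

def parse_html(raw_bytes, transport_charset=None):
    encoding = sniff_encoding(raw_bytes[:1024], transport_charset)  # 1. sniff
    text = raw_bytes.decode(encoding, errors='replace')             # 2. decode
    tokens = tokenize(text)      # 3. tokenize (hypothetical function)
    return build_tree(tokens)    # 4. tree construction (hypothetical)
```

The point of the layering is that steps 3 and 4 only ever see Unicode text, and all byte-level concerns live in steps 1 and 2.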
Per the spec, the tree builder can also reload the document with a different encoding when it encounters a relevant <meta> tag: https://html.spec.whatwg.org/multipage/#parsing-main-inhead:change-the-encoding
This seems annoying, given that html5ever is not necessarily running in Servo. @hsivonen suggests on #whatwg @ Freenode that we might get away with ignoring that part of the spec:
15:13 SimonSapin hsivonen: Do you know if there is a way around keeping unbounded amount of input in memory in case the parser decides to "change the encoding", when parsing HTML from bytes?
15:14 hsivonen SimonSapin: there intentionally is not supposed to be one
15:14 SimonSapin and, to test if browsers do it, are you aware of a case where "prescan a byte stream to determine its encoding" would fail to find an encoding when given 1024 bytes, but tree construction would still "change the encoding" based on a meta start tag in the first 1024 bytes?
15:16 SimonSapin hsivonen: intentionally? Why?
15:19 hsivonen SimonSapin: to answer the previous question: after the 1024-byte boundary, the parser instance commits to one encoding. However, a late meta or a Japanese/Russian/Ukrainian detector can still trigger a reload with a different encoding
15:20 hsivonen SimonSapin: in which case a new parser instance starts a new parse
15:20 SimonSapin hsivonen: I’m trying to decide what to do in html5ever, which doesn’t necessarily have a notion of reload
15:21 hsivonen SimonSapin: IIRC, WebKit/Blink doesn't support late triggering a reload. I don't know if they do it for their Japanese detection
15:22 hsivonen SimonSapin: as for "intentionally", the intention of the 1024-byte boundary is precisely to make sure that the parser doesn't keep buffering forever and not produce any output
15:22 hsivonen SimonSapin: I suggest committing to an encoding at the latest when you've seen 1024 bytes
15:23 hsivonen SimonSapin: I can't recall why I implemented the late thing in the new parser
15:23 hsivonen SimonSapin: initially, I make the detectors see at most 1024 bytes so that they couldn't trigger a reload
15:23 hsivonen SimonSapin: but that broke Japanese Planet Debian
15:23 hsivonen SimonSapin: and people get really nervous if you break a Japanese site
15:24 hsivonen so...
15:24 hsivonen Japanese Planet Debian has since been fixed
15:25 hsivonen it's quite possible that we could get rid of the Russian and Ukrainian detectors and limit the Japanese detector to 1024 bytes and the sky wouldn't fall
15:25 SimonSapin hsivonen: I see, thanks. So only run the byte-based prescanner, or can tree construction find meta tags that the pre-scanner doesn’t?
15:26 SimonSapin hsivonen: I’m referring to https://html.spec.whatwg.org/mult...main-inhead:change-the-encoding
15:26 hsivonen SimonSapin: I suggest only running the prescanner. (but I bet it's possible to construct something that the prescanner doesn't see but the tree builder sees)
15:27 annevk hsivonen: WebKit only has a Japanese detector iirc
15:27 hsivonen SimonSapin: oh. the reason I added support for late may be that the spec said so!
15:27 hsivonen SimonSapin: but IIRC, WebKit doesn't honor the spec there
15:27 annevk we should fix the spec
15:28 hsivonen SimonSapin: it quite possible that the spec says so because the old parser in Gecko behaved like that
15:28 hsivonen SimonSapin: I'm not sure what IE did at the time the spec was written, but my vague recollection is that it supported late
15:30 hsivonen hmm. an obvious way to create a seen by the tree builder but not by the prescanner is, of course, document.write
15:30 jgraham This wasn't one of the cases where Hixie was concerned about the security impact of an attacker that could cause early termination of the byte stream?
15:30 hsivonen jgraham: I don't recall this topic co-occurring with that topic
15:31 * jgraham isn't quite sure what such an attack would look like given incremental parsing
15:31 hsivonen jgraham: that was about comments and scripts
15:32 jgraham OK
15:32 hsivonen SimonSapin: so I suggest 1) implementing just the prescan until 1024 bytes, 2) being aware that you might end up having to implement something that allows you to signal to the browsing context to reload if # 1 Breaks the Web, 3) researching if old IE actually supports late and if it doesn't, filing a spec bug
15:33 hsivonen it's possible that a spec bug is warranted just based on the success of WebKit, though
15:33 SimonSapin hsivonen: Chrome doesn’t reaload, it switches encodings mid-stream: https://gist.github.com/anonymous/addad9f51781a6cd2cf9
15:33 SimonSapin Firefox reloads
15:33 hsivonen SimonSapin: whoa!
15:34 SimonSapin Firefox makes two HTTP requests
15:34 hsivonen SimonSapin: Firefox making two requests is expected
15:34 hsivonen SimonSapin: the Chrome behavior is news to me
15:35 SimonSapin Chrome 46 dev, don’t have Release at hand
15:37 hsivonen annevk: my current assumption is that the Russian and Ukrainian detectors misfiring is a greater problem than the problems they fix, but I don't have proof
15:38 hsivonen annevk: I want to get rid of those two detectors but I feel I need something more concrete than a guess that they have negative utility
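hsivonen's suggestion (commit to an encoding once 1024 bytes have been seen) is attractive because it bounds the buffering. A rough sketch of such a driving loop, with a hypothetical `prescan` callable standing in for the spec algorithm:

```python
def decode_stream(chunks, prescan, default='windows-1252'):
    """Buffer input bytes only until an encoding is committed.

    `chunks` is an iterable of byte strings; `prescan` is a hypothetical
    callable returning an encoding label or None. Yields decoded text.
    For simplicity this decodes chunk by chunk; a real implementation
    would use an incremental decoder so that multi-byte sequences split
    across chunk boundaries decode correctly.
    """
    buffered = b''
    encoding = None
    for chunk in chunks:
        if encoding is None:
            buffered += chunk
            encoding = prescan(buffered[:1024])
            if encoding is None and len(buffered) < 1024:
                continue                    # still under the 1024-byte limit
            encoding = encoding or default  # commit; never buffer again
            chunk, buffered = buffered, b''
        yield chunk.decode(encoding, errors='replace')
    if buffered:  # input ended before the 1024-byte limit was reached
        yield buffered.decode(encoding or default, errors='replace')
```

With this scheme the parser holds at most 1024 bytes before it starts producing output, which is precisely the property the 1024-byte boundary was designed to guarantee.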
In the test case below, Chrome 46 dev switches encodings mid-stream (but not exactly at the point of the <meta> tag) rather than reloading: it renders è, then èèé. Firefox 39 renders è, then èè, then reloads (makes a new HTTP request), renders èè, then èèé.
import time
from wsgiref.simple_server import make_server

def simple_app(environ, start_response):
    status = '200 OK'
    # No charset in the Content-Type header: the browser must sniff.
    headers = [('Content-type', 'text/html')]
    start_response(status, headers)
    # First chunk: 1024 bytes of padding, no encoding declaration.
    yield u"""
<!DOCTYPE html>
<!-- 1024 bytes:
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
-->
è""".encode('utf8')
    # Delay, then send a <meta charset> that arrives after the
    # 1024-byte prescan boundary.
    time.sleep(2)
    yield u"è<meta charset=utf-8>é".encode('utf8')

httpd = make_server('', 8000, simple_app)
print("Listening on port 8000....")
httpd.serve_forever()
@SimonSapin what happens when it's in response to a POST request?
#188 added a from_bytes method to html5ever::driver::Parser that does some of this (pick an encoding from Content-Type charset or a BOM, and use rust-encoding).
It should be possible to modify the html5ever::driver::detect_encoding function to do the rest of this (in particular scanning for <meta> elements) without further API changes.
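For the Content-Type part of that, extracting the charset parameter from the header value is mostly string plumbing. An illustrative Python sketch (the function name is mine; the full header grammar allows quoting and whitespace variants that a production parser must handle, and the resulting label still has to be validated against the Encoding Standard):

```python
def charset_from_content_type(content_type):
    """Pull the charset parameter out of a Content-Type header value.

    E.g. 'text/html; charset=big5' -> 'big5'. Simplified: real header
    parsing handles quoting and more whitespace variants, and the
    label must then be matched against the Encoding Standard's list.
    """
    for param in content_type.split(';')[1:]:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            return value.strip().strip('"\'').lower() or None
    return None
```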
You mentioned
pick an encoding from Content-Type charset
But I tried
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=big5" />
I still got diamonds

@shinglyu Servo doesn't process meta tags yet.
@jdm: I see, so the parser has the ability to do so but it's not hooked up in servo yet?
I meant the Content-Type HTTP header. And yes, using that API in Servo never landed: tests failed because </script> handling was incorrect.
@SimonSapin Is https://github.com/servo/servo/pull/9730 what you were talking about?
Yes.
I wrote a test server that sends Content-Type: text/html; charset=big5 in the HTTP response header (see the screenshot from the WebKit devtools).

But servo still renders it incorrectly.
A dumb question: how do I break into the html5ever code using gdb from Servo? I tried b driver.rs:<line no> but it didn't work. I want to check whether the encoding is correct; is there another place in Servo where I can break to check the encoding parsing? (It looks like the document.characterSet() API is hardcoded to UTF-8?)
No need for a debugger, I can confirm that current Servo always uses UTF-8 for HTML. (https://github.com/servo/servo/pull/9730 never landed.)
https://github.com/servo/servo/blob/1aacbce053/components/script/dom/servoparser/mod.rs#L346-L365
What happened with #9730 is that I first tried to build an abstraction in html5ever with a nice simplified API, and then realized it didn’t fit what Servo needs. So this time around I suggest first implementing the encoding sniffing algorithm in Servo, and then later see what kind of API we can build to move it into html5ever.
And we should use encoding_rs instead of rust-encoding, now that Gecko ships it. I have some ideas for adding encoding_rs support in Tendril.