Remove ISO-8859-1 charset fallback
For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:
The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.
The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:
In accordance with option (a) above, registrations for "text/*" media types that can transport charset information inside the corresponding payloads (such as "text/html" and "text/xml") SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the "charset" parameter value and the value specified in the payload disagree.
I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.
I propose the following changes:
- Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this; a rough sketch of the change follows this list.
- Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just a "might be nice".
- Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.
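For the first item, here is a hedged sketch of what the change might look like. The thread below quotes the relevant part of requests.utils.get_encoding_from_headers, which currently returns ISO-8859-1 for any text/* type without a charset; removing that branch would leave something roughly like this (illustrative only, not a drop-in patch):

import cgi

def get_encoding_from_headers(headers):
    # Return the charset declared in the Content-Type header, or None.
    content_type = headers.get('content-type')
    if not content_type:
        return None
    content_type, params = cgi.parse_header(content_type)
    if 'charset' in params:
        return params['charset'].strip("'\"")
    # No blanket ISO-8859-1 fallback for text/* any more; callers must decide.
    return None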
Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.
I agree that we should remove the fallback. I do wonder how we should handle Response#text in the event that the server does not specify a charset in any way, including the meta tags of the body. Should we disable Response#text conditionally, either through an exception or something else? Not doing so will rely more heavily on chardet, which I have decreasing confidence in given the number of new encodings it does not detect.
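To make concrete the kind of caller-side guard being discussed (a hedged sketch; the guard and names are only illustrative, not a proposed API): only trust Response#text when the server actually declared a charset, and otherwise be explicit about the chardet fallback via apparent_encoding:

import requests

r = requests.get('http://example.com/')
if 'charset=' in r.headers.get('Content-Type', ''):
    body = r.text  # the server-declared charset will be used
else:
    # explicit chardet fallback; the guess may well be wrong
    body = r.content.decode(r.apparent_encoding or 'utf-8')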
Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just a "might be nice".
Given that this is not guaranteed to be included in requests, I'm fine with adding it to the toolbelt. That said, I'm also okay with making this a separate package so users can just use that without having to install the rest of the toolbelt. That, however, is a separate discussion.
Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.
We still have a method to do this in utils, right? I don't like the idea in the slightest, but it won't cost extra effort. That said, we have to make sure any charset provided in the media type takes precedence.
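Assuming the helpers get_encoding_from_headers and get_encodings_from_content are still available in requests.utils, that precedence could look roughly like this (a sketch, not a patch):

from requests.utils import get_encoding_from_headers, get_encodings_from_content

def pick_encoding(response):
    # A charset given explicitly in the media type always wins.
    if 'charset=' in response.headers.get('Content-Type', ''):
        return get_encoding_from_headers(response.headers)
    # Otherwise fall back to whatever the content itself declares, if anything.
    declared = get_encodings_from_content(response.content.decode('ISO-8859-1'))
    return declared[0] if declared else None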
Upon reading more of RFC 7231, specifically Section 3.1.1.5, I think the third option should ideally be opt-in, not opt-out. My specific reasoning for this is:
Clients that do so [examine a payload's content] risk drawing incorrect conclusions, which might expose additional security risks (e.g., "privilege escalation").
Taken from the same section I linked above.
Agreed from a correctness perspective, but I wonder if @kennethreitz is going to want it from a usability perspective.
I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.
If only 1.) is going to be implemented, I guess r.encoding = None and requests will use chardet?
Correct. =)
That's how it works now, so I don't think we'd change that.
I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.
Look up all the UTF-7 XSS attacks. (None work in any current browser, as everyone simply dropped UTF-7 sniffing — and most UTF-7 support entirely — to avoid making sites thus vulnerable.)
In a very real sense, using chardet is worse than option three above — it will draw different conclusions from those of any implementation following a specification that defines how to sniff the content (and both HTML and XML provide such a specification). The only safe thing to do, if you don't know how to determine the charset, is to not try. You can probably support the vast majority of users by implementing the (standardised) HTML and XML character encoding detection algorithms.
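In practice the simplest way to get those algorithms is to hand the raw bytes to a parser that already implements them rather than guessing; a minimal sketch with html5lib, which runs the spec's BOM check and <meta> prescan itself:

import html5lib
import requests

r = requests.get('http://example.com/')
# With no transport encoding given, html5lib applies the standardised HTML
# sniffing (BOM check, then the <meta> prescan) to the raw bytes itself.
tree = html5lib.parse(r.content)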
I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.
Hmm, the registration (which is included inline in the HTML spec) contradicts the HTML spec itself — per the HTML spec, UTF-8/UTF-16 BOMs are given precedence over the MIME type. I've filed bug 26100 for that.
Hmmm....
http://html5.org/tools/web-apps-tracker?from=8723&to=8724 fixes the IANA registration in the HTML spec to match the body of it. It now reads:
The charset parameter may be provided to specify the document's character encoding, overriding any character encoding declarations in the document other than a Byte Order Mark (BOM). The parameter's value must be one of the labels of the character encoding used to serialise the file. [ENCODING]
Hello,
I just got hit by this reading XML files which were encoded as UTF-8. On OS X the content type was being returned as 'application/xml', but on Linux it was set to 'text/xml', so the requests lib assumed its default encoding of 'ISO-8859-1' because 'text' was in the content type. Most XML files will be encoded in UTF-8, so setting the encoding to 'ISO-8859-1' for 'text/xml' content is surely wrong, as discussed.
Just because an RFC specifies something, doesn't mean we should do it. Especially if it makes the code crazy.
I believe that our current behavior is elegant and actually works quite effectively. Is this not the case?
As always, what does Chrome do?
Chrome introspects the HTML, a position we've always decided we don't want to do. We could optionally add support for hooks to do content-type specific encoding heuristics if we wanted. We already kinda do that for JSON, it might not hurt to do it more generally for other content types.
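Purely as a hypothetical sketch of what such hooks might look like — nothing like this exists in requests today, and every name below is invented:

# Hypothetical registry of per-content-type encoding heuristics.
ENCODING_HOOKS = {}

def encoding_hook(content_type):
    def register(func):
        ENCODING_HOOKS[content_type] = func
        return func
    return register

@encoding_hook('application/json')
def json_encoding(body):
    # JSON on the wire is UTF-8/16/32; UTF-8 in almost all real traffic.
    return 'utf-8'

def guess_encoding(content_type, body):
    hook = ENCODING_HOOKS.get(content_type)
    return hook(body) if hook else None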
Grumble.
Note the HTML case is even worse than that, really: the pre-scan in browsers just looks at the first 1024 bytes or thereabouts, and the parser itself can then change it.
I still feel like our current behavior is a great implementation.
Ok. =)
How about I knock up a proposal for smarter encoding heuristics, that takes certain known content-types and attempts to gently introspect them for their encodings. At least then we'll have something concrete to discuss.
@Lukasa @kennethreitz
Hey guys, for the time being, I don't think we have an obvious solution yet, but can we at least make this ISO-8859-1 optional?
if 'text' in content_type:
    return 'ISO-8859-1'
This looks way too brutal. Some parameters like Session(auto_encoding=False) would be nice. What do you guys think?
@est ISO-8859-1 is optional, you can simply set response.encoding yourself. It's a one-line change. =)
@Lukasa But you can't determine whether the initial response.encoding came from the Content-Type header or whether it's the ISO-8859-1 fallback, which means if you want to avoid the fallback you have to start parsing the Content-Type header yourself, and that's quite a lot of extra complexity all of a sudden.
@gsnedders Sure you can. if 'charset' in response.headers['Content-Type'].
While yes, that works under the assumption the other party is doing something sane, it doesn't work in the face of madness and something like text/html; foo=charset.
@gsnedders Try if 'charset=' in response.headers['Content-Type']. At this point we're out of 'crazy' and into 'invalid formatting'.
@Lukasa Uh, I thought there was whitespace (or rather what 2616 had as implicit *LWS) allowed around the equals; apparently not.
The grammar appears to be:
media-type = type "/" subtype *( OWS ";" OWS parameter )
type = token
subtype = token
The type/subtype MAY be followed by parameters in the form of name=value pairs.
parameter = token "=" ( token / quoted-string )
So I guess the only issue here is something like text/html; foo="charset=bar".
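For what it's worth, the stdlib's Content-Type parser copes with that quoted-parameter edge case (it treats foo="charset=bar" as a parameter named foo), so a sketch of a more robust check than the substring test might be:

import cgi
import requests

r = requests.get('http://example.com/')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
charset = params.get('charset')  # None unless a real charset parameter was sent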
FWIW, html5lib's API changes are going to make the correct behaviour with requests require something like:
import html5lib
import requests
r = requests.get('https://api.github.com/events')
e = r.encoding if 'charset=' in r.headers['Content-Type'] else None
tree = html5lib.parse(r.content, transport_encoding=e)
That seems reasonable. =)
FWIW, in requests 3.0.0 we'll reconsider this, or at least consider adding some way of working out what decision we made.
Interesting discussion.
I still feel like our current behavior is a great implementation.
If I may, a counterexample. I use requests to extract the <title> from a requested document, and here is facebook.com.
>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> r.encoding
'ISO-8859-1'
>>> import re
>>> m = re.search('<title[^>]*>\s*(.+?)\s*<\/title>', r.text, re.IGNORECASE|re.MULTILINE)
>>> m.group(1)
u'Mathias Sundin - M\xc3\xa5nga roliga samtal mellan Tony Blair och... | Facebook'
>>> r.headers['Content-Type']
'text/html'
Apparently they don't specify the encoding in their headers. But they do in the HTML (<meta charset="utf-8" />, full example at https://gist.github.com/dentarg/f13ef7cc0ce55753faf6).
As mentioned in #2161, "requests aren't a HTML library", and I can understand that point of view. Perhaps I should be looking into parsing the HTML with some library that also can take the specified encoding into consideration.
Thank you for your work on requests. I'm looking forward to 3.0.0.
@dentarg That's not really a counterexample: it's an example of why this system works.
The guidance from the IETF is that for all text/* encodings, one of the following things MUST be true:
- the peer MUST send a charset in the content-type
- the content MUST carry an encoding specifier in it (HTML, XML)
Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag. Anyone working with HTML really should go looking for that tag, because servers almost never correctly advertise the content type of the HTML they deliver, but the HTML itself is usually (though sadly not always) right.
The behaviour requests has right now is probably less prone to failure with HTML than the one proposed for 3.0.0, and part of me does wonder if we should try to solve this more by documentation than by code change.
Yes, perhaps documentation is the way forward.
I can share my experience. I'm a very new user of requests, and I haven't looked at all the documentation for requests, but I have looked at some of it. The requests website has this in the code snippet on the front page:
>>> r.encoding
'utf-8'
That information, combined with me somehow (maybe from browsing issues here on GitHub) finding out that requests uses chardet, gave me the impression that requests would solve the charset/encoding problem for me "all the way".
Once I found the right documentation, it was straightforward to work around the limitations with another library. I can fully understand that requests just wants to be the HTTP parser, not the HTML parser.
All that said, let me share some short examples where I think the ISO-8859-1 fallback might cause some unexpected behavior.
>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> from bs4 import BeautifulSoup
You can't use r.text:
>>> print BeautifulSoup(r.text, "html5lib", from_encoding="").title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.text, "html5lib", from_encoding=r.encoding).title.text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 215, in __init__
self._feed()
File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 239, in _feed
self.builder.feed(self.markup)
File "/usr/local/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 50, in feed
doc = parser.parse(markup, encoding=self.user_specified_encoding)
File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 236, in parse
parseMeta=parseMeta, useChardet=useChardet)
File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 89, in _parse
parser=self, **kwargs)
File "/usr/local/lib/python2.7/site-packages/html5lib/tokenizer.py", line 40, in __init__
self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
File "/usr/local/lib/python2.7/site-packages/html5lib/inputstream.py", line 144, in HTMLInputStream
raise TypeError("Cannot explicitly set an encoding with a unicode string")
TypeError: Cannot explicitly set an encoding with a unicode string
You need to use r.content, but if requests has fallen back to ISO-8859-1, r.encoding will cause trouble:
>>> print BeautifulSoup(r.content, "html5lib", from_encoding=r.encoding).title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook
Working as intended:
>>> print BeautifulSoup(r.content, "html5lib", from_encoding="").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.content, "html5lib", from_encoding="utf-8").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
So here we hit a problem, which is that we can't really document the way Requests interacts with every possible tool you may want to use it with: there are just too many possibilities! So instead we try to provide general documentation.
The best advice I can give is that the more specific the tool, the more likely it can handle bytes as an input and DTRT with them. BeautifulSoup is a HTML tool and so can almost certainly take a stream of arbitrary bytes and find the relevant meta tag (and indeed, it can), so you can just pass it r.content. The same would be true if you were passing it to an XML library, or to a JSON library, or to anything else like that.
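To illustrate the same "give the specific tool the raw bytes" point for an XML library (the URL below is just a placeholder):

import xml.etree.ElementTree as ET
import requests

r = requests.get('http://example.com/data.xml')
# ElementTree takes the raw bytes and honours the XML declaration/BOM itself.
root = ET.fromstring(r.content)

For JSON, r.json() already does the equivalent inside requests.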
Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag.
Sadly, that isn't true, because of UTF-16. In both cases, parsers should be able to handle BOMs. Furthermore, a conforming XML parser should be able to handle a UTF-16 encoded <?xml version="1.0" encoding="UTF-16"?> with no preceding BOM.
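To make the BOM point concrete, here is a minimal, illustrative sketch of the kind of check that would have to run before any text/* fallback:

import codecs

def bom_encoding(data):
    # Return a codec name if the payload starts with a BOM, otherwise None.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'  # this codec consumes the BOM and picks the endianness
    return None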