
Remove ISO-8859-1 charset fallback

Lukasa opened this issue 11 years ago • 43 comments

For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.

The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:

In accordance with option (a) above, registrations for "text/*" media types that can transport charset information inside the corresponding payloads (such as "text/html" and "text/xml") SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the "charset" parameter value and the value specified in the payload disagree.

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

I propose the following changes:

  1. Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.
  2. Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just is a "might be nice".
  3. Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.
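
To illustrate change 1, here's a minimal sketch of what a header-inspection helper looks like without the fallback. This is a simplified stand-in modelled on `requests.utils.get_encoding_from_headers`, not the actual implementation:

```python
# Simplified stand-in modelled on requests.utils.get_encoding_from_headers;
# hypothetical code, not the actual requests implementation.
def get_encoding_from_headers(headers):
    """Return the charset named in a Content-Type header, or None."""
    content_type = headers.get('content-type')
    if not content_type:
        return None
    # Look through the media type parameters for an explicit charset.
    for param in content_type.split(';')[1:]:
        param = param.strip()
        if param.lower().startswith('charset='):
            return param.split('=', 1)[1].strip('\'"')
    # Change 1: no ISO-8859-1 fallback for text/* -- return None and
    # let the caller decide what to do (e.g. fall back to chardet).
    return None
```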

Lukasa avatar Jun 07 '14 06:06 Lukasa

Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.

I agree that we should remove the fallback. I do wonder how we should handle Response#text in the event that the server does not specify a charset in any way, including in the meta tags of the body. Should we conditionally disable Response#text, either by raising an exception or by some other means? Not doing so will rely more heavily on chardet, in which I have decreasing confidence given the number of newer encodings it fails to detect.

Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just is a "might be nice".

Given that this is not guaranteed to be included in requests, I'm fine with adding it to the toolbelt. That said, I'm also okay with making this a separate package so users can use it without having to install the rest of the toolbelt. That, however, is a separate discussion.

Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.

We still have a method to do this in utils, right? I don't like the idea in the slightest, but it won't cost extra effort. That said, we have to make sure any charset provided in the media type takes precedence.
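
For reference, the helper in question does roughly this: regex-scan the document for charset declarations. A rough sketch of the idea (not the exact `requests.utils` code):

```python
import re

# Rough sketch in the spirit of requests.utils.get_encodings_from_content;
# not the exact requests code.
def get_encodings_from_content(content):
    """Return charsets declared inside an HTML/XML document prefix."""
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
```

Per the precedence point above, any charset parameter in the Content-Type header would still need to be consulted before these results.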

sigmavirus24 avatar Jun 08 '14 14:06 sigmavirus24

Upon reading more of RFC 7231, specifically Section 3.1.1.5, I think the third option should ideally be opt-in, not opt-out. My specific reasoning for this is:

Clients that do so [examine a payload's content] risk drawing incorrect conclusions, which might expose additional security risks (e.g., "privilege escalation").

Taken from the same section I linked above.

sigmavirus24 avatar Jun 08 '14 16:06 sigmavirus24

Agreed from a correctness perspective, but I wonder if @kennethreitz is going to want it from a usability perspective.

Lukasa avatar Jun 08 '14 16:06 Lukasa

I wonder how easy it would be to stand up a simple app that demonstrates the security risks involved, to give a concrete example of why not to do it.

sigmavirus24 avatar Jun 08 '14 16:06 sigmavirus24

If only 1) is going to be implemented, I guess r.encoding will be None and requests will use chardet?

untitaker avatar Jun 12 '14 12:06 untitaker

Correct. =)

Lukasa avatar Jun 12 '14 13:06 Lukasa

That's how it works now, so I don't think we'd change that.

sigmavirus24 avatar Jun 12 '14 13:06 sigmavirus24

I wonder how easy it would be to prop up a simple app to demonstrate the security risks involved to give a concrete example why not to do it.

Look up all the UTF-7 XSS attacks. (None work in any current browser, as everyone simply dropped UTF-7 sniffing — and most UTF-7 support entirely — to avoid making sites thus vulnerable.)

In a very real sense, using chardet is worse than option three above: it will reach different conclusions than any implementation following a specification that defines how to sniff the content (and both HTML and XML provide such a specification). The only safe thing to do, if you don't know how to determine the charset, is not to try at all. You can probably support the vast majority of users by implementing the (standardised) HTML and XML character encoding detection algorithms.
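
For illustration, the first stage both standardised algorithms share is a simple BOM check; a minimal sketch (the full algorithms go on to pre-scan for `<meta>` tags or the XML declaration):

```python
import codecs

def sniff_bom(data):
    """Return the encoding implied by a leading byte order mark, or None.

    Only the first stage of the HTML/XML detection algorithms; sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with the
    # UTF-16-LE BOM's two bytes.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return None
```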

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.

Hmm, the registration (which is included inline in the HTML spec) contradicts the HTML spec itself — per the HTML spec, UTF-8/UTF-16 BOMs are given precedence over the MIME type. I've filed bug 26100 for that.

gsnedders avatar Jun 15 '14 23:06 gsnedders

Hmmm....

kennethreitz avatar Jul 10 '14 20:07 kennethreitz

http://html5.org/tools/web-apps-tracker?from=8723&to=8724 fixes the IANA registration in the HTML spec to match the body of it. It now reads:

The charset parameter may be provided to specify the document's character encoding, overriding any character encoding declarations in the document other than a Byte Order Mark (BOM). The parameter's value must be one of the labels of the character encoding used to serialise the file. [ENCODING]

gsnedders avatar Sep 01 '14 14:09 gsnedders

Hello,

I just got hit by this while reading XML files encoded as UTF-8. On OS X the content type was returned as 'application/xml', but on Linux it was 'text/xml', so requests assumed its default encoding of ISO-8859-1 because 'text' was in the content type. Most XML files are encoded in UTF-8, so defaulting to ISO-8859-1 for 'text/xml' content is surely wrong, as discussed.

erowan avatar Feb 25 '15 17:02 erowan

Just because an RFC specifies something, doesn't mean we should do it. Especially if it makes the code crazy.

I believe that our current behavior is elegant and actually works quite effectively. Is this not the case?

As always, what does Chrome do?

kennethreitz avatar Mar 05 '15 11:03 kennethreitz

Chrome introspects the HTML, which is something we've always decided we don't want to do. We could optionally add support for hooks to do content-type-specific encoding heuristics if we wanted. We already kinda do that for JSON; it might not hurt to do it more generally for other content types.
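
For the curious, the JSON case works because RFC 4627 JSON text begins with two ASCII characters, so the positions of NUL bytes in the first four octets reveal the UTF flavour. A simplified sketch of that idea (requests has its own `guess_json_utf` helper; this is not its exact code):

```python
import codecs

def guess_json_utf(data):
    """Infer a JSON payload's UTF flavour from its first four bytes.

    Simplified sketch of the RFC 4627 null-byte trick, not the exact
    requests helper.
    """
    sample = data[:4]
    if sample in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
        return 'utf-32'
    if sample[:3] == codecs.BOM_UTF8:
        return 'utf-8-sig'
    if sample[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'
    nullcount = sample.count(b'\x00')
    if nullcount == 0:
        return 'utf-8'
    if nullcount == 2:
        if sample[::2] == b'\x00\x00':   # NULs in even positions
            return 'utf-16-be'
        if sample[1::2] == b'\x00\x00':  # NULs in odd positions
            return 'utf-16-le'
    if nullcount == 3:
        if sample[:3] == b'\x00\x00\x00':
            return 'utf-32-be'
        if sample[1:] == b'\x00\x00\x00':
            return 'utf-32-le'
    return None
```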

Lukasa avatar Mar 05 '15 11:03 Lukasa

Grumble.

kennethreitz avatar Mar 06 '15 13:03 kennethreitz

Note the HTML case is even worse than that, really, because the pre-scan in browsers just looks at the first 1024 bytes or thereabouts, and the parser itself can then change the encoding.

gsnedders avatar Mar 06 '15 14:03 gsnedders

I still feel like our current behavior is a great implementation.

kennethreitz avatar Mar 09 '15 00:03 kennethreitz

Ok. =)

How about I knock up a proposal for smarter encoding heuristics, that takes certain known content-types and attempts to gently introspect them for their encodings. At least then we'll have something concrete to discuss.

Lukasa avatar Mar 09 '15 09:03 Lukasa

@Lukasa @kennethreitz Hey guys, for the time being, I don't think we have an obvious solution yet, but can we at least make this ISO-8859-1 fallback optional?

    if 'text' in content_type:
        return 'ISO-8859-1'

This looks way too brutal. Some parameters like Session(auto_encoding=False) would be nice. What do you guys think?

est avatar Nov 24 '15 01:11 est

@est ISO-8859-1 is optional, you can simply set response.encoding yourself. It's a one-line change. =)

Lukasa avatar Nov 24 '15 08:11 Lukasa

@Lukasa But you can't determine whether the initial response.encoding came from the Content-Type header or whether it's the ISO-8859-1 fallback, which means if you want to avoid the fallback you have to start parsing the Content-Type header yourself, and that's quite a lot of extra complexity all of a sudden.

gsnedders avatar Dec 15 '15 17:12 gsnedders

@gsnedders Sure you can. if 'charset' in response.headers['Content-Type'].

Lukasa avatar Dec 15 '15 17:12 Lukasa

While yes, that works under the assumption that the other party is doing something sane, it doesn't work in the face of madness, like text/html; foo=charset.

gsnedders avatar Dec 15 '15 17:12 gsnedders

@gsnedders Try if 'charset=' in response.headers['Content-Type']. At this point we're out of 'crazy' and into 'invalid formatting'.

Lukasa avatar Dec 15 '15 17:12 Lukasa

@Lukasa uh, I thought there was whitespace (or rather what 2616 had as implicit *LWS) allowed around the equals; apparently not.

The grammar appears to be:

     media-type = type "/" subtype *( OWS ";" OWS parameter )
     type       = token
     subtype    = token

   The type/subtype MAY be followed by parameters in the form of
   name=value pairs.

     parameter      = token "=" ( token / quoted-string )

So I guess the only issue here is something like text/html; foo="charset=bar".
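
For what it's worth, a substring check can be fooled by exactly that case; the stdlib can parse the parameters properly. An illustrative sketch using Python's email.message.Message (not something requests does today):

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value by
    actually parsing the parameters rather than substring matching."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_param('charset')  # None if no charset parameter
```

With this, 'text/html; foo="charset=bar"' correctly yields None.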

FWIW, html5lib's API changes are going to make the correct behaviour with requests require something like:

r = requests.get('https://api.github.com/events')
e = r.encoding if 'charset=' in r.headers['Content-Type'] else None
tree = html5lib.parse(r.content, transport_encoding=e)

gsnedders avatar Dec 15 '15 17:12 gsnedders

That seems reasonable. =)

FWIW, in requests 3.0.0 we'll reconsider this, or at least consider adding some way of working out what decision we made.

Lukasa avatar Dec 15 '15 17:12 Lukasa

Interesting discussion.

I still feel like our current behavior is a great implementation.

If I may, a counterexample. I use requests to extract the <title> from a requested document, and here is facebook.com.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> r.encoding
'ISO-8859-1'
>>> import re
>>> m = re.search(r'<title[^>]*>\s*(.+?)\s*</title>', r.text, re.IGNORECASE|re.MULTILINE)
>>> m.group(1)
u'Mathias Sundin - M\xc3\xa5nga roliga samtal mellan Tony Blair och... | Facebook'
>>> r.headers['Content-Type']
'text/html'

Apparently they don't specify the encoding in their headers. But they do in the HTML (<meta charset="utf-8" />, full example at https://gist.github.com/dentarg/f13ef7cc0ce55753faf6).

As mentioned in #2161, "requests aren't a HTML library", and I can understand that point of view. Perhaps I should be looking into parsing the HTML with some library that also can take the specified encoding into consideration.

Thank you for your work on requests. I'm looking forward to 3.0.0.

dentarg avatar Jan 09 '16 12:01 dentarg

@dentarg That's not really a counterexample: it's an example of why this system works.

The guidance from the IETF is that for all text/* encodings, one of the following things MUST be true:

  • the peer MUST send a charset in the content-type
  • the content MUST carry an encoding specifier in it (HTML, XML)

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag. Anyone working with HTML really should go looking for that tag, because servers almost never correctly advertise the content type of the HTML they deliver, but the HTML itself is usually (though sadly not always) right.
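
(The "safely decode" property holds because ISO-8859-1 maps every one of the 256 byte values to a code point, so decoding never raises; a quick demonstration:)

```python
# ISO-8859-1 assigns a character to every byte value, so decoding any
# byte string succeeds; the result can be wrong, but it cannot raise.
assert len(bytes(range(256)).decode('iso-8859-1')) == 256

# A UTF-8 document's ASCII skeleton survives the mojibake, so the
# <meta> declaration remains findable in r.text.
payload = '<meta charset="utf-8">Många roliga samtal'.encode('utf-8')
assert '<meta charset="utf-8">' in payload.decode('iso-8859-1')
```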

The behaviour requests has right now is probably less prone to failure with HTML than the one proposed for 3.0.0, and part of me does wonder if we should try to solve this more by documentation than by code change.

Lukasa avatar Jan 09 '16 14:01 Lukasa

Yes, perhaps documentation is the way forward.

I can share my experience. I'm a very new user of requests, and I haven't looked at all the documentation for requests, but I have looked at some. The requests website has this in the code snippet on the front page:

>>> r.encoding
'utf-8'

That information, combined with me somehow (maybe from browsing issues here on GitHub) finding out that requests uses chardet, gave me the impression that requests would solve the charset/encoding problem for me "all the way".

Once I found the right documentation, it was straightforward to work around the limitations with another library. I can fully understand that requests just wants to be the HTTP parser, not the HTML parser.

All that said, let me share some short examples where I think the ISO-8859-1 fallback might cause some unexpected behavior.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> from bs4 import BeautifulSoup

You can't use r.text:

>>> print BeautifulSoup(r.text, "html5lib", from_encoding="").title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.text, "html5lib", from_encoding=r.encoding).title.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 50, in feed
    doc = parser.parse(markup, encoding=self.user_specified_encoding)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/inputstream.py", line 144, in HTMLInputStream
    raise TypeError("Cannot explicitly set an encoding with a unicode string")
TypeError: Cannot explicitly set an encoding with a unicode string

You need to use r.content, but if requests has fallen back to ISO-8859-1, r.encoding will cause trouble:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding=r.encoding).title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook

Working as intended:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding="").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
>>> print BeautifulSoup(r.content, "html5lib", from_encoding="utf-8").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook

dentarg avatar Jan 09 '16 15:01 dentarg

So here we hit a problem, which is that we can't really document the way Requests interacts with every possible tool you may want to use it with: there are just too many possibilities! So instead we try to provide general documentation.

The best advice I can give is that the more specific the tool, the more likely it can handle bytes as input and DTRT with them. BeautifulSoup is an HTML tool and so can almost certainly take a stream of arbitrary bytes and find the relevant meta tag (and indeed, it can), so you can just pass it r.content. The same would be true if you were passing it to an XML library, or to a JSON library, or to anything else like that.

Lukasa avatar Jan 09 '16 16:01 Lukasa

Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the <meta> tag.

Sadly, that isn't true, because of UTF-16. In both cases, parsers should be able to handle BOMs. Furthermore, a conforming XML parser should be able to handle a UTF-16-encoded <?xml version="1.0" encoding="UTF-16"?> declaration with no preceding BOM.
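
A quick demonstration of the UTF-16 failure mode, for illustration: decoding UTF-16 bytes as ISO-8859-1 interleaves NUL characters, so a naive search for the declaration finds nothing:

```python
# BOM-less UTF-16-LE: each ASCII character is followed by a NUL byte,
# so an ISO-8859-1 decode yields '<\x00m\x00e\x00t\x00a\x00...'.
doc = '<meta charset="utf-16">'.encode('utf-16-le')
decoded = doc.decode('iso-8859-1')
assert '<meta' not in decoded           # the declaration is invisible
assert decoded.startswith('<\x00m\x00')
```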

gsnedders avatar Mar 29 '16 08:03 gsnedders