web-poet
BOM should take precedence over Content-Type header when detecting the encoding
As explained in https://github.com/scrapy/w3lib/issues/189 and https://github.com/scrapy/scrapy/issues/5601, a BOM should take precedence over the Content-Type header when detecting the encoding of a response.
Currently, web_poet.HttpResponse prefers the Content-Type header:
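import codecs
import web_poet

# UTF-8 BOM followed by UTF-8 text. Note that codecs.BOM is the UTF-16 BOM
# in native byte order, so codecs.BOM_UTF8 is what this example needs.
body = codecs.BOM_UTF8 + "Привет".encode("utf8")
# The header deliberately declares a different, conflicting encoding.
headers = {"Content-Type": "text/html; charset=cp1251"}
resp = web_poet.HttpResponse(url="http://example.com", headers=headers, body=body, status=200)
print(resp.encoding)  # cp1251, expected utf-8
print(resp.text)  # garbled, because the body is decoded as cp1251; expected 'Привет'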
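
For reference, the expected precedence can be sketched with the standard library alone. This is only an illustration of "BOM first, Content-Type second", not web-poet's or w3lib's actual implementation: `detect_encoding`, `decode_body`, the `utf-8` fallback, and the simplified `charset` regex are all hypothetical choices made for this example.

```python
import codecs
import re

# Known BOMs. The UTF-32 entries come before the UTF-16 ones because the
# UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_encoding(body, content_type=None, default="utf-8"):
    """Return (encoding, bom_length); a BOM wins over the header charset."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, len(bom)
    if content_type:
        # Simplified charset extraction, good enough for this sketch.
        match = re.search(r"charset=\s*[\"']?([\w-]+)", content_type, re.I)
        if match:
            return match.group(1).lower(), 0
    return default, 0


def decode_body(body, content_type=None):
    """Decode body with the detected encoding, dropping the BOM bytes."""
    encoding, bom_length = detect_encoding(body, content_type)
    return encoding, body[bom_length:].decode(encoding, errors="replace")


body = codecs.BOM_UTF8 + "Привет".encode("utf-8")
encoding, text = decode_body(body, "text/html; charset=cp1251")
print(encoding)  # utf-8
print(text)  # Привет
```

When no BOM is present, the sketch falls back to the header's charset, so a cp1251 body served with `charset=cp1251` still decodes correctly.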