web-poet BOM should take precedence over Content-Type header when detecting the encoding

Open kmike opened this issue 1 year ago • 0 comments

As explained in https://github.com/scrapy/w3lib/issues/189 and https://github.com/scrapy/scrapy/issues/5601, a BOM should take precedence over the Content-Type header when detecting the encoding.

Currently web_poet.HttpResponse prefers the Content-Type header:

import codecs
import web_poet

body = codecs.BOM_UTF8 + "Привет".encode('utf8')  # UTF-8 BOM followed by UTF-8 text
headers = {"Content-Type": "text/html; charset=cp1251"}
resp = web_poet.HttpResponse(url="http://example.com", headers=headers, body=body, status=200)

print(resp.encoding)  # cp1251, expected utf-8
print(resp.text)  # mojibake (UTF-8 bytes decoded as cp1251), expected 'Привет'
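
For illustration, the requested precedence could look roughly like the sketch below. It uses only the standard library and is not web-poet's or w3lib's actual implementation: `detect_encoding`, `charset_from_content_type`, the `_BOMS` table and the `utf-8` default are all hypothetical names and choices made for this example.

```python
import codecs
from email.message import Message

# Longer marks are tested first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def detect_encoding(body, content_type=None, default="utf-8"):
    """Hypothetical helper: BOM first, then Content-Type, then a default.

    Returns (encoding, bom_length) so the caller can skip the BOM bytes.
    """
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset, 0
    return default, 0


body = codecs.BOM_UTF8 + "Привет".encode("utf-8")
encoding, bom_length = detect_encoding(body, "text/html; charset=cp1251")
text = body[bom_length:].decode(encoding)
print(encoding, text)
```

With this ordering the header's cp1251 is only consulted when the body carries no BOM, which is the behaviour the linked w3lib and Scrapy issues ask for.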

Aug 16 '22 15:08 kmike
