BOM should take precedence over Content-Type header when detecting the encoding

Open kmike opened this issue 1 year ago • 0 comments

As explained in https://github.com/scrapy/w3lib/issues/189 and https://github.com/scrapy/scrapy/issues/5601, a BOM should take precedence over the Content-Type header when detecting the encoding.

Currently web_poet.HttpResponse prefers the Content-Type header:

import codecs
import web_poet

# codecs.BOM is the native-order UTF-16 BOM; a UTF-8 body needs codecs.BOM_UTF8
body = codecs.BOM_UTF8 + "Привет".encode('utf8')
headers = {"Content-Type": "text/html; charset=cp1251"}
resp = web_poet.HttpResponse(url="http://example.com", headers=headers, body=body, status=200)

print(resp.encoding)  # cp1251, expected utf-8
print(resp.text)  # garbled text (body decoded as cp1251), expected 'Привет'
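
For illustration, the requested precedence could look roughly like the sketch below. It uses only the standard library and is not web-poet's or w3lib's actual code; the names `sniff_bom`, `pick_encoding`, and `decode_body`, the simplified charset regex, and the utf-8 fallback are all assumptions made for this example.

```python
import codecs
import re

# Known BOMs. The UTF-32 entries come first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(body: bytes):
    """Return (encoding, bom_length), or (None, 0) if the body has no BOM."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    return None, 0


def pick_encoding(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Hypothetical helper: a BOM wins, then the header charset, then a default."""
    bom_encoding, _ = sniff_bom(body)
    if bom_encoding:
        return bom_encoding
    # Simplified charset extraction, not a full Content-Type parser.
    match = re.search(r"charset=[\"']?([\w-]+)", content_type or "", re.I)
    return match.group(1).lower() if match else default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode the body with the chosen encoding, dropping the BOM bytes."""
    encoding = pick_encoding(body, content_type)
    _, bom_length = sniff_bom(body)
    return body[bom_length:].decode(encoding, errors="replace")
```

With the body and header from the report, `pick_encoding(body, "text/html; charset=cp1251")` selects utf-8 because the BOM is checked first, and `decode_body` strips the BOM before decoding. A body without a BOM still falls back to the charset declared in the header.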

kmike · Aug 16 '22 15:08