requests-html
When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8
When requesting a page that is ISO-8859-1 encoded:
>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'
Expected behavior:
>>> r.html.find("option")[-1].text
'Bygge-anlæg'
As far as I can see, there are two problems:
1. r.html.encoding is incorrectly set
2. r.html.element (the PyQuery instance) does not take encoding into account at all but just assumes utf-8
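The garbled character can be reproduced without any network access. The following is a minimal sketch using only the standard library; the variable names are illustrative and it does not touch requests-html itself. It only shows what happens when ISO-8859-1 bytes are decoded as UTF-8.

```python
# Bytes as an ISO-8859-1 server would send them: 'æ' is the single byte 0xE6.
raw = "Bygge-anlæg".encode("iso-8859-1")

# Wrong codec: in UTF-8, 0xE6 opens a multi-byte sequence that is never
# completed here, so it is replaced by U+FFFD (the replacement character).
wrong = raw.decode("utf-8", errors="replace")

# Right codec: the original text comes back unchanged.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```

The first value contains the replacement character seen in the report, while the second matches the expected behavior.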
I am studying this behavior. Until we find out what is happening, here is a workaround.
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://gerda.geus.dk/Gerda/Search'
r = session.get(url)
r.html.encoding = 'ISO-8859-1'
print( r.html.find("option")[-1].text )
output: Bygge-anlæg
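To avoid hard-coding the encoding, the same workaround could be wrapped in a small helper. This is only a sketch: sync_html_encoding is a hypothetical name, not part of requests-html, and it assumes that r.encoding is correct for the page. That is not guaranteed, since requests may fall back to ISO-8859-1 when the server sends no charset. Stand-in objects are used so the example runs without a network connection.

```python
from types import SimpleNamespace


def sync_html_encoding(response):
    """Hypothetical helper: copy response.encoding onto response.html.

    Assumes response.encoding is the right encoding for the page.
    """
    if response.encoding:
        response.html.encoding = response.encoding
    return response


# Stand-in for a real response (assumption: same attribute names as in
# the workaround, i.e. r.encoding and r.html.encoding).
fake = SimpleNamespace(
    encoding="ISO-8859-1",
    html=SimpleNamespace(encoding="utf8"),
)
sync_html_encoding(fake)
print(fake.html.encoding)
```

With a real session, the call would be sync_html_encoding(session.get(url)) before any call to find().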
I found an approach: requests_html.py should be patched.
requests_html.py
In the function render():
def render(self, retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False):
For the last part, the original code is:
html = HTML(url=self.url, html=content.encode(DEFAULT_ENCODING), default_encoding=DEFAULT_ENCODING)
self.__dict__.update(html.__dict__)
self.page = page
return result
It should be changed to:
html = HTML(url=self.url, html=content.encode(self.encoding), default_encoding=DEFAULT_ENCODING)
self.__dict__.update(html.__dict__)
self.page = page
return result
The html parameter should be encoded with self.encoding, the encoding detected for the website, rather than DEFAULT_ENCODING.
I tried it, and it works perfectly.
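One plausible reading of why this patch helps: if the rendered markup still declares a non-UTF-8 charset, bytes produced with the default encoding no longer match that declaration. The sketch below illustrates the mismatch with the standard library only. It assumes a page whose declared encoding is gb2312, and the sample text and variable names are illustrative, not taken from requests-html.

```python
# Assumption: the page declares gb2312, so later decoding uses that codec.
declared = "gb2312"
content = "中国"  # illustrative sample of rendered page text

# Original behavior: encode with the default (UTF-8), decode as declared.
mismatched = content.encode("utf-8").decode(declared, errors="replace")

# Patched behavior: encode with the page's own encoding, decode as declared.
matched = content.encode(declared).decode(declared)

print(mismatched == content)
print(matched == content)
```

Only the matched round trip returns the original text, which is consistent with the patch fixing the output.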
demo code
from requests_html import HTMLSession

session = HTMLSession()
response = session.get("http://cpc.people.com.cn/GB/64162/64168/64558/index.html")
# render() runs the page's JavaScript and rebuilds response.html
response.html.render()
if response.status_code == 200:
    title_node = response.html.xpath(
        '/html/body/table/tbody/tr[2]/td/table[1]/tbody/tr/td[2]/table/tbody/tr/td/table[2]/tbody/tr/td[2]/table/tbody/tr[2]/td')
    print(title_node[0].text)