requests-html

When requesting a page that is ISO-8859-1 encoded, HTML is still interpreted as UTF-8

Open redhog opened this issue 4 years ago • 3 comments

When requesting a page that is ISO-8859-1 encoded:

>>> r = session.get('https://gerda.geus.dk/Gerda/Search')
>>> r.encoding
'ISO-8859-1'
>>> r.html.default_encoding
'ISO-8859-1'
>>> r.html.encoding
'utf8'
>>> r.html.find("option")[-1].text
'Bygge-anl�g'

Expected behavior:

>>> r.html.find("option")[-1].text
'Bygge-anlæg'

As far as I can see, there are two problems:

  • r.html.encoding is incorrectly set
  • r.html.element (the PyQuery instance) does not take the encoding into account at all and simply assumes UTF-8
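
The garbled character is consistent with ISO-8859-1 bytes being decoded as UTF-8. A minimal standard-library sketch of that effect, independent of requests-html (the sample string is the expected output quoted above; the variable names are illustrative):

```python
# 'æ' is the single byte 0xE6 in ISO-8859-1. In UTF-8, 0xE6 starts a
# multi-byte sequence, and the following 'g' is not a valid continuation.
expected = 'Bygge-anlæg'
raw = expected.encode('iso-8859-1')

# Decoding with the wrong codec substitutes U+FFFD for the invalid byte.
garbled = raw.decode('utf-8', errors='replace')

# Decoding with the codec the server announced recovers the text.
recovered = raw.decode('iso-8859-1')

print(repr(garbled))
print(recovered)
```

This only illustrates the symptom; it does not show where in requests-html the wrong codec is chosen.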

redhog, Jan 27 '21 10:01

I am studying this behavior. Until we find out what is happening, here is a workaround.

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://gerda.geus.dk/Gerda/Search'
r = session.get(url)

# Override the encoding that was chosen for the parsed HTML.
r.html.encoding = 'ISO-8859-1'

print(r.html.find("option")[-1].text)

Output: Bygge-anlæg
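
As a sketch, the workaround can be generalized by copying the encoding that requests reports onto the parsed HTML whenever the two disagree. The helper name align_html_encoding is hypothetical and not part of requests-html; it relies only on r.encoding and the settable r.html.encoding used above.

```python
import codecs


def align_html_encoding(response):
    """Copy response.encoding onto response.html.encoding if they differ.

    Hypothetical helper, not part of requests-html. Returns the encoding
    in effect afterwards.
    """
    declared = response.encoding
    if not declared:
        # Nothing reported for the response; leave the parser's guess alone.
        return response.html.encoding

    def normalize(name):
        # Map aliases such as 'utf8' and 'UTF-8' to one canonical name.
        try:
            return codecs.lookup(name).name
        except (LookupError, TypeError):
            return None

    if normalize(response.html.encoding) != normalize(declared):
        response.html.encoding = declared
    return response.html.encoding
```

It would be called as align_html_encoding(r) right after session.get(url) and before r.html.find(...). Note that requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, so this is only appropriate when the reported encoding can be trusted.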

davidkwast, Feb 11 '21 00:02

I found an approach: requests_html.py should be patched.

In requests_html.py, in the function render():

def render(self, retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False):

At the end of the function, the original code is:

        html = HTML(url=self.url, html=content.encode(DEFAULT_ENCODING), default_encoding=DEFAULT_ENCODING)
        self.__dict__.update(html.__dict__)
        self.page = page
        return result

It should be changed to:

        html = HTML(url=self.url, html=content.encode(self.encoding), default_encoding=DEFAULT_ENCODING)
        self.__dict__.update(html.__dict__)
        self.page = page
        return result

The html parameter should be encoded with self.encoding, the encoding reported for the website, instead of DEFAULT_ENCODING.

I tried this, and it works perfectly.
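
A possible explanation for why this helps, sketched with the standard library only: if the rendered markup still declares the site's original charset and the parser honours that declaration (an assumption about the parser, not something verified here), then encoding the text as UTF-8 yields bytes that are later decoded with a different codec. The sample string and variable names are illustrative.

```python
declared = 'iso-8859-1'   # charset the page announces (illustrative)
content = 'Bygge-anlæg'   # text as returned by the rendered page

# Original code path: encode as UTF-8, then decode with the declared charset.
mismatched = content.encode('utf-8').decode(declared)

# Patched code path: encode and decode with the same declared charset.
matched = content.encode(declared).decode(declared)

print(mismatched)
print(matched)
```

One caveat under the same assumption: str.encode raises UnicodeEncodeError when the text contains characters the target charset cannot represent, so the patched line may need an errors= argument for pages whose rendered content goes beyond the declared charset.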

ziyouchutuwenwu, Apr 13 '21 04:04

Demo code:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get("http://cpc.people.com.cn/GB/64162/64168/64558/index.html")
response.html.render()

if response.status_code == 200:
    title_node = response.html.xpath(
        '/html/body/table/tbody/tr[2]/td/table[1]/tbody/tr/td[2]/table/tbody/tr/td/table[2]/tbody/tr/td[2]/table/tbody/tr[2]/td')
    print(title_node[0].text)

ziyouchutuwenwu, Apr 13 '21 04:04