python-seo-analyzer
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))
Describe the bug
When crawling websites whose URLs contain non-ASCII characters (for example the character é), I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))
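The underlying failure is easy to reproduce in isolation: Python's ascii codec rejects any character outside the 0–127 range, so a request line containing é cannot be encoded before it is sent. A minimal sketch (the path below is a hypothetical example, not taken from the site):

```python
# Minimal reproduction of the encode failure: the 'ascii' codec
# cannot represent characters outside the 0-127 range, such as 'é'.
url_path = "/catégorie/page"  # hypothetical path with a non-ASCII character

try:
    url_path.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xe9' ...
```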
To Reproduce
Steps to reproduce the behavior:
- Run seoanalyze https://www.archi-graph.com/
- This website has pages with URLs containing non-ASCII characters, so the command throws the error above
Expected behavior
The program should run normally.
Desktop (please complete the following information):
- OS: Windows 10
- Browser: N/A
Smartphone (please complete the following information): N/A
Additional context
I propose a fix that sanitizes all URLs passed to the get method in the http module.
Proposed fix in the http module:

```python
import certifi
import urllib3
from urllib import parse


class Http():
    def __init__(self):
        user_agent = {'User-Agent': 'Mozilla/5.0'}
        self.http = urllib3.PoolManager(
            timeout=urllib3.Timeout(connect=1.0, read=2.0),
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where(),
            headers=user_agent
        )

    def get(self, url):
        sanitized_url = self.sanitize_url(url)
        return self.http.request('GET', sanitized_url)

    @staticmethod
    def sanitize_url(url):
        # Percent-encode non-ASCII characters in the path so the
        # request line can be encoded as ASCII.
        scheme, netloc, path, query, fragment = parse.urlsplit(url)
        path = parse.quote(path)
        sanitized_url = parse.urlunsplit((scheme, netloc, path, query, fragment))
        return sanitized_url


http = Http()
```
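For illustration, here is the percent-encoding step on its own: the URL is split, only the path component is quoted, and the pieces are reassembled. The URL below is a hypothetical example, not a page from the site above.

```python
from urllib import parse

# Split the URL, percent-encode only the path component, and reassemble.
url = "https://www.example.com/catégorie/page"  # hypothetical URL
scheme, netloc, path, query, fragment = parse.urlsplit(url)
sanitized = parse.urlunsplit((scheme, netloc, parse.quote(path), query, fragment))
print(sanitized)  # https://www.example.com/cat%C3%A9gorie/page
```

Quoting only the path leaves the scheme and host untouched, and parse.quote leaves '/' and already-ASCII characters as they are, so ASCII-only URLs pass through unchanged.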
Adding the sanitize_url static method fixes the issue described above. I tested it successfully by running seoanalyze https://www.archi-graph.com/ from the command line.
Nice. Thank you for this. I can get your fix dropped in the next release.