python-seo-analyzer
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))
Describe the bug
When crawling websites whose URLs contain non-ASCII characters (for example the character é), I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters when calling self._output(request.encode('ascii'))
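The underlying failure is easy to reproduce in isolation: Python's ascii codec rejects any character outside the 0–127 range, so a request line containing é cannot be encoded before it is sent. A minimal sketch (the path below is a hypothetical example, not taken from the site):

```python
# Minimal reproduction of the encode failure: the 'ascii' codec
# cannot represent characters outside the 0-127 range, such as 'é'.
url_path = "/catégorie/page"  # hypothetical path with a non-ASCII character

try:
    url_path.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xe9' ...
```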
To Reproduce
Steps to reproduce the behavior:
- Run seoanalyze https://www.archi-graph.com/
- This website has pages with URLs containing non-ASCII characters, so the command throws the error above
Expected behavior
The program should run normally.
Desktop (please complete the following information):
- OS: Windows 10
- Browser: N/A
Smartphone (please complete the following information): N/A
Additional context
I propose a fix that sanitizes all URLs passed to the get method in the http module.
Proposed fix in the http module:

```python
import certifi
import urllib3
from urllib import parse


class Http():
    def __init__(self):
        user_agent = {'User-Agent': 'Mozilla/5.0'}
        self.http = urllib3.PoolManager(
            timeout=urllib3.Timeout(connect=1.0, read=2.0),
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where(),
            headers=user_agent
        )

    def get(self, url):
        sanitized_url = self.sanitize_url(url)
        return self.http.request('GET', sanitized_url)

    @staticmethod
    def sanitize_url(url):
        # Percent-encode non-ASCII characters in the path so the
        # request line can be encoded as ASCII.
        scheme, netloc, path, query, fragment = parse.urlsplit(url)
        path = parse.quote(path)
        sanitized_url = parse.urlunsplit((scheme, netloc, path, query, fragment))
        return sanitized_url


http = Http()
```
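For illustration, here is the percent-encoding step on its own: the URL is split, only the path component is quoted, and the pieces are reassembled. The URL below is a hypothetical example, not a page from the site above.

```python
from urllib import parse

# Split the URL, percent-encode only the path component, and reassemble.
url = "https://www.example.com/catégorie/page"  # hypothetical URL
scheme, netloc, path, query, fragment = parse.urlsplit(url)
sanitized = parse.urlunsplit((scheme, netloc, parse.quote(path), query, fragment))
print(sanitized)  # https://www.example.com/cat%C3%A9gorie/page
```

Quoting only the path leaves the scheme and host untouched, and parse.quote leaves '/' and already-ASCII characters as they are, so ASCII-only URLs pass through unchanged.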
Adding the sanitize_url static method fixes the issue described above. I tested it successfully by running seoanalyze https://www.archi-graph.com/ from the command line.
Nice. Thank you for this. I can get your fix dropped in the next release.