trafilatura
probe_alternative_homepage no_ssl arg from fetch_url
I'm getting the following error when trying to run probe_alternative_homepage:
ERROR:trafilatura.downloads:retries/redirects: https://www.rpgassetmanagement.com/ HTTPSConnectionPool(host='www.rpgassetmanagement.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1124)')))
When I use fetch_url with no_ssl=True, I don't get the error. However, the fetch_url call inside probe_alternative_homepage doesn't have an option to pass the no_ssl argument. Is there a way to add this to probe_alternative_homepage, or is there another way to get around this SSL error when using probe_alternative_homepage?
Thanks!
def fetch_url(url, decode=True, no_ssl=False, config=DEFAULT_CONFIG):
    """Fetches page using urllib3 and decodes the response.
    Args:
        url: URL of the page to fetch.
        decode: Decode response instead of returning urllib3 response object (boolean).
        no_ssl: Don't try to establish a secure connection (to prevent SSLError).
        config: Pass configuration values for output control.
    Returns:
        RawResponse object: data (headers + body), status (HTTP code as string) and url
        or None in case the result is invalid or there was a problem with the network.
    """
    LOGGER.debug('sending request: %s', url)
    if pycurl is None:
        response = _send_request(url, no_ssl, config)
    else:
        response = _send_pycurl_request(url, no_ssl, config)
    if response is not None and response != '':
        return _handle_response(url, response, decode, config)
        # return '' (useful to discard further processing?)
        # return response
    LOGGER.debug('no response: %s', url)
    return None
def probe_alternative_homepage(homepage):
    "Check if the homepage is redirected and return appropriate values."
    response = fetch_url(homepage, decode=False)
    if response is None or response == '':
        return None, None, None
    # get redirected URL here?
    if response.url != homepage:
        logging.info('followed redirect: %s', response.url)
        homepage = response.url
    # decode response
    htmlstring = decode_response(response.data)
    # is there a meta-refresh on the page?
    htmlstring, homepage = refresh_detection(htmlstring, homepage)
    logging.info('fetching homepage OK: %s', homepage)
    _, base_url = get_hostinfo(homepage)
    return htmlstring, homepage, base_url
Thanks.
Hi @hyshandler, thanks for your feedback. I cannot reproduce the bug; maybe your version of certifi isn't up to date.
Regardless of this particular webpage, it could make sense to use a more robust download method in probe_alternative_homepage(), although it's often a good idea to validate HTTPS certificates of unknown websites. I'll leave the issue open and think about it.
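One possible shape for such a "more robust" download, offered only as a rough sketch and not as what the library does or will do: try the verified request first, and retry without certificate checks only when the caller explicitly opts in. The helper name `fetch_with_ssl_fallback` and the `allow_insecure` flag are hypothetical. The `fetch` argument is assumed to behave like the `fetch_url` quoted in this issue (accepting `decode` and `no_ssl` keywords and returning `None` on failure).

```python
from typing import Any, Callable, Optional


def fetch_with_ssl_fallback(
    url: str,
    fetch: Callable[..., Any],
    allow_insecure: bool = False,
) -> Optional[Any]:
    """Try a verified download first; optionally retry without certificate checks.

    `fetch` is assumed to accept the keyword arguments of the fetch_url()
    quoted in this issue (decode, no_ssl) and to return None on failure.
    Both this helper and `allow_insecure` are hypothetical names.
    """
    # First attempt: certificate verification stays on.
    response = fetch(url, decode=False, no_ssl=False)
    if response is not None and response != '':
        return response
    # Second attempt only if the caller explicitly opted in.
    if allow_insecure:
        return fetch(url, decode=False, no_ssl=True)
    return None
```

Under these assumptions, the first line of probe_alternative_homepage could call `fetch_with_ssl_fallback(homepage, fetch_url, allow_insecure=True)` in place of `fetch_url(homepage, decode=False)`. Keeping the insecure retry opt-in preserves certificate validation as the default for unknown websites.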