probe_alternative_homepage no_ssl arg from fetch_url

Open hyshandler opened this issue 1 year ago • 1 comments

I'm getting the following error when trying to run probe_alternative_homepage:

ERROR:trafilatura.downloads:retries/redirects: https://www.rpgassetmanagement.com/ HTTPSConnectionPool(host='www.rpgassetmanagement.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1124)')))

When I call fetch_url with no_ssl=True, I don't get the error. However, the fetch_url call inside probe_alternative_homepage has no way to receive the no_ssl argument. Could this option be added to probe_alternative_homepage, or is there another way to work around this SSL error when using that function?

Thanks!

def fetch_url(url, decode=True, no_ssl=False, config=DEFAULT_CONFIG):
    """Fetches page using urllib3 and decodes the response.

    Args:
        url: URL of the page to fetch.
        decode: Decode response instead of returning urllib3 response object (boolean).
        no_ssl: Don't try to establish a secure connection (to prevent SSLError).
        config: Pass configuration values for output control.

    Returns:
        RawResponse object: data (headers + body), status (HTML code as string) and url
        or None in case the result is invalid or there was a problem with the network.

    """
    LOGGER.debug('sending request: %s', url)
    if pycurl is None:
        response = _send_request(url, no_ssl, config)
    else:
        response = _send_pycurl_request(url, no_ssl, config)
    if response is not None and response != '':
        return _handle_response(url, response, decode, config)
        # return '' (useful to discard further processing?)
        # return response
    LOGGER.debug('no response: %s', url)
    return None

def probe_alternative_homepage(homepage):
    "Check if the homepage is redirected and return appropriate values."
    response = fetch_url(homepage, decode=False)
    if response is None or response == '':
        return None, None, None
    # get redirected URL here?
    if response.url != homepage:
        logging.info('followed redirect: %s', response.url)
        homepage = response.url
    # decode response
    htmlstring = decode_response(response.data)
    # is there a meta-refresh on the page?
    htmlstring, homepage = refresh_detection(htmlstring, homepage)
    logging.info('fetching homepage OK: %s', homepage)
    _, base_url = get_hostinfo(homepage)
    return htmlstring, homepage, base_url

Thanks.

hyshandler avatar Mar 07 '23 15:03 hyshandler

Hi @hyshandler, thanks for your feedback. I cannot reproduce the bug; maybe your version of certifi isn't up to date.

Regardless of this particular webpage, it could make sense to use a more robust download method in probe_alternative_homepage(), although it's often a good idea to validate the HTTPS certificates of unknown websites. I'll leave the issue open and think about it.

adbar avatar Mar 07 '23 16:03 adbar
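
Until the library offers this, one possible workaround is a small local variant of the function that forwards no_ssl to the fetcher. The sketch below is not part of trafilatura: the name probe_homepage and its fetch parameter are hypothetical, it assumes a fetcher with the fetch_url signature quoted above (returning an object with .url and .data, or None), and it simplifies the original by skipping meta-refresh detection and by approximating decode_response and get_hostinfo with the standard library.

```python
from urllib.parse import urlsplit


def probe_homepage(homepage, fetch, no_ssl=False):
    """Hypothetical variant of probe_alternative_homepage that forwards no_ssl.

    `fetch` is assumed to be callable as fetch(url, decode=..., no_ssl=...)
    and to return an object with `.url` and `.data`, or None on failure.
    """
    response = fetch(homepage, decode=False, no_ssl=no_ssl)
    if response is None or response == '':
        return None, None, None
    # use the final URL if the request was redirected
    if response.url != homepage:
        homepage = response.url
    # simplified decoding (the library uses its own decode_response)
    data = response.data
    if isinstance(data, bytes):
        htmlstring = data.decode('utf-8', errors='replace')
    else:
        htmlstring = data
    # approximate the base URL as scheme plus host
    parts = urlsplit(homepage)
    base_url = f'{parts.scheme}://{parts.netloc}'
    return htmlstring, homepage, base_url


# Possible usage, assuming fetch_url still has the signature quoted in this issue:
# from trafilatura.downloads import fetch_url
# html, homepage, base_url = probe_homepage(
#     'https://www.rpgassetmanagement.com/', fetch_url, no_ssl=True)
```

Passing the fetcher in as an argument keeps the sketch independent of the installed trafilatura version. Note that no_ssl=True skips certificate validation, so it is best reserved for sites you already trust.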