urlchecker-action

Worry the checker is actually downloading the content at the checked URL

Open markcmiller86 opened this issue 1 year ago • 9 comments

For some reason, this goes pretty slowly. I am working from this document and it takes quite a while to complete a check. Next, I notice that on .pdf files, it stalls for longer, especially the one at the ftp link.

So, this has me worried that it is actually fetching the full content just to check the link. I've seen similar behavior in the Sphinx URL checking feature too. It should really just get the headers at each URL and not the full content.

Is this something you've looked into?

markcmiller86 avatar Feb 01 '24 05:02 markcmiller86

We should do head instead of get, I agree. We haven't looked into it but can.

vsoch avatar Feb 01 '24 06:02 vsoch

Would you like me to update the branch we are working on to try it out?

vsoch avatar Feb 01 '24 06:02 vsoch

ChatGPT suggests something along the lines of

import requests

def is_url_working(url):
    try:
        response = requests.head(url)
        # You might also want to check for redirects (response.status_code == 302)
        if response.status_code == 200:
            return True
        elif response.status_code == 405: # head requests disallowed
            retval = False
            response = requests.get(url, stream=True)
            if response.status_code == 200:
                retval = True
            response.close()  # Make sure to close the response
            return retval
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return False

# Example usage
url = "http://example.com"
if is_url_working(url):
    print(f"The URL {url} is working.")
else:
    print(f"The URL {url} is not working.")

markcmiller86 avatar Feb 01 '24 06:02 markcmiller86

Oh geez, ChatGPT? 🙃 I see issues in the first few lines:

  • Following redirects is a parameter on head. https://requests.readthedocs.io/en/latest/user/quickstart/#redirection-and-history
  • The retval and response.close don't make sense.

I appreciate the suggestion, but I don't think the quality of code from AI tools is very good. It's mostly copy-pasting some poor soul's code from somewhere else on GitHub. I'm happy to write this with my own knowledge and careful inspection of core docs and library code to get the functionality I want.

vsoch avatar Feb 01 '24 07:02 vsoch

But I have to take it back - it does look like response.close() is useful for requests.get()! Geez, I've been writing in Python a long time and I just don't see it very often. So I learned something from ChatGPT! I appreciate the post, and I'll try to be more open minded about it (even if I don't use it)!
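For reference, here is one way the points raised in this thread could be combined. It is only a sketch, not what urlchecker ships: `allow_redirects=True` is passed explicitly because `head()` in requests does not follow redirects by default, the GET fallback uses `stream=True` so the body is not downloaded up front, and the `with` block closes the response. The `session` parameter, the timeout value, and the choice to treat 501 like 405 are assumptions made for the example.

```python
import requests


def is_url_working(url, session=None, timeout=10.0):
    """Return True if the URL answers with a status below 400.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level functions.
    """
    http = session if session is not None else requests
    try:
        # head() does not follow redirects unless asked to.
        response = http.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code in (405, 501):
            # HEAD is disallowed or unimplemented: fall back to GET, but
            # stream it so the body is not fetched, and close it on exit.
            with http.get(
                url, stream=True, allow_redirects=True, timeout=timeout
            ) as fallback:
                return fallback.ok
        return response.ok
    except requests.RequestException:
        # Connection errors, timeouts, invalid URLs, too many redirects.
        return False
```

Every code path returns a boolean here, so a 404 or 500 yields False instead of falling through.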

vsoch avatar Feb 01 '24 07:02 vsoch

"I'll try to be more open minded about it (even if I don't use it)!"

So, I just happened to see this Q&A with Linus Torvalds about AI tools in coding...

markcmiller86 avatar Feb 01 '24 17:02 markcmiller86

haha I totally watched that! I'll watch again tonight with new context.

vsoch avatar Feb 01 '24 18:02 vsoch

So, the more I think about this, the more I think HEAD requests are most likely to be useful for links to non-.html content (e.g. .pdf or other binary content a link references). I think for most .html content, the cost to download is likely minimal.
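One way to act on that observation is to pick the request method from the URL's file suffix. The sketch below is purely illustrative: `preferred_method` and the suffix list are hypothetical (the thread only names .pdf), and a suffix is just a heuristic, since the real content type is known only from the response headers.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list of suffixes that usually mean large, non-HTML content.
BINARY_SUFFIXES = {".pdf", ".zip", ".gz", ".tar", ".png", ".jpg"}


def preferred_method(url):
    """Suggest "HEAD" for likely-binary links and "GET" for everything else."""
    # Look only at the path, so query strings and fragments are ignored.
    path = urlsplit(url).path
    suffix = posixpath.splitext(path)[1].lower()
    return "HEAD" if suffix in BINARY_SUFFIXES else "GET"
```

For example, `preferred_method("https://example.com/docs/paper.pdf")` suggests HEAD, while a link ending in `index.html` or a bare directory suggests GET.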

markcmiller86 avatar Feb 03 '24 18:02 markcmiller86

Agree! Let me work hard today (just presented at FOSDEM) and maybe I can do some work on this later if I'm productive!

vsoch avatar Feb 03 '24 18:02 vsoch