suckit
Ignore pages that have a 404 status code
Currently, suckit saves pages even when the web server reports them as not found. I think this is erroneous behaviour.
E.g. this page on my site that returns a 404 was saved to disk.
Chrome dev tools:
File explorer:
We could keep one 404 error page per website
As long as you're aware that's an opinionated choice :) Some sites have custom 404s per section of the site, some keep the original URL like in my screenshot, some redirect to a dedicated 404 URL, and some show a 404 page with a 200 response... Web crawling is messy!
Perhaps this could be a configuration thing, but that's up to you :)
A good solution could be to hash the page body, whether the response is a 404 or a 200. That way, if the page is specific to its URL it gets saved; if not, we could make a symbolic link to the generic one.
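A minimal sketch of that hashing idea in Rust, using only the standard library. The `dedup_404` helper and its signature are hypothetical, not suckit's actual code: the first time a 404 body is seen it is saved, and later identical bodies return the path of the first copy so the caller could symlink to it instead of writing a duplicate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash a page body (hypothetical helper; a real crawler might prefer
/// a stronger hash like SHA-256 to avoid collisions).
fn body_hash(body: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    body.hash(&mut h);
    h.finish()
}

/// Decide whether a 404 body is site-generic (already seen) or URL-specific.
/// Returns Some(path) of the first saved copy if generic (symlink target),
/// or None if this body is new and should be saved under `path`.
fn dedup_404(seen: &mut HashMap<u64, String>, path: &str, body: &[u8]) -> Option<String> {
    let h = body_hash(body);
    match seen.get(&h) {
        Some(first_path) => Some(first_path.clone()), // duplicate: link to first copy
        None => {
            seen.insert(h, path.to_string()); // first occurrence: save it
            None
        }
    }
}

fn main() {
    let mut seen = HashMap::new();
    let generic = b"<h1>404 Not Found</h1>";
    // First sighting of this 404 body: save it to disk.
    assert!(dedup_404(&mut seen, "site/a", generic).is_none());
    // Same body at another URL: symlink to the saved copy instead.
    assert_eq!(dedup_404(&mut seen, "site/b", generic), Some("site/a".to_string()));
}
```

On Unix the caller could then use `std::os::unix::fs::symlink` to create the link; URL-specific 404 bodies still get their own file because their hash differs.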
Yea, I think it's tricky. If it's legitimately just a bad link to a page that never existed, or an href that was relative when it shouldn't have been, you might hit an infinite loop (I've seen this in practice).
Hmm, ok. We have more serious issues and very little time at the moment; we'll give this a try later.
Yea no rush :)