suckit
Ignore pages that have a 404 status code
Currently, suckit saves pages even when the web server reports them as not found. I think this is erroneous behaviour.
E.g. this page on my site that returns a 404 was saved to disk.
Chrome dev tools:
File explorer:
We could keep one 404 error page per website
As long as you're aware that's an opinionated choice :) Some sites have custom 404s per section of the site, some keep the original URL like in my screenshot, some redirect to a dedicated 404 URL, and some show a 404 page with a 200 response... Web crawling is messy!
Perhaps this could be a configuration thing, but that's up to you :)
A good solution could be to hash the page body, whether the response is a 404 or a 200. That way, if the page is specific to its URL it gets saved; if not, we could make a symbolic link to the generic one.
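A minimal sketch of that hashing idea in Rust, using only the standard library. The `dedup_404` helper and its signature are hypothetical, not suckit's actual code: the first time a 404 body is seen it is saved, and later identical bodies return the path of the first copy so the caller could symlink to it instead of writing a duplicate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash a page body (hypothetical helper; a real crawler might prefer
/// a stronger hash like SHA-256 to avoid collisions).
fn body_hash(body: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    body.hash(&mut h);
    h.finish()
}

/// Decide whether a 404 body is site-generic (already seen) or URL-specific.
/// Returns Some(path) of the first saved copy if generic (symlink target),
/// or None if this body is new and should be saved under `path`.
fn dedup_404(seen: &mut HashMap<u64, String>, path: &str, body: &[u8]) -> Option<String> {
    let h = body_hash(body);
    match seen.get(&h) {
        Some(first_path) => Some(first_path.clone()), // duplicate: link to first copy
        None => {
            seen.insert(h, path.to_string()); // first occurrence: save it
            None
        }
    }
}

fn main() {
    let mut seen = HashMap::new();
    let generic = b"<h1>404 Not Found</h1>";
    // First sighting of this 404 body: save it to disk.
    assert!(dedup_404(&mut seen, "site/a", generic).is_none());
    // Same body at another URL: symlink to the saved copy instead.
    assert_eq!(dedup_404(&mut seen, "site/b", generic), Some("site/a".to_string()));
}
```

On Unix the caller could then use `std::os::unix::fs::symlink` to create the link; URL-specific 404 bodies still get their own file because their hash differs.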
Yea, I think it's tricky. If it's legitimately just a bad link to a page that never existed, or an href that was relative when it shouldn't have been, you might hit an infinite loop (I've seen this in practice).
Hmm, ok. We have more serious issues and very little time at the moment; we'll give this a try later.
Yea no rush :)