Look for repo and homepage redirects
It would be nice to add a spider that looks for redirects on the homepage and repo URLs and adjusts them automatically (or with manual intervention). We do this for https://landscape.cncf.io/ with https://github.com/cncf/landscapeapp/blob/master/tools/checkLinks.js and find that (good news!) many homepages have moved to https over the last couple of years.
Request is inspired by https://github.com/coreinfrastructure/best-practices-badge/issues/1365
Good idea. We have modified the badge app so that people can easily update their repo URLs themselves. I guess there is a risk that a redirection might be temporary or otherwise not reliable. So we might want to check it several times, and only update if the redirection appears to be stable over time.
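To make the "stable over time" idea concrete, here is a minimal sketch. It assumes we record the redirect target seen on each periodic check (`None` when there was no redirect or the check failed). The function name `stable_redirect_target` and the default of three checks are hypothetical, not anything the badge app has today.

```python
from typing import Optional, Sequence


def stable_redirect_target(
    observed: Sequence[Optional[str]], min_checks: int = 3
) -> Optional[str]:
    """Return the redirect target if the most recent `min_checks`
    observations all agree on it; otherwise return None.

    `observed` is ordered oldest to newest. Each entry is the target URL
    seen on one check, or None if no redirect was seen on that check.
    """
    if min_checks < 1 or len(observed) < min_checks:
        return None  # not enough history to call anything stable
    recent = observed[-min_checks:]
    first = recent[0]
    if first is None:
        return None
    # Every recent check must have produced the same target.
    return first if all(target == first for target in recent) else None
```

With this shape, a single flaky check resets the count, which errs on the side of keeping the existing URL.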
On further reflection, I'm a lot less certain about this. People sometimes set up "real" domains, and then redirect to the site that currently actually serves the information, but want the "real" domain to stay as-is. They should be using temporary redirects when they do this, but I suspect some people don't even know that there are different kinds of redirects, or accidentally use the wrong one. So while doing this automatically would be helpful to some, I fear we might accidentally do the wrong thing for others.
Until recently people couldn't modify their repo URLs for security reasons (we want to keep people from rapidly switching their "identity"). But we just recently changed that (now they can change the repo URL, as long as they haven't changed the repo URL within 180 days).
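For illustration only, the 180-day rule could be expressed like this. The names `may_change_repo_url` and `REPO_URL_CHANGE_COOLDOWN` are hypothetical, and treating "exactly 180 days ago" as allowed is an assumption; this is not the badge app's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cooldown, taken from the "within 180 days" wording above.
REPO_URL_CHANGE_COOLDOWN = timedelta(days=180)


def may_change_repo_url(last_changed: Optional[datetime], now: datetime) -> bool:
    """Return True if the repo URL may be changed at time `now`.

    `last_changed` is when the repo URL was last changed, or None if it
    has never been changed.
    """
    if last_changed is None:
        return True
    # Assumption: a change exactly 180 days ago no longer blocks a new one.
    return now - last_changed >= REPO_URL_CHANGE_COOLDOWN
```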
So now projects can fix the information themselves, and if they do it themselves, then there's no issue. Most end users won't consider the current situation a problem - when they click on a link, they get redirected to the right place.
So while I agree it wouldn't be hard to do, I'm a lot less certain we should do it.
If we ignore the homepage, I definitely think it's worth doing when the GitHub repo moves. In that case, we're serving outdated information (including what external sites like the CNCF Landscape use to look projects up).
For homepages, if they've moved from http to https, I think that at least is worth us updating.
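A rough sketch of what "moved from http to https" could mean in code: the redirect target differs from the stored URL only in its scheme. The helper name `is_https_upgrade` is hypothetical, and ignoring explicit port numbers is a simplifying assumption.

```python
from urllib.parse import urlsplit


def is_https_upgrade(old_url: str, new_url: str) -> bool:
    """Return True if `new_url` is `old_url` with http swapped for https.

    Host, path, and query must match; an empty path is treated the same
    as "/". Explicit ports are ignored (assumption for this sketch).
    """
    old, new = urlsplit(old_url), urlsplit(new_url)
    return (
        old.scheme == "http"
        and new.scheme == "https"
        and old.hostname is not None
        and old.hostname == new.hostname  # hostname is already lowercased
        and (old.path or "/") == (new.path or "/")
        and old.query == new.query
    )
```

A redirect that also changes the host (for example, adding `www.`) would not count as a plain upgrade here and would need a separate decision.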
If the repo URL moves from one GitHub URL to another GitHub URL, I agree that that is almost certainly fine.
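A minimal sketch of how the "GitHub URL to another GitHub URL" case could be recognized. The helper names are hypothetical, and it assumes repo URLs have the form `https://github.com/OWNER/REPO`; the URLs in the usage note are made up.

```python
from typing import Optional
from urllib.parse import urlsplit


def github_repo_slug(url: str) -> Optional[str]:
    """Return "owner/repo" (lowercased) for a GitHub repo URL, else None."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname != "github.com":
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        return None  # not a plain OWNER/REPO path
    owner, repo = segments
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}".lower()


def is_github_to_github_move(old_url: str, new_url: str) -> bool:
    """Return True if both URLs are GitHub repos and they differ."""
    old_slug = github_repo_slug(old_url)
    new_slug = github_repo_slug(new_url)
    return old_slug is not None and new_slug is not None and old_slug != new_slug
```

For example, `is_github_to_github_move("https://github.com/old-org/project", "https://github.com/new-org/project")` would be true, while a move to a different host would not.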
There is some trickiness with the homepage URL. Even if it works, sometimes the HTTPS URL is not reliable. We already have that problem with Valgrind, which has an HTTPS URL but it doesn't really work properly due to certificate problems. Perhaps if it works well for a number of days that is good enough evidence to automatically update.
I agree with the overall goal; I just don't want to replace good data with data that looks like an update but is in fact wrong.
GitHub handles redirects pretty reliably, but the redirect itself may not last forever if the original owner creates a repo with a colliding name. So the instant a redirect is discovered from one GitHub repo to another (which I assume is a 301), I think the URL should be updated.
For anything else, HTTP status codes should dictate it - a 301 or a 308 is permanent, and it's incorrect NOT to delete the old URL and replace it with the new one - and a 302 or 307 is temporary, and the only correct thing there is to keep the old URL.
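As a sketch of letting the status code decide: per the HTTP semantics spec, 301 and 308 are the permanent redirects, and 302 and 307 are the temporary ones. The function name `classify_redirect` and the string labels are hypothetical.

```python
PERMANENT_REDIRECTS = {301, 308}  # Moved Permanently, Permanent Redirect
TEMPORARY_REDIRECTS = {302, 307}  # Found, Temporary Redirect


def classify_redirect(status: int) -> str:
    """Classify an HTTP status code for the purpose of updating a stored URL.

    Returns "permanent" (replace the stored URL), "temporary" (keep the
    stored URL), or "other" (not one of the redirects considered here).
    """
    if status in PERMANENT_REDIRECTS:
        return "permanent"
    if status in TEMPORARY_REDIRECTS:
        return "temporary"
    return "other"
```

Codes such as 303 fall into "other" here, so nothing would be updated on their account.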
Standards-wise, you're right. Our concern is that we've seen cases where the permanent redirect should NOT lead to a replacement. Maybe they should just fix their sites :-).
Indeed, I think that's something they'd just need to fix - especially since Google and other search engines would obey the status codes too.