Make dead link detection more robust
Recently we added a list of false positives:
https://github.com/Techtonica/curriculum/blob/main/meta/false-dead-links.md
I'm assuming these are caused by:
- sites that block bots
- rate limiting, since we have many links to the same sites
Ideally when we run the report, it should be easy to see if we actually have dead links.
@gsong any ideas on this?
You mean like running a diff automatically? That would be great - I haven't thought about a real script yet.
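Something like this is what I have in mind: a rough sketch, assuming the report and `meta/false-dead-links.md` both list one URL per line (the real formats may differ):

```js
// diff-dead-links.mjs - hypothetical sketch: subtract known false
// positives from the dead-link report so only new failures surface.
// File names and formats are assumptions, not the repo's actual layout.
import fs from 'node:fs';

// Pull anything that looks like a URL out of a file, one per line.
const urlsIn = (path) =>
  fs.readFileSync(path, 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('http'));

const falsePositives = new Set(urlsIn('meta/false-dead-links.md'));
const reported = urlsIn('dead-link-report.txt');

// Anything reported that isn't on the false-positives list is suspect.
const suspect = reported.filter((url) => !falsePositives.has(url));
console.log(suspect.length ? suspect.join('\n') : 'No new dead links');
```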
I did add a line to be aware of the false positives in CONTRIBUTING.md, at least.
Also @CoderCarrot, do you have more insight on Bill's first comment here?
I was thinking about this. I don't have any immediate or straightforward ideas, but it's something I could work on. Someone more experienced may come up with a quicker, more elegant solution, but I would be happy to look into it when I have time!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think this would be useful. I merged better StaleBot rules and am reopening this.
Hey @alodahl @vegetabill, I'd be happy to take a stab at this using some of the lint rule's available config options.
> sites that block bots
The majority of false positives seem to fall under this category. The PR I put up skips the two highest offenders (codepen/github) and localhost. Those domains were also inducing a lot more timeouts when combined with other config changes I tested. Obviously skipping is not the ideal route, but it cuts down enough noise for the report to be trusted.
I think to get these domains back, they'll need to be checked by hand and/or checked far less often.
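The skip itself is nothing fancy; conceptually it's just a filter like this (patterns are illustrative, mirroring the domains above, and not the lint rule's actual config):

```js
// Hypothetical version of the stop-gap: drop the noisy domains before
// the checker ever sees them. Patterns mirror the PR's skip targets.
const SKIP_PATTERNS = [/codepen\.io/, /github\.com/, /localhost/];

const shouldCheck = (url) =>
  !SKIP_PATTERNS.some((pattern) => pattern.test(url));

// e.g. links.filter(shouldCheck) before running the dead-link check
```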
Longer-term, here's what I'd propose:

- Collect all links using `remark`
- Read the existing link list
- Filter for links that either have no entry or have an expired timestamp
- Write successful links back to the link list with some future timestamp
- Output failed links
To stop the script from checking sites with bot protection, anyone can append a link to the list themselves with whatever timestamp they like. Also, the script takes an eternity as-is, so a side effect should be a huge reduction in run time.
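Here's a rough sketch of that flow. The file name, JSON shape (URL mapped to an expiry timestamp in ms), and the 30-day window are all placeholders, not settled decisions:

```js
// check-links.mjs - sketch of the proposed flow; not the real script.
import fs from 'node:fs';
import { remark } from 'remark';
import { visit } from 'unist-util-visit';

const LINK_LIST = 'meta/link-list.json'; // assumed path and shape
const THIRTY_DAYS = 30 * 24 * 60 * 60 * 1000;

// 1. Collect all links from a markdown file using remark.
const tree = remark().parse(fs.readFileSync(process.argv[2], 'utf8'));
const links = [];
visit(tree, 'link', (node) => links.push(node.url));

// 2. Read the existing link list (URL -> expiry timestamp).
const list = fs.existsSync(LINK_LIST)
  ? JSON.parse(fs.readFileSync(LINK_LIST, 'utf8'))
  : {};

// 3. Only check links with no entry or an expired timestamp.
const toCheck = links.filter((url) => !(list[url] > Date.now()));

const failed = [];
for (const url of toCheck) {
  try {
    const res = await fetch(url, { method: 'HEAD' });
    if (res.ok) {
      // 4. Write successful links back with a future timestamp.
      list[url] = Date.now() + THIRTY_DAYS;
    } else {
      failed.push(url);
    }
  } catch {
    failed.push(url);
  }
}
fs.writeFileSync(LINK_LIST, JSON.stringify(list, null, 2));

// 5. Output failed links.
console.log(failed.join('\n'));
```

Manually exempting a bot-protected site would then just mean adding its URL to that JSON by hand with a far-future timestamp.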
Thanks for the insights, @manufacturedba. Would you be open to adding these ideas as notes to the last section of https://github.com/Techtonica/curriculum/blob/main/CONTRIBUTING.md#L58 as part of the PR, so the knowledge isn't lost?
Yup. Do you have thoughts on whether the suggested manual steps are feasible for the team?
Mainly it's the following:
> To stop the script from checking sites with bot protection, anyone can append a link to the list themselves with whatever timestamp they like. Also, the script takes an eternity as-is, so a side effect should be a huge reduction in run time.
I will include the steps with the PR that implements this. The current PR is only a stop-gap.
sounds good to me!