curriculum Make dead link detection more robust

Recently we added a list of false positives:

https://github.com/Techtonica/curriculum/blob/main/meta/false-dead-links.md

I'm assuming these are caused by:

sites that block bots
rate limiting, since we have many links to the same sites

Ideally when we run the report, it should be easy to see if we actually have dead links.

Dec 31 '20 07:12 vegetabill

@gsong any ideas on this?

Dec 31 '20 07:12 vegetabill

You mean like running a diff automatically? That would be great - I haven't thought about a real script yet.

Dec 31 '20 18:12 alodahl

I did add a line to be aware of the false positives in CONTRIBUTING.md, at least.

Dec 31 '20 18:12 alodahl

Also @CoderCarrot, do you have more insight on Bill's first comment here?

Dec 31 '20 18:12 alodahl

Also @CoderCarrot, do you have more insight on Bill's first comment here?

I was thinking about this. I do not have any immediate or straightforward ideas, but it's something I could work on. Someone more experienced may come up with a quicker, more elegant solution, but I would be happy to look into it when I have time!

Jan 05 '21 01:01 CoderCarrot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Feb 19 '21 11:02 stale[bot]

I think this would be useful. I merged better StaleBot rules and am reopening this.

Feb 27 '21 21:02 vegetabill

Hey @alodahl @vegetabill, I'd be happy to take a stab at this using some of the lint rule's available config options.

Sep 01 '21 06:09 manufacturedba

sites that block bots

The majority of false positives seem to fall under this category. The PR I put up skips the two worst offenders (CodePen and GitHub) as well as localhost. These domains were also causing a lot more timeouts when combined with other config changes I tested. Skipping is obviously not the ideal route, but it cuts down enough noise for the report to be trusted.

I think to get these domains back, they'll need to be checked by hand and/or checked far less often.

Collect all links using remark
Read existing link list
Filter for links that either have no entry or have an expired timestamp
Write back to link list with successful links and some future timestamp
Output failed links

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.
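A minimal sketch of the steps above, in Python for illustration only; it is not the script from the PR. Link collection with remark and the actual HTTP check are left out: `links` and `is_alive` are passed in by the caller. The JSON file format, `RECHECK_AFTER_SECONDS`, and every function name here are assumptions.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable

# Assumed policy: a link that passed is not re-checked for 30 days.
RECHECK_AFTER_SECONDS = 30 * 24 * 60 * 60


def load_link_list(path: Path) -> dict[str, float]:
    """Read the existing link list as {url: expiry timestamp}; empty if no file yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def check_links(
    links: Iterable[str],
    link_list: dict[str, float],
    is_alive: Callable[[str], bool],
    now: float,
) -> tuple[dict[str, float], list[str]]:
    """Check only links that have no entry or an expired timestamp.

    Returns the updated link list and the list of links that failed.
    """
    updated = dict(link_list)
    failed = []
    for url in sorted(set(links)):
        if updated.get(url, 0.0) > now:
            continue  # entry is still fresh (or was added by hand): skip the check
        if is_alive(url):
            updated[url] = now + RECHECK_AFTER_SECONDS
        else:
            updated.pop(url, None)  # assumed: failed links are not kept in the list
            failed.append(url)
    return updated, failed


def run(links: Iterable[str], path: Path, is_alive: Callable[[str], bool]) -> list[str]:
    """Load the list, check what is due, write the list back, return failed links."""
    updated, failed = check_links(links, load_link_list(path), is_alive, time.time())
    path.write_text(json.dumps(updated, indent=2, sort_keys=True))
    return failed
```

With this shape, adding an entry by hand with a far-future timestamp keeps a bot-protected site out of the check, and links verified recently are skipped on every run until their timestamp expires.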

Sep 05 '21 23:09 manufacturedba

Thanks for the insights, @manufacturedba. Would you be open to adding these ideas as notes to the last section of https://github.com/Techtonica/curriculum/blob/main/CONTRIBUTING.md#L58 as part of the PR, so the knowledge isn’t lost?

Sep 13 '21 02:09 alodahl

Yup. Do you think the suggested manual steps are feasible for the team?

Mainly it's the following:

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.

I will include the steps with the PR that implements this. The current PR is only a stop-gap.

Sep 13 '21 19:09 manufacturedba

sounds good to me!

Oct 03 '21 19:10 alodahl
