Make dead link detection more robust

Open vegetabill opened this issue 4 years ago • 12 comments

Recently we added a list of false positives:

https://github.com/Techtonica/curriculum/blob/main/meta/false-dead-links.md

I'm assuming these are caused by:

  • sites that block bots
  • rate limiting, since we have many links to the same sites

Ideally when we run the report, it should be easy to see if we actually have dead links.
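One way to make the report easier to read would be to separate links that are clearly gone from links that failed in a way that looks like bot blocking or rate limiting. Below is a minimal Python sketch of that idea. The names `Verdict` and `triage` are hypothetical, and the status-code buckets are an assumption, not how the current link checker behaves.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    OK = "ok"
    DEAD = "dead"
    SUSPECT = "suspect"  # possibly blocked or rate limited; needs a manual check


def triage(status: Optional[int]) -> Verdict:
    """Map an HTTP status code (None means timeout or connection error) to a verdict."""
    if status is None:
        return Verdict.SUSPECT
    if 200 <= status < 400:
        return Verdict.OK
    if status in (404, 410):  # Not Found / Gone
        return Verdict.DEAD
    # 403 (Forbidden) and 429 (Too Many Requests) are typical of bot blocking
    # and rate limiting. Assumption: any other error is also only "suspect".
    return Verdict.SUSPECT
```

With something like this, the report could list only the `DEAD` links by default and show the `SUSPECT` ones separately.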

vegetabill avatar Dec 31 '20 07:12 vegetabill

@gsong any ideas on this?

vegetabill avatar Dec 31 '20 07:12 vegetabill

You mean like running a diff automatically? That would be great - I haven't thought about a real script yet.
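One possible reading of "running a diff automatically" is to subtract the known false positives from the report. A small Python sketch, assuming both inputs are already plain lists of URLs (the real format of false-dead-links.md may differ, and `real_dead_links` is a hypothetical name):

```python
from typing import Iterable, List


def real_dead_links(reported: Iterable[str], known_false_positives: Iterable[str]) -> List[str]:
    """Return reported links that are not on the known false-positive list."""
    known = {url.strip() for url in known_false_positives}
    # Strip whitespace so lines read straight from a file compare cleanly.
    return sorted({url.strip() for url in reported} - known)
```

Anything this returns would be a link that failed and has not already been reviewed by hand.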

alodahl avatar Dec 31 '20 18:12 alodahl

I did add a line to be aware of the false positives in CONTRIBUTING.md, at least.

alodahl avatar Dec 31 '20 18:12 alodahl

Also @CoderCarrot, do you have more insight on Bill's first comment here?

alodahl avatar Dec 31 '20 18:12 alodahl

Also @CoderCarrot, do you have more insight on Bill's first comment here?

I was thinking about this. I do not have any immediate or straightforward ideas, but it's something I could work on. Someone more experienced may come up with a quicker, more elegant solution, but I would be happy to look into it when I have time!

CoderCarrot avatar Jan 05 '21 01:01 CoderCarrot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 19 '21 11:02 stale[bot]

I think this would be useful. I merged better StaleBot rules and am reopening this

vegetabill avatar Feb 27 '21 21:02 vegetabill

Hey @alodahl @vegetabill, I'd be happy to take a stab at this using some of the lint rule's available config options.

manufacturedba avatar Sep 01 '21 06:09 manufacturedba

sites that block bots

The majority of false positives seem to fall under this category. The PR I put up skips the two worst offenders (codepen/github) and localhost. They were also inducing a lot more timeouts when combined with other config changes I tested. Obviously skipping is not the ideal route, but it cuts down enough noise for the report to be trusted.
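As a rough illustration of the skipping idea (not the actual PR, which uses the lint rule's config), here is a Python sketch. The hostnames `codepen.io` and `github.com` are my assumption for "codepen/github", and `SKIPPED_HOSTS` and `should_check` are hypothetical names.

```python
from urllib.parse import urlparse

# Assumed hostnames for the skipped sites; subdomains are skipped as well.
SKIPPED_HOSTS = {"codepen.io", "github.com", "localhost"}


def should_check(url: str) -> bool:
    """Return False for links whose host is on the skip list."""
    host = urlparse(url).hostname or ""
    return not any(
        host == skipped or host.endswith("." + skipped)
        for skipped in SKIPPED_HOSTS
    )
```

Matching on the parsed hostname rather than on a substring avoids accidentally skipping unrelated URLs that merely contain "github" somewhere in the path.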

I think that to get these domains back, they'll need to be checked by hand and/or checked far less often, along these lines:

  1. Collect all links using remark
  2. Read existing link list
  3. Filter for links that either have no entry or have an expired timestamp
  4. Write back to link list with successful links and some future timestamp
  5. Output failed links

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.
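The steps above could be sketched roughly as follows. This is an illustration in Python, not the planned implementation: step 1 (collecting links with remark) is assumed to happen elsewhere and arrive as `links`, the link list is assumed to be a JSON file mapping each URL to an expiry timestamp, and `check_links`, `load_link_list`, `is_alive`, and the 30-day recheck window are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional

RECHECK_AFTER_SECONDS = 30 * 24 * 60 * 60  # assumption: recheck good links after 30 days


def load_link_list(path: Path) -> Dict[str, float]:
    """Step 2: read {url: expiry_timestamp}; a missing file means an empty list."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def check_links(
    links: Iterable[str],
    path: Path,
    is_alive: Callable[[str], bool],
    now: Optional[float] = None,
) -> List[str]:
    """Check only links that are new or expired; return the ones that failed."""
    now = time.time() if now is None else now
    link_list = load_link_list(path)
    # Step 3: keep links with no entry or an expired timestamp (deduplicated).
    due = [url for url in dict.fromkeys(links) if link_list.get(url, 0) <= now]
    failed = []
    for url in due:
        if is_alive(url):
            # Step 4: record the success with a future expiry timestamp.
            link_list[url] = now + RECHECK_AFTER_SECONDS
        else:
            failed.append(url)
    path.write_text(json.dumps(link_list, indent=2, sort_keys=True))
    return failed  # Step 5: the caller outputs these
```

In this sketch, adding a URL to the JSON file by hand with a far-future timestamp keeps it from being checked, which is the manual escape hatch for sites with bot protection.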

manufacturedba avatar Sep 05 '21 23:09 manufacturedba

Thanks for the insights, @manufacturedba. Would you be open to adding these ideas as notes to the last section of https://github.com/Techtonica/curriculum/blob/main/CONTRIBUTING.md#L58 as part of the PR so the knowledge isn't lost?

alodahl avatar Sep 13 '21 02:09 alodahl

Yup. Do you have thoughts on whether the suggested manual steps are feasible for the team?

Mainly it's the following:

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.

I will include the steps with the PR that implements this. The current PR is only a stop-gap.

manufacturedba avatar Sep 13 '21 19:09 manufacturedba

Yup. Do you have thoughts on whether the suggested manual steps are feasible for the team?

Mainly it's the following:

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.

I will include the steps with the PR that implements this. The current PR is only a stop-gap.

sounds good to me!

alodahl avatar Oct 03 '21 19:10 alodahl