429 errors with concurrent mode disabled
Describe the bug
I can see there are similar reports of this issue (https://github.com/wjdp/htmltest/issues/105), but I get the same result even with concurrent mode disabled. Interestingly, the checks work beautifully when run locally, but we get the error when they run on CI jobs.
Non-OK status: 429 --- docs/operational_guide/resource_limits/index.html --> https://github.com/m3db/m3/tree/docs-test/site/content/docs/operational_guide/resource_limits.md
etc. And yes, it's just GitHub links.
To Reproduce
htmltest runs in CI against a Hugo site like this:
docker run --rm -it -v $(PWD)/site:/src klakegg/hugo:ext-alpine
curl https://htmltest.wjdp.uk | bash
./bin/htmltest -c site/.htmltest.yml
.htmltest.yml
DirectoryPath: site/public
IgnoreDirectoryMissingTrailingSlash: true
CheckExternal: true
IgnoreAltMissing: false
CheckImages: false
CheckScripts: false
CheckMeta: false
CheckMetaRefresh: false
CheckLinks: false
EnforceHTML5: false
EnforceHTTPS: false
IgnoreInternalEmptyHash: true
IgnoreEmptyHref: true
TestFilesConcurrently: false
DocumentConcurrencyLimit: 64
HTTPConcurrencyLimit: 8
LogLevel: 0
IgnoreURLs:
- "youtu.be"
- "youtube.com"
- "cassandra.apache.org"
Versions
- OS: local - macOS 10.15.7 (19H2), CI - Not completely sure… Sorry, Linux of some flavour
- htmltest: 0.13.0
@wjdp would you be open to accepting a PR that added a self-imposed rate limit for certain URLs?
e.g. something like this (what do you think of the general concept if someone else were to open the change?)
RateLimit:
  - URL: "github.com"
    MaxRequests: 10
    Within: 1s
Hi @ChrisChinchilla :wave:
From "works locally, fails on CI" I suspect your local broadband bottlenecks you enough that rate limiting isn't a problem there but the fat pipe at your CI provider allows for a much more effective DoS directed at github :wink: hence the 429s.
This is a problem with htmltest: right now we just scale to whatever bandwidth is available, even if that hits external rate limits.
Hey @robskillington :wave:
Yes, this sounds like the solution we need! I think we probably want a generic option first rather than a specific one, but I'm open to either really. Very happy for someone else to pick this up!
Generic option
Rate limit connections per hostname without needing to specify it per host in config. You'd set a limit in config, or perhaps we have a sensible default, which is applied to all hostnames.
Specific option
As you suggest. Or perhaps both? A default and custom overrides.
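To make the default-plus-overrides idea concrete, here is a minimal sketch of a per-hostname limiter in Go, assuming the token bucket from golang.org/x/time/rate; the `hostLimiter` type and its field names are invented for illustration and aren't anything that exists in htmltest today:

```go
package ratelimit

import (
	"context"
	"net/url"
	"sync"

	"golang.org/x/time/rate"
)

// hostLimiter hands out one token-bucket limiter per hostname.
// Hosts without an explicit override fall back to defaultLimit.
type hostLimiter struct {
	mu           sync.Mutex
	limiters     map[string]*rate.Limiter
	overrides    map[string]rate.Limit // e.g. "github.com": 10 requests/second
	defaultLimit rate.Limit
}

// get returns (creating it if needed) the limiter for a hostname.
func (h *hostLimiter) get(host string) *rate.Limiter {
	h.mu.Lock()
	defer h.mu.Unlock()
	if l, ok := h.limiters[host]; ok {
		return l
	}
	limit := h.defaultLimit
	if o, ok := h.overrides[host]; ok {
		limit = o
	}
	l := rate.NewLimiter(limit, 1) // burst of 1 keeps requests to one host evenly spaced
	h.limiters[host] = l
	return l
}

// wait blocks until a request to rawURL is allowed under its host's limit.
func (h *hostLimiter) wait(ctx context.Context, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	return h.get(u.Hostname()).Wait(ctx)
}
```

With a sensible default (say 20 requests per second) and an override of 10 for github.com, a fast CI runner would pace itself per host instead of hammering one server.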
Config naming
Likely the most contentious part of a PR for this would be the config names.
- Your suggestion looks alright, but how do we handle the generic/default case (if there is to be one, or for one in the future)?
- Keeping separate `MaxRequests` and `Within` vars is quite flexible, but could we do with assuming any rate limit is going to be per second? Or do we leave `Within` in there and just default it to `1s` so most users don't need to think about it?
Perhaps something like
RateLimit:
  - Default: true
    MaxRequests: 20
    Within: 1s [optional]
  - URL: "github.com"
    MaxRequests: 10
    Within: 1s [optional]
What do you think?
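If the config ended up looking like the YAML above, the Go side could be sketched along these lines; the field names simply mirror that YAML and the `Window` helper is hypothetical, not part of htmltest's actual options struct:

```go
package config

import "time"

// Hypothetical types mirroring the proposed YAML; htmltest's real
// options struct does not contain these fields.
type RateLimitRule struct {
	Default     bool   `yaml:"Default"`     // true for the rule applied to all other hosts
	URL         string `yaml:"URL"`         // hostname a specific rule applies to
	MaxRequests int    `yaml:"MaxRequests"` // requests allowed per window
	Within      string `yaml:"Within"`      // e.g. "1s"; empty means one second
}

// Window parses Within, defaulting to one second when it is omitted or invalid.
func (r RateLimitRule) Window() time.Duration {
	d, err := time.ParseDuration(r.Within)
	if err != nil || d <= 0 {
		return time.Second
	}
	return d
}
```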
Other solutions
A thought I had halfway through writing this reply:
If we see a 429 we could do an exponential backoff to attempt to recover. This would solve the problem without having to resort to hard-coded rate limits, which could still break when they hit a limit we didn't predict.
Arguably this could be a better solution, as it would fix the 429 problem for everyone without them having to set up rate limits or htmltest having to impose a default rate limit.
We may find ourselves needing to do both, though, as some servers likely aren't as nice as GitHub with their status codes.
I'm sure there's a Go package that could handle this for us somewhat easily. A cursory Google throws up https://github.com/cenkalti/backoff
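As a rough sketch of how that package (v4 API) could wrap a single external check — `checkURL` and `checkWithBackoff` are invented names, and htmltest's real checking code is more involved:

```go
package checker

import (
	"fmt"
	"net/http"

	"github.com/cenkalti/backoff/v4"
)

// checkURL performs one request and signals the backoff loop to retry on 429.
func checkURL(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err // network errors are retried too
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusTooManyRequests:
		return fmt.Errorf("rate limited: %s", url) // retryable
	case resp.StatusCode >= 400:
		// A genuine 4xx/5xx should be reported as-is; retrying won't help.
		return backoff.Permanent(fmt.Errorf("non-OK status: %d", resp.StatusCode))
	default:
		return nil
	}
}

// checkWithBackoff retries checkURL with exponential backoff, capped at 5 attempts.
func checkWithBackoff(client *http.Client, url string) error {
	op := func() error { return checkURL(client, url) }
	return backoff.Retry(op, backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5))
}
```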
OK, thanks @wjdp. We (@robskillington and I work together) ended up getting it working on CI by reducing the links checked to a bare minimum, but hopefully Rob's suggestion will solve the issue for others in the long term :)
I wonder if HEAD requests might not be affected by rate limiting? I find that using GET with Range=0 is not great for link checking...
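For illustration only, a HEAD-first check that falls back to a ranged GET when the server refuses HEAD might look like this; it is a sketch of the idea, not what htmltest currently does:

```go
package checker

import "net/http"

// headThenRangedGet tries a cheap HEAD first and only falls back to a
// ranged GET when the server refuses the HEAD outright.
func headThenRangedGet(client *http.Client, url string) (*http.Response, error) {
	resp, err := client.Head(url)
	if err == nil {
		if resp.StatusCode != http.StatusMethodNotAllowed {
			return resp, nil
		}
		resp.Body.Close() // server refused HEAD; retry as a ranged GET
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", "bytes=0-0") // only ask for the first byte back
	return client.Do(req)
}
```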
Hi @robskillington, I take it you never started working on this? Or is there anything someone could pick up to continue work on this issue?
I'm getting this too. It doesn't just apply to GitHub but to lots of other endpoints as well. It would be useful to have a generic retry for these, but specifying specific sites would also work (although that's more work for me). Is there any caching of results? I see the same endpoint returning 429 multiple times. Does htmltest check http://example.com/page.html every time it finds it, or does it remember the first result and return it from the cache next time? If there is caching, it might help to reduce the number of 429 errors.
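On the caching point, even a small in-memory cache keyed on the full URL would collapse repeated checks of the same link into a single request per run. A minimal sketch of that idea, independent of whatever caching htmltest itself implements:

```go
package checker

import (
	"net/http"
	"sync"
)

// statusCache remembers the first status code seen for each URL so the
// same external link is only requested once per run.
type statusCache struct {
	mu    sync.Mutex
	codes map[string]int
}

func newStatusCache() *statusCache {
	return &statusCache{codes: make(map[string]int)}
}

// status returns the cached status for url, performing the request only once.
func (c *statusCache) status(client *http.Client, url string) (int, error) {
	c.mu.Lock()
	if code, ok := c.codes[url]; ok {
		c.mu.Unlock()
		return code, nil
	}
	c.mu.Unlock()

	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	c.mu.Lock()
	c.codes[url] = resp.StatusCode
	c.mu.Unlock()
	return resp.StatusCode, nil
}
```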
I have just run into this in a major way trying to check links on https://afterglow-guide.deepsymmetry.org/, since it has many dozens of links to source code lines in the GitHub repository. For now I am going to have to disable checking them, but it would be very nice if 429s could lead to per-host exponential backoff (or to a tailored backoff if an interval can be parsed from the 429 response, though that is a nice-to-have compared to the necessity of simply having a queue that retries these links at progressively longer intervals).
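For the "parse an interval from the 429 response" part, the standard place to look is the Retry-After header, which carries either a delay in seconds or an HTTP date. A sketch of extracting it, with the caller assumed to feed the result into its retry queue:

```go
package checker

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the wait hinted at by a 429 response's Retry-After
// header, or falls back to the supplied default backoff interval.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	h := resp.Header.Get("Retry-After")
	if h == "" {
		return fallback
	}
	if secs, err := strconv.Atoi(h); err == nil {
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(h); err == nil {
		return time.Until(t)
	}
	return fallback
}
```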