429 errors with concurrent mode disabled
Describe the bug
I can see there are similar reports of this issue (https://github.com/wjdp/htmltest/issues/105), but I get the same result even with concurrent mode disabled. Interestingly, the checks work beautifully when run locally, but we get the error when they run on CI jobs.
Non-OK status: 429 --- docs/operational_guide/resource_limits/index.html --> https://github.com/m3db/m3/tree/docs-test/site/content/docs/operational_guide/resource_limits.md
etc. And yes, it's just GitHub links.
To Reproduce
htmltest runs in CI against a Hugo site like this:
docker run --rm -it -v $(PWD)/site:/src klakegg/hugo:ext-alpine
curl https://htmltest.wjdp.uk | bash
./bin/htmltest -c site/.htmltest.yml
.htmltest.yml
DirectoryPath: site/public
IgnoreDirectoryMissingTrailingSlash: true
CheckExternal: true
IgnoreAltMissing: false
CheckImages: false
CheckScripts: false
CheckMeta: false
CheckMetaRefresh: false
CheckLinks: false
EnforceHTML5: false
EnforceHTTPS: false
IgnoreInternalEmptyHash: true
IgnoreEmptyHref: true
TestFilesConcurrently: false
DocumentConcurrencyLimit: 64
HTTPConcurrencyLimit: 8
LogLevel: 0
IgnoreURLs:
- "youtu.be"
- "youtube.com"
- "cassandra.apache.org"
Versions
- OS: local - macOS 10.15.7 (19H2), CI - Not completely sure… Sorry, Linux of some flavour
- htmltest: 0.13.0
@wjdp would you be open to accepting a PR that added a self-imposed rate limit for certain URLs?
e.g. something like this (what do you think of the general concept if someone else were to open the change?)
RateLimit:
  - URL: "github.com"
    MaxRequests: 10
    Within: 1s
Hi @ChrisChinchilla :wave:
From "works locally, fails on CI" I suspect your local broadband bottlenecks you enough that rate limiting isn't a problem there but the fat pipe at your CI provider allows for a much more effective DoS directed at github :wink: hence the 429s.
This is a problem with htmltest: right now we just scale to whatever bandwidth is available, even if that hits external rate limits.
Hey @robskillington :wave:
Yes, this sounds like the solution we need! I think we probably want a generic option first rather than a specific one, but I'm open to either really. Very happy for someone else to pick this up!
Generic option
Rate limit connections per hostname without needing to specify it per host in config. You'd set a limit in config, or perhaps we have a sensible default, which is applied to all hostnames.
Specific option
As you suggest. Or perhaps both? A default and custom overrides.
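To make the default-plus-overrides idea concrete, here is a minimal sketch of a per-hostname limiter in Go, assuming the token bucket from golang.org/x/time/rate; the `hostLimiter` type and its field names are invented for illustration and aren't anything that exists in htmltest today:

```go
package ratelimit

import (
	"context"
	"net/url"
	"sync"

	"golang.org/x/time/rate"
)

// hostLimiter hands out one token-bucket limiter per hostname.
// Hosts without an explicit override fall back to defaultLimit.
type hostLimiter struct {
	mu           sync.Mutex
	limiters     map[string]*rate.Limiter
	overrides    map[string]rate.Limit // e.g. "github.com": 10 requests/second
	defaultLimit rate.Limit
}

// get returns (creating it if needed) the limiter for a hostname.
func (h *hostLimiter) get(host string) *rate.Limiter {
	h.mu.Lock()
	defer h.mu.Unlock()
	if l, ok := h.limiters[host]; ok {
		return l
	}
	limit := h.defaultLimit
	if o, ok := h.overrides[host]; ok {
		limit = o
	}
	l := rate.NewLimiter(limit, 1) // burst of 1 keeps requests to one host evenly spaced
	h.limiters[host] = l
	return l
}

// wait blocks until a request to rawURL is allowed under its host's limit.
func (h *hostLimiter) wait(ctx context.Context, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	return h.get(u.Hostname()).Wait(ctx)
}
```

With a sensible default (say 20 requests per second) and an override of 10 for github.com, a fast CI runner would pace itself per host instead of hammering one server.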
Config naming
Likely the most contentious part of a PR for this would be the config names.
- Your suggestion looks alright, but how do we handle the generic/default case (if there is to be one, or for one in the future)?
- Keeping separate `MaxRequests` and `Within` vars is quite flexible, but could we do with assuming any rate limit is going to be per second? Or do we leave `Within` in there and just default it to `1s` so most users don't need to think about it?
Perhaps something like
RateLimit:
  - Default: true
    MaxRequests: 20
    Within: 1s [optional]
  - URL: "github.com"
    MaxRequests: 10
    Within: 1s [optional]
What do you think?
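If the config ended up looking like the YAML above, the Go side could be sketched along these lines; the field names simply mirror that YAML and the `Window` helper is hypothetical, not part of htmltest's actual options struct:

```go
package config

import "time"

// Hypothetical types mirroring the proposed YAML; htmltest's real
// options struct does not contain these fields.
type RateLimitRule struct {
	Default     bool   `yaml:"Default"`     // true for the rule applied to all other hosts
	URL         string `yaml:"URL"`         // hostname a specific rule applies to
	MaxRequests int    `yaml:"MaxRequests"` // requests allowed per window
	Within      string `yaml:"Within"`      // e.g. "1s"; empty means one second
}

// Window parses Within, defaulting to one second when it is omitted or invalid.
func (r RateLimitRule) Window() time.Duration {
	d, err := time.ParseDuration(r.Within)
	if err != nil || d <= 0 {
		return time.Second
	}
	return d
}
```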
Other solutions
A thought I had halfway through writing this reply:
If we see a 429 we could do an exponential backoff to attempt to recover. This would solve the problem without having to resort to hard-coded rate limits, which could still break when they hit a limit we didn't predict.
Arguably this could be a better solution, as it would fix the 429 problem for everyone without them having to set up rate limits or htmltest having to impose a default rate limit.
We may find ourselves needing to do both, though, as some servers likely aren't as nice as GitHub with their status codes.
I'm sure there's a Go package that could handle this for us somewhat easily. A cursory Google throws up https://github.com/cenkalti/backoff
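As a rough sketch of how that package (v4 API) could wrap a single external check — `checkURL` and `checkWithBackoff` are invented names, and htmltest's real checking code is more involved:

```go
package checker

import (
	"fmt"
	"net/http"

	"github.com/cenkalti/backoff/v4"
)

// checkURL performs one request and signals the backoff loop to retry on 429.
func checkURL(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err // network errors are retried too
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusTooManyRequests:
		return fmt.Errorf("rate limited: %s", url) // retryable
	case resp.StatusCode >= 400:
		// A genuine 4xx/5xx should be reported as-is; retrying won't help.
		return backoff.Permanent(fmt.Errorf("non-OK status: %d", resp.StatusCode))
	default:
		return nil
	}
}

// checkWithBackoff retries checkURL with exponential backoff, capped at 5 attempts.
func checkWithBackoff(client *http.Client, url string) error {
	op := func() error { return checkURL(client, url) }
	return backoff.Retry(op, backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5))
}
```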
OK, thanks @wjdp. We (@robskillington and I work together) ended up getting it working on CI by reducing the links checked to a bare minimum, but hopefully Rob's suggestion will solve the issue for others in the long term :)
I wonder if HEAD requests might not be affected by rate limiting? I find that using GET with Range=0 is not great for link checking...
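For illustration only, a HEAD-first check that falls back to a ranged GET when the server refuses HEAD might look like this; it is a sketch of the idea, not what htmltest currently does:

```go
package checker

import "net/http"

// headThenRangedGet tries a cheap HEAD first and only falls back to a
// ranged GET when the server refuses the HEAD outright.
func headThenRangedGet(client *http.Client, url string) (*http.Response, error) {
	resp, err := client.Head(url)
	if err == nil {
		if resp.StatusCode != http.StatusMethodNotAllowed {
			return resp, nil
		}
		resp.Body.Close() // server refused HEAD; retry as a ranged GET
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", "bytes=0-0") // only ask for the first byte back
	return client.Do(req)
}
```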
Hi @robskillington, I take it you never started working on this? Or is there anything someone could pick up to continue work on this issue?
I'm getting this too. It doesn't just apply to GitHub but to lots of other endpoints as well. It would be useful to have a generic retry for these, but specifying specific sites would also work (although that's more work for me). Is there any caching of results? I see the same endpoint returning 429 multiple times. Does htmltest check http://example.com/page.html every time it finds it, or does it remember the first result and return it from the cache next time? If there is caching, it might help to reduce the number of 429 errors.
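On the caching point, even a small in-memory cache keyed on the full URL would collapse repeated checks of the same link into a single request per run. A minimal sketch of that idea, independent of whatever caching htmltest itself implements:

```go
package checker

import (
	"net/http"
	"sync"
)

// statusCache remembers the first status code seen for each URL so the
// same external link is only requested once per run.
type statusCache struct {
	mu    sync.Mutex
	codes map[string]int
}

func newStatusCache() *statusCache {
	return &statusCache{codes: make(map[string]int)}
}

// status returns the cached status for url, performing the request only once.
func (c *statusCache) status(client *http.Client, url string) (int, error) {
	c.mu.Lock()
	if code, ok := c.codes[url]; ok {
		c.mu.Unlock()
		return code, nil
	}
	c.mu.Unlock()

	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()

	c.mu.Lock()
	c.codes[url] = resp.StatusCode
	c.mu.Unlock()
	return resp.StatusCode, nil
}
```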
I have just run into this in a major way trying to check links on https://afterglow-guide.deepsymmetry.org/, since it has many dozens of links to source code lines in the GitHub repository. For now I am going to have to disable checking them, but it would be very nice if 429s could lead to per-host exponential backoff (or to a tailored backoff if an interval can be parsed from the 429 response, though that is a nice-to-have compared to the necessity of simply having a queue that retries these links at progressively longer intervals).
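For the "parse an interval from the 429 response" part, the standard place to look is the Retry-After header, which carries either a delay in seconds or an HTTP date. A sketch of extracting it, with the caller assumed to feed the result into its retry queue:

```go
package checker

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the wait hinted at by a 429 response's Retry-After
// header, or falls back to the supplied default backoff interval.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	h := resp.Header.Get("Retry-After")
	if h == "" {
		return fallback
	}
	if secs, err := strconv.Atoi(h); err == nil {
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(h); err == nil {
		return time.Until(t)
	}
	return fallback
}
```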