Run link check on the PR files, not the whole repo
It seems we've hit GH rate limits
Should we limit link check to the files updated in the PR and avoid checking all of them?
Listing other alternatives to consider (possibly in addition):
- Run link check on merge queue
- https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/13816#issuecomment-2853431560
I got close with the following config change.
tl;dr: Accept 429s as a "pass", and use a `.lycheecache` file so lychee won't bother re-checking a URL for a given period of time (here configured to 7 days). We're using this on another internal repo, along with GitHub Actions `@actions/cache` to cache that `.lycheecache` file between runs. (A local sketch of exercising the cache is at the end of this comment.)
```diff
diff --git a/.lychee.toml b/.lychee.toml
index 9dcde621..38974974 100644
--- a/.lychee.toml
+++ b/.lychee.toml
@@ -1,6 +1,22 @@
+verbose = "info"
include_fragments = true
-accept = ["200..=299", "403"]
+# Cache HTTP status of URLs for 7 days. This is nicer for re-runs and helps with
+# common 429s from github.com. It *does* mean that a link that goes away will
+# not be noticed for a week.
+cache=true
+cache_exclude_status = ["429"]
+max_cache_age="7d"
+
+accept = [
+ "200..=299",
+ "403",
+ # Don't let 429s (common from GitHub) break the link check, otherwise
+ # the cache file from `cache=true` above doesn't get saved by
+ # `actions/cache` in GitHub CI.
+ "429"
+]
+
exclude = [
"^https://www.foo.bar",
@@ -10,15 +26,7 @@ exclude = [
# TODO (lmolkova) treat timeout as not a failure?
"^https://www.intersystems.com",
"^https://gcc.gnu.org",
- "^https://github.com" # to avoid rate limit errors
- # TODO (trask) look at options
]
-# better to be safe and avoid failures
-max_retries = 6
-
-# better to be safe and avoid failures
-timeout = 60
-
# insecure is currently needed for https://osi-model.com
insecure = true
diff --git a/Makefile b/Makefile
index 8d2245e9..d2bdb794 100644
--- a/Makefile
+++ b/Makefile
@@ -129,15 +129,16 @@ normalized-link-check:
fi
.PHONY: markdown-link-check
-markdown-link-check: normalized-link-check
+#XXX
+#markdown-link-check: normalized-link-check
+markdown-link-check:
+ touch .lycheecache # with docker run '-u' usage, we need to create outside the container
$(DOCKER_RUN) --rm \
- $(DOCKER_USER_IS_HOST_USER_ARG) \
+ $(DOCKER_USER_IS_HOST_USER_ARG) \
--mount 'type=bind,source=$(PWD),target=/home/repo' $(LYCHEE_GITHUB_TOKEN_ARG) \
$(LYCHEE_CONTAINER) \
--config home/repo/.lychee.toml \
--root-dir /home/repo \
- --verbose \
- --timeout=60 \
$(MARKDOWN_LINK_CHECK_ARG) \
home/repo
```
I say "close", because I've stopped on these issues:
- I get `Error: Permission denied (os error 13)` when lychee attempts to save `.lycheecache` from inside the container. I believe this is from the `$(DOCKER_USER_IS_HOST_USER_ARG)` usage. I don't recall/know the reason for using `-u ...` with `docker run`.
- I still get some failures here and there, which get in the way of saving the cache file that is meant to help, e.g.:
```
[home/repo/docs/database/cassandra.md]:
[ERROR] https://img.shields.io/badge/-development-blue | Network error: error sending request for url (https://img.shields.io/badge/-development-blue) Maybe a certificate error?
[ERROR] https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlConfigConsistency.html | Network error: error sending request for url (https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlConfigConsistency.html) Maybe a certificate error?
[ERROR] https://cassandra.apache.org/ | Network error: error sending request for url (https://cassandra.apache.org/) Maybe a certificate error?
[ERROR] https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/ | Network error: error sending request for url (https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) Maybe a certificate error?
```
Aside:
- You can ignore the "XXX" change to not run the `normalized-link-check` target. This target fails on macOS, at least for me, because the `model/[a-z]*` shell pattern with the `/bin/sh` shell does include "model/README.md" for some reason I don't understand. With macOS `grep` one could instead use `--exclude model/README.md`, but I couldn't get that to work with the grep in an `ubuntu:latest` container, so I gave up on that.
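For what it's worth, the cache behavior can also be exercised with the lychee CLI directly, outside the container, which sidesteps the permission issue above while experimenting. A rough local sketch (flag names as I understand lychee's CLI; double-check them against the lychee version pinned in the Makefile):

```sh
# Run lychee directly against the repo, writing/reading .lycheecache in the
# current directory. --max-cache-age mirrors max_cache_age in .lychee.toml.
lychee --config .lychee.toml --root-dir "$(pwd)" \
  --cache --max-cache-age 7d \
  .

# A second run right away should be noticeably faster and should not re-hit
# github.com for URLs already recorded in .lycheecache.
```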
Do we know if a 429 indicates the target exists? (i.e. does GitHub return a 404 when it doesn't exist, even if you are over the rate limit?)
cc @austinlparker who did something similar in https://github.com/open-telemetry/community/pull/2720
I'd suspect that the rate limit check is before the "does the file exist" check but /shrug
I don't think there's really a great way to fix this, I suspect GitHub is more aggressively rate limiting all HTTP requests as a response to AI bots.
To me it seems like the link problems are mostly caused by old links becoming out of date as they age, not new links being added. I think something like periodically checking links on main (maybe 1/day/repo) is likely good enough? Seems like it shouldn't be too tough to run a link checker on a cron job that opens an issue if it fails. Even if a new bad link is added, it would simply delay its discovery by some number of hours. Not really the end of the world.
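As a rough sketch, the body of such a scheduled job could be as simple as the following (this assumes the GitHub CLI is available and `GH_TOKEN` is set in the job environment; the cron trigger itself would live in the workflow definition):

```sh
#!/usr/bin/env bash
# Hypothetical daily link check on main: run the existing make target and
# open an issue with the captured output if it fails.
set -euo pipefail

if ! make markdown-link-check > lychee-report.txt 2>&1; then
  # gh is the GitHub CLI; it picks up the repository from the checkout
  # and authenticates via GH_TOKEN.
  gh issue create \
    --title "Scheduled link check failed ($(date -u +%F))" \
    --body-file lychee-report.txt
fi
```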
I think we have two major classes of issues:
- cross-doc references become invalid because a PR changes something (headings, relative paths, etc.) - this is ~99% of cases
- external links disappear - this is rarely the case; my impression is that it happens several times a year. It'd be fine to run that check as part of release preparation, as long as we check internal links on each PR
The minimally sufficient (per-PR) check should include:
- links updated in the PR are valid. This is easy: we can feed the diff to the link checker (see the sketch after this list)
- links from other docs to the updated docs still work. This is the tricky one.
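For the first bullet, a minimal sketch (the base ref and the Markdown-only filter are assumptions, and this deliberately ignores the inbound-links problem from the second bullet):

```sh
# Check only the Markdown files touched by the PR, relative to the merge base.
# (--no-run-if-empty is GNU xargs, which is fine on the Linux CI runners.)
git diff --name-only --diff-filter=d origin/main...HEAD -- '*.md' \
  | xargs --no-run-if-empty lychee --config .lychee.toml --root-dir "$(pwd)"
```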
But either way, in theory it's possible to run an internal link check without hitting github.com. We just don't have the tooling for it.
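One building block that might get most of the way there: lychee has an offline mode that checks only local files and makes no network requests, so relative cross-doc links and heading fragments could be validated on every PR without ever touching github.com. A sketch (paths illustrative; absolute github.com URLs would still be skipped, so it doesn't fully replace the online check):

```sh
# No network requests at all: validates relative links and #fragment anchors
# inside the repo, so GitHub rate limits never come into play.
lychee --offline --include-fragments --root-dir "$(pwd)" docs model
```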
As you probably know, another solution to the brittleness of checking external links is to use a cache -- this is what we use for the OTel website. Reference caching is much simpler to implement / enable, and it speeds up the link checking significantly.