Run link check on the PR files, not the whole repo

lmolkova opened this issue 6 months ago • 8 comments

It seems we've hit GH rate limits:

[screenshot: CI link-check output failing with GitHub rate-limit errors]

Should we limit link check to the files updated in the PR and avoid checking all of them?
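
A minimal sketch of what that could look like, assuming lychee is installed on the runner and the PR base is origin/main (both assumptions):

```sh
# Check only the Markdown files this PR touches instead of the whole repo.
# --diff-filter=d excludes deleted files; --no-run-if-empty is GNU xargs.
git diff --name-only --diff-filter=d origin/main...HEAD -- '*.md' \
  | xargs --no-run-if-empty lychee --config .lychee.toml
```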

lmolkova avatar May 01 '25 23:05 lmolkova

Listing other alternatives to consider (possibly in addition):

  • Run link check on merge queue
  • https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/13816#issuecomment-2853431560

trask avatar May 06 '25 18:05 trask

I got close with the following config change.

tl;dr: Accept 429s as a "pass". Use a .lycheecache so lychee won't re-check a URL for a given period of time (here configured to 7 days). We're using this on another internal repo, along with GitHub Actions' @actions/cache to cache that .lycheecache file between runs.

```diff
diff --git a/.lychee.toml b/.lychee.toml
index 9dcde621..38974974 100644
--- a/.lychee.toml
+++ b/.lychee.toml
@@ -1,6 +1,22 @@
+verbose = "info"
 include_fragments = true
 
-accept = ["200..=299", "403"]
+# Cache HTTP status of URLs for 7 days. This is nicer for re-runs and helps with
+# common 429s from github.com. It *does* mean that a link that goes away will
+# not be noticed for a week.
+cache=true
+cache_exclude_status = ["429"]
+max_cache_age="7d"
+
+accept = [
+    "200..=299",
+    "403",
+    # Don't let 429s (common from GitHub) break the link check, otherwise
+    # the cache file from `cache=true` above doesn't get saved by
+    # `actions/cache` in GitHub CI.
+    "429"
+]
+
 
 exclude = [
     "^https://www.foo.bar",
@@ -10,15 +26,7 @@ exclude = [
     # TODO (lmolkova) treat timeout as not a failure?
     "^https://www.intersystems.com",
     "^https://gcc.gnu.org",
-    "^https://github.com" # to avoid rate limit errors
-                          # TODO (trask) look at options
 ]
 
-# better to be safe and avoid failures
-max_retries = 6
-
-# better to be safe and avoid failures
-timeout = 60
-
 # insecure is currently needed for https://osi-model.com
 insecure = true
diff --git a/Makefile b/Makefile
index 8d2245e9..d2bdb794 100644
--- a/Makefile
+++ b/Makefile
@@ -129,15 +129,16 @@ normalized-link-check:
 	fi
 
 .PHONY: markdown-link-check
-markdown-link-check: normalized-link-check
+#XXX
+#markdown-link-check: normalized-link-check
+markdown-link-check:
+	touch .lycheecache # with docker run '-u' usage, we need to create outside the container
 	$(DOCKER_RUN) --rm \
-	    $(DOCKER_USER_IS_HOST_USER_ARG) \
+		$(DOCKER_USER_IS_HOST_USER_ARG) \
 		--mount 'type=bind,source=$(PWD),target=/home/repo' $(LYCHEE_GITHUB_TOKEN_ARG) \
 		$(LYCHEE_CONTAINER) \
 		--config home/repo/.lychee.toml \
 		--root-dir /home/repo \
-		--verbose \
-		--timeout=60 \
 		$(MARKDOWN_LINK_CHECK_ARG) \
		home/repo
```
 

I say "close", because I've stopped on these issues:

  • I get Error: Permission denied (os error 13) when lychee attempts to save .lycheecache from inside the container. I believe this comes from the $(DOCKER_USER_IS_HOST_USER_ARG) usage; I don't recall (or know) the reason for using -u ... with docker run. (See the diagnostic sketch below the error output.)
  • I still get intermittent failures, which prevent saving the very cache file that is meant to help, e.g.:
```
[home/repo/docs/database/cassandra.md]:
[ERROR] https://img.shields.io/badge/-development-blue | Network error: error sending request for url (https://img.shields.io/badge/-development-blue) Maybe a certificate error?
[ERROR] https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlConfigConsistency.html | Network error: error sending request for url (https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlConfigConsistency.html) Maybe a certificate error?
[ERROR] https://cassandra.apache.org/ | Network error: error sending request for url (https://cassandra.apache.org/) Maybe a certificate error?
[ERROR] https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/ | Network error: error sending request for url (https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) Maybe a certificate error?
```
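
A small diagnostic sketch for that permission error (all standard commands; nothing here is specific to lychee): compare who owns the cache file on the host with the IDs being mapped into the container.

```sh
# Compare the host-side owner of the cache file with the IDs presumably
# passed to `docker run -u`.
ls -ln .lycheecache   # numeric UID/GID of the file's owner on the host
id -u; id -g          # the host user/group mapped into the container
# If they differ (e.g. the file was created by an earlier root-run container),
# recreating the file as the current user may help:
rm -f .lycheecache && touch .lycheecache
```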

Aside:

  • You can ignore the "XXX" change that skips the normalized-link-check target. That target fails on macOS, at least for me, because the model/[a-z]* shell pattern under /bin/sh matches "model/README.md" for some reason I don't understand. With macOS grep one could instead use --exclude model/README.md, but I couldn't get that to work with the grep in a ubuntu:latest container, so I gave up on that.
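
One unverified guess at the cause: in some shells and locales a bracket range like [a-z] is collation-based and can also match uppercase names, whereas the POSIX class [[:lower:]] cannot. A quick comparison in the shell the Makefile uses:

```sh
# If the range glob is the culprit, the first echo will list model/README.md
# and the second will not.
/bin/sh -c 'echo model/[a-z]*'
/bin/sh -c 'echo model/[[:lower:]]*'
```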

trentm avatar May 07 '25 19:05 trentm

do we know if 429 indicates the target exists? (i.e. does github return 404 when it doesn't exist, even if you are over the rate limit?)
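
One way to probe this empirically (a sketch; the repository name is made up):

```sh
# Probe a URL that should 404. If GitHub answers 429 even for missing pages
# once you are rate limited, a 429 tells you nothing about whether the
# target exists.
curl -s -o /dev/null -w '%{http_code}\n' \
    https://github.com/open-telemetry/this-repo-does-not-exist
```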

cc @austinlparker who did something similar in https://github.com/open-telemetry/community/pull/2720

trask avatar May 08 '25 14:05 trask

I'd suspect that the rate limit check is before the "does the file exist" check but /shrug

austinlparker avatar May 08 '25 14:05 austinlparker

I don't think there's really a great way to fix this; I suspect GitHub is more aggressively rate limiting all HTTP requests in response to AI bots.

austinlparker avatar May 08 '25 14:05 austinlparker

To me it seems like the link problems are mostly caused by old links becoming out of date as they age, not new links being added. I think something like periodically checking links on main (maybe 1/day/repo) is likely good enough? Seems like it shouldn't be too tough to run a link checker on a cron job that opens an issue if it fails. Even if a new bad link is added, it would simply delay its discovery by some number of hours. Not really the end of the world.
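
A sketch of the core of such a scheduled job, assuming an authenticated gh CLI; the issue title and body are illustrative:

```sh
# Run the full link check against main once a day; open a tracking issue on failure.
lychee --config .lychee.toml --root-dir "$PWD" . \
    || gh issue create \
        --title "Scheduled link check failed" \
        --body "The daily link check found broken links; see the workflow run logs."
```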

dyladan avatar May 12 '25 15:05 dyladan

I think we have two major classes of issues:

  • cross-doc references become invalid because a PR changes something (headings, relative paths, etc.) - this is ~99% of cases
  • external links disappear - this is rare; my impression is that it happens several times a year. It'd be fine to run that check as part of release preparation, as long as we check internal links on each PR

lmolkova avatar May 12 '25 16:05 lmolkova

the minimally sufficient (per-PR) check should include:

  • links updated in the PR are valid. This is easy - we can feed the diff to the link checker
  • links from other docs to the updated docs still work. This is the tricky one.

But either way, in theory it's possible to run the internal link check without hitting github.com. We just don't have the tooling for it.
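
A minimal sketch of that internal-only check, assuming lychee's --offline mode fits the bill and that '**/*.md' selects the right files:

```sh
# Check all cross-doc references without hitting github.com: --offline
# restricts lychee to local files, and --include-fragments still validates
# heading anchors within them.
lychee --offline --include-fragments --root-dir "$PWD" '**/*.md'
```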

lmolkova avatar May 12 '25 16:05 lmolkova

As you probably know, another solution to the brittleness of checking external links is to use a cache -- this is what we use for the OTel website. Reference caching is much simpler to implement / enable, and it speeds up the link checking significantly.

chalin avatar Jun 01 '25 12:06 chalin