nightlies and link validation failing because of repository.apache.org blockage
Our nightlies and link validation sometimes fail when they run on a GitHub Actions runner that is blocked from repository.apache.org.
Infra seems open to creating per-project buckets for the abuse thresholds, but we'd have to add a header to the requests to identify ourselves.
Looks like this would depend on https://github.com/coursier/coursier/issues/1203
I wonder if we could try ordering the resolvers in sbt.
I've seen failures loading third-party jars because our sbt setup seems to check repository.apache.org before Maven Central. Ideally, repository.apache.org should be checked last.
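A minimal sketch of what that could look like in `build.sbt` — the snapshots repository name/URL and the ordering below are my assumption of what we'd want, not our actual setup:

```scala
// Hypothetical build.sbt sketch: list Maven Central explicitly before the
// Apache snapshots repository, so repository.apache.org is only consulted
// when an artifact is not found on Central.
ThisBuild / resolvers := Seq(
  Resolver.DefaultMavenRepository, // Maven Central first
  "Apache Snapshots".at("https://repository.apache.org/content/repositories/snapshots")
)
```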
https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests. We could make use of this if we come up with a standard format for denoting ASF projects. That would let Infra both be more lenient with the rules in these cases and debug which projects or builds are causing issues.
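For Maven-based builds, that blog post's approach would look roughly like the `settings.xml` fragment below — the server id and the User-Agent value format are made up for illustration:

```xml
<!-- Hypothetical settings.xml fragment: send an identifying User-Agent on
     requests to the repository configured under the server id "apache.snapshots" -->
<settings>
  <servers>
    <server>
      <id>apache.snapshots</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>User-Agent</name>
            <value>ASF-CI/apache-pekko (GitHub Actions)</value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>
```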
> I wonder if we could try ordering the resolvers in sbt.
>
> I've seen failures loading third-party jars because our sbt setup seems to check repository.apache.org before Maven Central. Ideally, repository.apache.org should be checked last.
I agree that would be a good thing to keep an eye on. 'Normal' CI builds shouldn't reference repository.a.o at all, though, right? And even when including repository.apache.org, I think sbt should use Maven Central first regardless of what additional things we put into resolvers (e.g. https://github.com/sbt/sbt/issues/1138)
> https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests
Yes (or arbitrary other headers). Pekko uses sbt instead of mvn to access the Maven repository, though, so that'd need a separate change.
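One low-effort option for the sbt side (a sketch, not something we've tried): I believe sbt/coursier fetch over `java.net.HttpURLConnection` by default, so the standard JVM `http.agent` system property would prefix the User-Agent on those requests. The value format below is a made-up convention:

```
# Hypothetical .jvmopts entry: http.agent is the standard JVM property that
# prefixes the User-Agent header sent by java.net.HttpURLConnection
-Dhttp.agent=ASF-CI/apache-pekko
```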
@raboof one source of strain that we put on repository.apache.org is from https://github.com/pjfanning/sbt-pekko-build
This has logic to find the latest snapshot versions by scraping pages served by repository.apache.org.
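If the scraping is there to find the newest snapshot, a lighter-weight alternative (a sketch, not what sbt-pekko-build actually does) would be to read the artifact's `maven-metadata.xml`, which snapshot repositories publish alongside the jars. The version strings below are made up:

```java
// Hypothetical sketch: resolve the latest snapshot version from
// maven-metadata.xml instead of scraping HTML directory listings.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class LatestSnapshot {
    // Example metadata with made-up version numbers
    static final String EXAMPLE =
        "<metadata>"
      + "  <groupId>org.apache.pekko</groupId>"
      + "  <artifactId>pekko-actor_2.13</artifactId>"
      + "  <versioning>"
      + "    <latest>1.1.0+10-abc123-SNAPSHOT</latest>"
      + "    <versions>"
      + "      <version>1.1.0+9-def456-SNAPSHOT</version>"
      + "      <version>1.1.0+10-abc123-SNAPSHOT</version>"
      + "    </versions>"
      + "  </versioning>"
      + "</metadata>";

    // Return the <latest> entry, falling back to the last listed <version>
    static String latest(String metadataXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(metadataXml.getBytes("UTF-8")));
        NodeList latest = doc.getElementsByTagName("latest");
        if (latest.getLength() > 0) return latest.item(0).getTextContent();
        NodeList versions = doc.getElementsByTagName("version");
        return versions.item(versions.getLength() - 1).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(latest(EXAMPLE)); // prints 1.1.0+10-abc123-SNAPSHOT
    }
}
```

That's one well-formed GET per artifact instead of a page scrape, which should also mean far fewer 404s.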
We haven't seen GitHub Actions runners get blocked by the "too many 404's on repository.apache.org" rule since https://github.com/apache/ranger/pull/435 was merged. I have now (ack'ed by Infra) removed all those bans.
That should help, but GitHub Actions runners are still being banned for Bugzilla scraping (> 800req/hr to show_bug.cgi). I guess we should look into whether those are 'real' scrapers or some misconfigured job somewhere as well.
looks like this might be bingbot, filed https://issues.apache.org/jira/browse/INFRA-26405 to get a robots.txt in place
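A minimal robots.txt along these lines would cover it — the exact paths (and whether show_bug.cgi sits at the site root) would be for Infra to decide:

```
# Hypothetical robots.txt: keep well-behaved crawlers (incl. bingbot) away
# from the expensive Bugzilla CGI endpoints
User-agent: *
Disallow: /show_bug.cgi
Disallow: /buglist.cgi
```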
(I didn't keep numbers to tell if the lifting of the '404' blocks helped, but at least they haven't reappeared yet. We're still affected by the other blocks, e.g. https://github.com/apache/pekko-persistence-jdbc/actions/runs/12663831987/job/35291029035 )