pekko nightlies and link validation failing because of repository.apache.org blockage

Our nightlies and link validation sometimes fail when they run on a GitHub Actions runner that is blocked from repository.apache.org.

Infra seems open to creating per-project buckets for the abuse thresholds, but we'd have to add a header to the requests to identify ourselves.

Looks like this would depend on https://github.com/coursier/coursier/issues/1203

Dec 16 '24 11:12 raboof

I wonder if we could try ordering the resolvers in sbt.

I've seen failures where we get issues loading 3rd party jars because our sbt setup seems to check repository.apache.org before checking maven central. Ideally, repository.apache.org should be checked last.
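
For illustration, a minimal sketch of what pinning the resolver order in sbt could look like. The resolver name and snapshots URL are assumptions for the example and are not taken from Pekko's actual build. As far as I know, `externalResolvers` replaces the whole list sbt would otherwise compute, so this also drops the local Ivy repository unless it is added back.

```scala
// build.sbt (sketch, not Pekko's actual configuration)

// Assumed name and URL for the ASF snapshots repository.
val apacheSnapshots =
  "apache-snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

// Overrides the computed resolver list (sbt defaults plus `resolvers`)
// with an explicit one, listed in the order we want them consulted.
externalResolvers := Seq(
  Resolver.DefaultMavenRepository, // Maven Central first
  apacheSnapshots                  // repository.apache.org last
)
```

In a multi-project build this setting would need to be applied to each project (for example via a shared settings sequence), since `externalResolvers` is defined per project by default.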

Dec 16 '24 11:12 pjfanning

https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests. We could make use of this if we come up with a standard format for denoting ASF projects. This would allow us both to tailor rules to be more lenient in these cases and to debug which projects or builds are causing issues.

Dec 16 '24 11:12 Humbedooh

> I wonder if we could try ordering the resolvers in sbt.
>
> I've seen failures where we get issues loading 3rd party jars because our sbt setup seems to check repository.apache.org before checking maven central. Ideally, repository.apache.org should be checked last.

I agree that would be a good thing to keep an eye on. 'Normal' CI builds shouldn't reference repository.a.o at all, though, right? And even when including repository.apache.org, I think sbt should use Maven Central first regardless of what additional things we put into resolvers (e.g. https://github.com/sbt/sbt/issues/1138)

> https://brettporter.wordpress.com/2009/06/16/configuring-maven-http-connections/ suggests you can set a custom user agent header for the requests

Yes (or arbitrary other headers). Pekko uses sbt instead of mvn to access the Maven repository, though, so that'd need a separate change.

Dec 16 '24 13:12 raboof

@raboof one source of strain that we put on repository.apache.org is from https://github.com/pjfanning/sbt-pekko-build

This has logic to find the latest snapshot versions by scraping pages served by repository.apache.org.

Dec 31 '24 10:12 pjfanning

We haven't seen GitHub Actions runners get blocked by the "too many 404's on repository.apache.org" rule anymore since https://github.com/apache/ranger/pull/435 was merged. I have now removed all those bans (ack'ed by infra).

That should help, but GitHub Actions runners are still being banned for Bugzilla scraping (> 800 req/hr to show_bug.cgi). I guess we should also look into whether those are 'real' scrapers or some misconfigured job somewhere.

Jan 03 '25 10:01 raboof

Looks like this might be bingbot. I filed https://issues.apache.org/jira/browse/INFRA-26405 to get a robots.txt in place.

Jan 03 '25 10:01 raboof

(I didn't keep numbers to tell if the lifting of the '404' blocks helped, but at least they haven't reappeared yet. We're still affected by the other blocks, e.g. https://github.com/apache/pekko-persistence-jdbc/actions/runs/12663831987/job/35291029035 )

Jan 08 '25 08:01 raboof