
More strategies for (org-level) throttling of PRs

Open armanbilge opened this issue 2 years ago • 14 comments

For example, scalafmt v2.7.0 was released today. When the Typelevel Steward ran, essentially every typelevel-org repo got an update PR. Whenever this happens, CI easily gets backlogged for hours, since we have an org-wide limit of 20 concurrent runners.

So we are looking for strategies to pace the rate at which Steward makes PRs, ideally so that there are always some CI runners free for human PRs, releases, and anything else that may be happening concurrently.

One idea we had is to be able to limit the number of PRs that Steward can make during a run. Actually that's what @valencik and I thought the updates.limit config would do in https://github.com/typelevel/steward/pull/18, where we hoped to configure Steward to make no more than 1 PR every 20 minutes. We only realized after the fact that the limit applies per repo, not to the entire run.

Edit: while writing this, http4s org just got flooded with the scalafmt update as well 🙂

armanbilge avatar Jan 19 '23 01:01 armanbilge

One possibility would be to add a rate limiter that ensures that a fixed duration elapses between two consecutively created PRs. If Scala Steward is too fast, it could simply sleep before creating the next PR. This rate limiter could also be used to deal with GitHub's secondary rate limit (#2355). The duration would be a config option that applies to the entire run.
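
A minimal sketch of what I have in mind, using cats-effect (the class and option names are made up; this is not existing Scala Steward code):

```scala
import cats.effect.{IO, Ref}
import scala.concurrent.duration._

// Global limiter: enforces a minimum gap between two consecutive PR creations
// by sleeping for the remaining time. `minGap` would come from a new config
// option that applies to the entire run.
final class PrRateLimiter(minGap: FiniteDuration, lastCreation: Ref[IO, Option[FiniteDuration]]) {
  def throttle[A](createPr: IO[A]): IO[A] =
    for {
      now   <- IO.realTime
      last  <- lastCreation.get
      pause  = last.fold(Duration.Zero)(prev => (minGap - (now - prev)).max(Duration.Zero))
      _     <- IO.sleep(pause) // wait until the gap since the last PR has elapsed
      res   <- createPr
      done  <- IO.realTime
      _     <- lastCreation.set(Some(done))
    } yield res
}

object PrRateLimiter {
  def create(minGap: FiniteDuration): IO[PrRateLimiter] =
    Ref.of[IO, Option[FiniteDuration]](None).map(new PrRateLimiter(minGap, _))
}
```

Every call that creates a PR would then be wrapped in throttle.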

Would this work for Typelevel Steward?

fthomas avatar Jan 19 '23 15:01 fthomas

Thanks!

it could just sleep and wait before creating the next PR

Essentially an IO.sleep(...)? My concern is that this means the process would still be running, which means we are down one CI runner during that time. We'd also have to be careful about GitHub runners' 6-hour job limit, so the process doesn't get killed and can shut down gracefully.

armanbilge avatar Jan 19 '23 15:01 armanbilge

That is a good argument against IO.sleep. The limiter could also just skip PR creation and proceed as normal until enough time has elapsed since the last PR creation. If the duration is set to a value that is longer than a typical Steward run, this would effectively be a global updates.limit of 1.
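
Roughly like this (again just a sketch with made-up names; the last-creation timestamp would have to be persisted, e.g. in the workspace, so it survives across runs):

```scala
import cats.effect.IO
import scala.concurrent.duration._

// Skip variant: instead of sleeping, do nothing if the configured duration has
// not yet elapsed since the last created PR. The skipped update is simply
// picked up again on one of the next runs.
def maybeCreatePr(
    lastPrCreatedAt: Option[FiniteDuration], // read from a persistent store
    minGap: FiniteDuration,
    createPr: IO[Unit]
): IO[Unit] =
  IO.realTime.flatMap { now =>
    val enoughTimePassed = lastPrCreatedAt.forall(prev => now - prev >= minGap)
    if (enoughTimePassed) createPr // and persist `now` as the new timestamp
    else IO.unit
  }
```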

This rate limiter could still be used for #2355.

fthomas avatar Jan 19 '23 15:01 fthomas

Bonus points if each repository can indicate in its config how many resources a PR consumes, so that the limiter can be dynamic, based on where the updates are going.

armanbilge avatar Jan 19 '23 16:01 armanbilge

The creation of the PRs is not the problem (apart from the secondary rate limit); it is that the CI jobs are using up all available resources. I was browsing the workflow docs to see if there is a way to place all the Steward PRs onto a dedicated queue that has either a very low priority or only a limited number of runners. But sadly this doesn't seem to be possible.

mzuehlke avatar Jan 19 '23 16:01 mzuehlke

Bonus points if each repository can indicate in its config how many resources a PR consumes, so that the limiter can be dynamic, based on where the updates are going.

Sounds good to me. Do you have something specific in mind? Maybe a number between 0.0 and 10.0 that is multiplied by the initial duration to increase or decrease the time until the next PR can be created?

fthomas avatar Jan 19 '23 19:01 fthomas

Good question. One simple way would be for each repo to indicate how long its CI takes to run, and that's how long Steward should wait.


A more complex approach would be for each repo to indicate the total resources it consumes, e.g. CI minutes × runners used (or, more precisely, the total sum of CI minutes used, since some cells in the CI matrix may run quicker).

Then, Steward gets a budget of how many CI runners it's allowed to use.

So when it opens a PR, it calculates (repo CI consumption / runner budget) and that tells it how long to wait until opening the next PR.
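
To make the arithmetic concrete (numbers made up): if a PR in a given repo consumes about 120 CI-minutes in total and Steward is budgeted 4 of the org's 20 runners, it would wait 120 / 4 = 30 minutes before opening the next PR. As a sketch:

```scala
import scala.concurrent.duration._

// `repoCiMinutesPerPr` would come from the repo's config, `runnerBudget` from
// Steward's own config; both names are hypothetical.
def delayUntilNextPr(repoCiMinutesPerPr: Double, runnerBudget: Int): FiniteDuration =
  (repoCiMinutesPerPr / runnerBudget).minutes

val example = delayUntilNextPr(120.0, 4) // 30 minutes
```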

Maybe a number between 0.0 and 10.0 that is multiplied with the initial duration

Maybe this is another way of phrasing what I wrote :)

armanbilge avatar Jan 19 '23 20:01 armanbilge

You could also use the new grouping feature to ensure only 1 PR is created per repository, thus decreasing the number of jobs.
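
For reference, a repo-level config that groups everything into a single PR would look roughly like this (check the repo-specific configuration docs for the exact, current syntax):

```
// .scala-steward.conf
pullRequests.grouping = [
  { name = "all", "title" = "Dependency updates", "filter" = [{"group" = "*"}] }
]
```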

alejandrohdezma avatar Jan 20 '23 09:01 alejandrohdezma

You can also use a matrix to update each repository in its own job (each repository will have its own Scala Steward cache). I do it here for my own repositories and it significantly decreased the amount of time a job takes to complete.

alejandrohdezma avatar Jan 20 '23 09:01 alejandrohdezma

You can also use a matrix to update each repository in its own job (each repository will have its own Scala Steward cache). I do it here for my own repositories and it significantly decreased the amount of time a job takes to complete.

As that runs all workers in parallel, the speed-up will indeed be quite significant. The downside is that it doesn't help to work around any rate limits:

The rate limits described above apply to the entire REST API and are per-user or per-app

source

Seems scala-steward already knows about the secondary limits as I'm hitting this exception:

org.scalasteward.core.forge.github.GitHubException$SecondaryRateLimitExceeded

So one way could be: if that exception is thrown, just wait until the time given by x-ratelimit-reset.

Additionally, if retry-after is found in the response, scala-steward should maybe blindly follow that as well?

From that same docs page:

If the retry-after response header is present, you should not retry your request until after that many seconds has elapsed. Otherwise, you should not retry your request until the time, in UTC epoch seconds, specified by the x-ratelimit-reset header.
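
Something along these lines, just as a sketch (not Scala Steward's actual code; extracting the header values from the failed response is left out):

```scala
import cats.effect.IO
import scala.concurrent.duration._

// Wait strategy after a secondary-rate-limit error: honour retry-after if it
// is present, otherwise sleep until the epoch second given by x-ratelimit-reset.
def backoff(retryAfterSeconds: Option[Long], rateLimitResetEpochSeconds: Option[Long]): IO[Unit] =
  retryAfterSeconds match {
    case Some(secs) => IO.sleep(secs.seconds)
    case None =>
      IO.realTime.flatMap { now =>
        val resetIn = rateLimitResetEpochSeconds
          .fold(Duration.Zero)(reset => (reset.seconds - now).max(Duration.Zero))
        IO.sleep(resetIn)
      }
  }
```

The request that triggered SecondaryRateLimitExceeded could then simply be retried once backoff completes.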

harmw avatar Feb 28 '23 12:02 harmw

So one way could be: if that exception is thrown, just wait until the time given by x-ratelimit-reset.

I like this idea, yeah.

alejandrohdezma avatar Feb 28 '23 12:02 alejandrohdezma

So one way could be: if that exception is thrown, just wait until the time given by x-ratelimit-reset.

I think x-ratelimit-reset only applies to the primary rate limit. At least there is nothing in the docs that tells us that it also applies to the secondary rate limit.

Additionally, if retry-after is found in the response scala-steward should maybe blindly follow that as well?

I think it already does that, but Retry-After is not present if the secondary rate limit is exceeded; see the headers in https://github.com/scala-steward-org/scala-steward/issues/2355#issue-1053725114.

fthomas avatar Feb 28 '23 12:02 fthomas

At least there is nothing in the docs that tell us that it also applies to the secondary rate limit.

I'm wrong. It is mentioned in https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits

I think that wasn't the case two years ago. :-)

fthomas avatar Feb 28 '23 12:02 fthomas

The note wasn't there in the past.

mzuehlke avatar Feb 28 '23 15:02 mzuehlke