diffkemp

CI problems (429 error, cache size)

Open PLukas2018 opened this issue 1 year ago • 1 comments

So lately there have been problems with CI:

  1. 429 error when restoring kernels. This is caused by too many requests to the GitHub API (e.g. the cache API). According to this, there is a limit on API calls:

    API requests - You can execute up to 1,000 requests to the GitHub API in an hour across all actions within a repository. If requests are exceeded, additional API calls will fail which might cause jobs to fail.

    which it looks like we sometimes hit.

  2. I'm not sure whether it is connected, but there is also a limit on cache size:

    the total size of all caches in a repository is limited to 10 GB

    The problem connected with this is that the cache entries are not shared between branches.

    ... Workflow runs can restore caches created in either the current branch or the default branch (usually main). ...

    So in our case, if the kernels (which take 3.4 GB) are not cached from master, they are cached separately for each PR branch, which probably exceeds the cache limit quickly when there are several active PRs.

  3. Maybe there is a different problem. I think the current problems could also be caused by the update of flake.nix: some branches use the old dependencies and some the new ones, which fills up the cache quickly.
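
To make the size concern in point 2 concrete, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted above (the 10 GB repository limit and roughly 3.4 GB for the kernels) and assumes, as a simplification, that each branch stores exactly one full copy of the kernels and that nothing else is cached. The function name is made up for illustration.

```python
import math

CACHE_LIMIT_GB = 10.0   # repository-wide cache limit quoted above
KERNEL_CACHE_GB = 3.4   # approximate size of the cached kernels


def max_full_kernel_caches(limit_gb: float = CACHE_LIMIT_GB,
                           per_branch_gb: float = KERNEL_CACHE_GB) -> int:
    """Number of complete kernel caches that fit under the limit."""
    return math.floor(limit_gb / per_branch_gb)


print(max_full_kernel_caches())  # 2
```

Under these assumptions only two branches can hold a full copy at the same time, so a third branch caching the kernels separately would already push the repository over the limit.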

PLukas2018 avatar May 17 '24 10:05 PLukas2018

Thanks for the nice analysis!

So lately there have been problems with CI:

  1. 429 error when restoring kernels. This is caused by too many requests to the GitHub API (e.g. the cache API). According to this, there is a limit on API calls:

    API requests - You can execute up to 1,000 requests to the GitHub API in an hour across all actions within a repository. If requests are exceeded, additional API calls will fail which might cause jobs to fail.

    which it looks like we sometimes hit.

Interesting, I didn't know about this limit, it looks like it should be related to the caches.

  2. Not sure if it could be connected, but there is also a limit for cache:

    the total size of all caches in a repository is limited to 10 GB

    The problem connected with this is that the cache entries are not shared between branches.

    ... Workflow runs can restore caches created in either the current branch or the default branch (usually main). ...

IIUC, caches for different PRs should be shared b/c they are taken from the target branch (i.e. master). Since most people develop in their own branches on their forks (including me now), branch workflows should run in forks (and therefore have separate limits).
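
The scoping rule being discussed can be written down as a small predicate. This is only a sketch: it encodes the quoted rule (current branch or default branch) plus the assumption stated in this comment that a PR run can also use caches from its target branch. The function and parameter names are hypothetical, and cache keys, tags, and forks are ignored.

```python
from typing import Optional


def can_restore(cache_ref: str, run_ref: str, default_ref: str = "master",
                pr_base_ref: Optional[str] = None) -> bool:
    """Return True if a run on run_ref may restore a cache created on cache_ref."""
    if cache_ref == run_ref:
        return True   # cache created on the same branch
    if cache_ref == default_ref:
        return True   # cache created on the default branch
    if pr_base_ref is not None and cache_ref == pr_base_ref:
        return True   # assumed: PR runs may use the target branch's cache
    return False


print(can_restore("master", "feature-a"))     # True
print(can_restore("feature-b", "feature-a"))  # False
```

The second call shows the case from the issue: a cache created on one feature branch is not visible to another, so the kernels only get shared if they were cached from master first.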

    So in our case, if the kernels (which take 3.4 GB) are not cached from master, they are cached separately for each PR branch, which probably exceeds the cache limit quickly when there are several active PRs.

Still, 3.4 GB is a lot, so the cache limit can be hit quickly (caches are valid for 30 days). So, I'm going to clean up all the caches (running now) and let's see if the problems disappear.
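
Before or after such a cleanup, it can help to see which branches hold the most cache space. The sketch below sums cache sizes per ref from a list of entries. The `ref` and `size_in_bytes` field names are an assumption modeled on what the GitHub cache listing is expected to return, the sample entries are invented, and treating 10 GB as 10 * 1024**3 bytes is also an assumption.

```python
from collections import defaultdict

LIMIT_BYTES = 10 * 1024**3           # assumed byte value of the 10 GB limit
KERNELS_BYTES = int(3.4 * 1024**3)   # roughly 3.4 GB of cached kernels

# Hypothetical cache entries, one kernel cache per branch.
SAMPLE = [
    {"ref": "refs/heads/master", "size_in_bytes": KERNELS_BYTES},
    {"ref": "refs/heads/feature-a", "size_in_bytes": KERNELS_BYTES},
    {"ref": "refs/heads/feature-b", "size_in_bytes": KERNELS_BYTES},
]


def usage_by_ref(entries):
    """Sum cache sizes (in bytes) for each ref."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry["ref"]] += entry["size_in_bytes"]
    return dict(totals)


def over_limit(entries, limit_bytes=LIMIT_BYTES):
    """True if the combined size of all entries exceeds the limit."""
    return sum(e["size_in_bytes"] for e in entries) > limit_bytes


print(usage_by_ref(SAMPLE))
print(over_limit(SAMPLE))      # True: three full copies exceed the limit
print(over_limit(SAMPLE[:2]))  # False: two copies still fit
```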

  3. Maybe there is a different problem. I think the current problems could also be caused by the update of flake.nix: some branches use the old dependencies and some the new ones, which fills up the cache quickly.

All in all, I think that it would be good if I progressed with my refactoring of tests which should significantly decrease the amount of cached files.

viktormalik avatar May 17 '24 15:05 viktormalik

This was solved by #351. Currently, it looks like there is no problem with CI (caching, restoring kernels).

PLukas2018 avatar Aug 18 '25 19:08 PLukas2018