if eve CI job fails to get cache hit, request rerun of all jobs

Open deitch opened this issue 3 years ago • 5 comments

Our CI build.yaml has 2 jobs:

  1. packages
  2. eve

packages builds once on each architecture, while eve builds once for each target (matrix of architecture and hypervisor). The output of packages is cached using GitHub Actions cache, and then loaded into eve.

GHA caches have a maximum total size, with FIFO eviction: as more caches are added, the oldest one is removed. That means that if we wait long enough between running packages and running eve, the cached packages are evicted, eve gets a cache miss, and the job fails.

Normally this does not happen, since eve jobs get run immediately after packages. But since we can rerun just one job a day or two later, we can get a silent cache miss. This will cause later steps to fail for reasons that are unclear. It looks like a linuxkit cache or docker problem, but it really is a GitHub Actions cache problem.

This PR adds an explicit check on the eve job GHA cache restore. If we get a cache miss, the job fails immediately with an error message saying to rerun all jobs.
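
As a rough illustration of that kind of check (a sketch, not the actual change in this PR): the `actions/cache` action exposes a `cache-hit` output, and a follow-up step can fail the job when it is not `'true'`. The step names, the step `id`, the cache path, and the cache key below are hypothetical placeholders, not the values used in build.yaml.

```yaml
# Sketch only: step names, id, path, and key are hypothetical.
jobs:
  eve:
    runs-on: ubuntu-latest
    steps:
      - name: Restore packages cache
        id: packages-cache                 # hypothetical step id
        uses: actions/cache@v3
        with:
          path: ~/.linuxkit/cache          # hypothetical cache path
          key: packages-${{ github.sha }}  # hypothetical cache key

      # cache-hit is 'true' only on an exact key match, so anything
      # else is treated as a miss and the job stops here.
      - name: Fail on cache miss
        if: steps.packages-cache.outputs.cache-hit != 'true'
        run: |
          echo "::error::Cache miss for packages; rerun all jobs, not just this one."
          exit 1
```

Failing at this point surfaces the real cause in the job log, instead of a confusing linuxkit or docker error in a later step.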

Note that we could use artifacts instead of cache to make these more persistent, but that goes against the grain, and would end up publishing them where we do not want them. Cache is the right thing to use; we just need to catch cache misses.

deitch avatar Oct 13 '22 09:10 deitch

Can we update the publish workflow as well? Also, we have two attempts to pull the cache here; we should check both of them.

giggsoff avatar Oct 13 '22 09:10 giggsoff

> Can we update publish workflow as well?

Where do we use actions/cache in the publish workflow?

deitch avatar Oct 13 '22 09:10 deitch

> > Can we update publish workflow as well?
>
> Where do we use actions/cache in publish workflow?

Ah, sorry, my mistake. I just remembered that your changes were about the publish workflow: https://github.com/lf-edge/eve/pull/2825

giggsoff avatar Oct 13 '22 10:10 giggsoff

Strange that unit tests failed, but that should not affect us

deitch avatar Oct 13 '22 10:10 deitch

Build process is happy now, minus that spurious unit test failure

deitch avatar Oct 13 '22 10:10 deitch