eve if eve CI job fails to get cache hit, request rerun of all jobs

Our CI build.yaml has 2 jobs:

packages
eve

packages builds once on each architecture, while eve builds once for each target (matrix of architecture and hypervisor). The output of packages is cached using GitHub Actions cache, and then loaded into eve.

GHA caches have a maximum total size; once it is exceeded, the oldest entries are evicted first. That means that if we wait long enough between running packages and running eve, the cached packages may already have been evicted, eve gets a cache miss, and the job fails.

Normally this does not happen, since eve jobs get run immediately after packages. But since we can rerun just one job a day or two later, we can get a silent cache miss. This will cause later steps to fail for reasons that are unclear. It looks like a linuxkit cache or docker problem, but it really is a GitHub Actions cache problem.

This PR adds an explicit check on the GHA cache restore in the eve job. If we get a cache miss, the job fails immediately with an error message asking to rerun all jobs.
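As a rough illustration of that check, a minimal sketch follows. It is not the exact change in this PR: the step id, cache path, and key format are hypothetical, and it assumes the eve job depends on packages. It relies only on the `cache-hit` output of actions/cache, which is 'true' only when the exact key was found.

```yaml
# Sketch only: step id, path, and key format are hypothetical.
jobs:
  eve:
    needs: packages                 # assumed dependency on the packages job
    runs-on: ubuntu-latest
    steps:
      - name: Restore packages cache
        id: cache_packages          # hypothetical step id
        uses: actions/cache@v3
        with:
          path: ~/.linuxkit/cache   # assumed cache location
          key: linuxkit-amd64-${{ github.sha }}   # assumed key format
      - name: Fail on cache miss
        # cache-hit is 'true' only on an exact key match
        if: steps.cache_packages.outputs.cache-hit != 'true'
        run: |
          echo "::error::Cache miss for packages; rerun all jobs, not only this one."
          exit 1
```

Comparing against 'true' rather than testing for 'false' also covers the case where the output is empty because no cache entry was found at all.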

Note that we could use artifacts instead of cache to make these more persistent, but that goes against the grain, and will end up publishing them where we do not want to. cache is the right thing to use; we just need to capture cache misses.

Oct 13 '22 09:10 deitch

Can we update the publish workflow as well? Also, we have two attempts to pull the cache here; we should check both of them.

Oct 13 '22 09:10 giggsoff

> Can we update publish workflow as well?

Where do we use actions/cache in publish workflow?

Oct 13 '22 09:10 deitch

> > Can we update publish workflow as well?

> Where do we use actions/cache in publish workflow?

Ah, sorry, my mistake. I just remembered that your changes were about the publish workflow: https://github.com/lf-edge/eve/pull/2825

Oct 13 '22 10:10 giggsoff

Strange that unit tests failed, but that should not affect us

Oct 13 '22 10:10 deitch

Build process is happy now, minus that spurious unit test failure

Oct 13 '22 10:10 deitch