cache icon indicating copy to clipboard operation
cache copied to clipboard

Cache miss because of different compression.

Open kindaro opened this issue 2 years ago • 4 comments

I have a deterministic failure: a cache miss across two workflows. It has taken me a whole day to find the fault behind the failure, and still I have no way to help it.

the setting

I have set up a repository specifically to work on this issue. It will be our running example. https://github.com/kindaro/release The logs be my witness.

the task

The task is to write a workflow release.yml that makes a draft GitHub release, builds some executables and attaches them to said release. This workflow waits for another workflow haskell-ci.yml to complete. release.yml should pick up the cache made by haskell-ci.yml whenever it is there. It is known to be there when the target operating system is Linux.

the failure

The failure is that the cache made by haskell-ci.yml is not seen by release.yml.

the hunt

I have undertaken the following steps to find the fault behind the failure:

  1. Check if the cache is not seen across workflows because the action is triggered by a tag.

    I made the action run on master. However, the failure still showed up.

  2. Check that the key is matching.

    To this end, I replaced the key in both workflows by the same string without variables. However, the failure still showed up.

  3. Look at the caches through GitHub API.

    This is the URL I looked at. https://api.github.com/repos/kindaro/release/actions/caches I spotted that there are caches with the same key but different versions. This is an error. However, this error may have any of a few roots: wrong paths, wrong compression, wrong version of the action.

  4. Check if the version of the action is the same for both workflows.

    I spotted that it is different! I set it to be the same. However, the failure still showed up.

  5. Reverse engineer the version values.

    The version values are hashes, defined in an upstream library. With some trial, I found how the version values of my caches are made:

    > [["gzip"], ["zstd"], ["zstd-without-long"], [ ]].map((compression) => {return (require('crypto').createHash('sha256').update(["~/.cabal/store"].concat(compression).concat(["1.0"]).join('|')).digest('hex'))})
    [
      '97150208f15627752f4bcfa20bf9811d3688b5b274ffe014984351184e875a74',
      'bb4f75bb6ca5843bf5c49253ee4d5d67796506fa9441ca6cffe69d7960a2bcd4',
      '27a066cbe2873e20fde52127b6017ab7615ab4b954a020235b8fe7737035cbca',
      '1cf722fcd6cb72f17b62fd7954c4ae9118802bdb10b3ef077b4ec06d1f66bfbe'
    ]
    

    You may see that the latter two hashes match the version values of my caches. So, the error is that one workflow does compression with gzip and another with zstd. Why could that happen? The workflows must be running on different operating systems.

  6. Find out if the workflows are running on the same operating system.

    I checked my workflows and it turned out that:

    • haskell-ci.yml specifies ubuntu-18.04
    • release.yml specifies ubuntu-latest

    I set both to run on ubuntu-18.04. However, the failure still showed up.

  7. Look at all the logs again and see if the compression correlates with the name of the workflow.

    Truly it turns out that release.yml always does zstd but haskell-ci.yml always does gzip.

  8. Switch on debug output.

    Debug output told me that zstd is not found in the environment run by haskell-ci.yml.

  9. Check if zstd is killed by a forecoming step.

    I inserted steps that check if zstd is there. It is not there from the start.

  10. Look again at haskell-ci.yml.

    haskell-ci.yml specifies a container, so all jobs are run in that container. Who knows why, but this container does not have zstd. Therefore, the action picks a different compression. Therefore, version of the cache is different. Therefore, cache miss. This is my conclusion from evidence.

the outcome

You now see how hard it is to find the fault behind a failure like this, where at every step of the way the view is further obscured. The fault is not local — it is found in another workflow script, and in another section thereof.

  • It is not clear what the version hash for a given cache is. The action does not say anything about it. This is easy to make better: make the action say what the version hash is, every time. On top of that — on a cache miss, make the action say a warning if there are caches which key matches, but which version hash does not match.
  • It is not clear what the version hash is made of. It is a hash — it cannot be undone, only guessed at. How far easier it would be if version was a JSON string. Is there a justification from security for this hashing? If so, the action should say what it makes the version hash from, when it makes a cache, so that I can spot the difference.

Even when the fault is found, I now have to go to some length to run release.yml in the same container as haskell-ci.yml, or delete zstd, or take some other otherwise needless steps.

  • There is no way to tell the action what compression I want.

Can this be improved?

kindaro avatar Sep 07 '22 16:09 kindaro

Thanks for deep investigation and sharing details!

There is a commentary about cache version in README. Hopefully this was of some help to you!

This is easy to make better: make the action say what the version hash is, every time. On top of that — on a cache miss, make the action say a warning if there are caches which key matches, but which version hash does not match.

This sounds reasonable. What makes it difficult is that version is a hash and hence very difficult to make sense of it. It should not end up as more information which is not much useful.

it is not clear what the version hash is made of. It is a hash — it cannot be undone, only guessed at. How far easier it would be if version was a JSON string. Is there a justification from security for this hashing?

Atleast one reason I know: Version is made up of, among other things, path used in action. It can be multiple paths and hence will have problems associated with unbounded strings - will leads to collision due to truncation or escaping special chars. However definitely something to consider.

bishal-pdMSFT avatar Sep 11 '22 16:09 bishal-pdMSFT

There is a commentary about cache version in README. Hopefully this was of some help to you!

Yes, there is an overview in README, it was somewhat helpful. It would be better if it mentioned where to find out the caches and their hashes. (A link to GitHub API.) It could also point to the function that does the hashing. (I linked it above.) And then the action should report what ingredients went into the hash and the resulting hash whenever it creates a new cache or tries to access an existing cache. This is of course quite technical, but maybe with a sprinkle of HTML it can be made small print or folded.


What about adding an option to force a specific compression? I am still stuck on my problem and this option would help.

kindaro avatar Sep 11 '22 20:09 kindaro

I spent way too much time this week trying to figure this out. I am running cache on 2 different runners. The key is identical (includes the git SHA), so it looks like it is saving the same one, but it isn't. I tried different paths, everything. And then I realized that the compression is different, as the self-hosted runner doesn't have zstd.

@bishal-pdMSFT I think what would be enormously helpful is if it can report the version hash and the parts that make it up. At least that way, we can look at cache saves and restores and have some basis for comparison.

deitch avatar Sep 25 '22 10:09 deitch

We've also seen a cache miss that I tracked down to this issue. It would be great if this was slightly more debuggable and didn't rely on eyeballing the tar command across actions but had some explicit information.

make the action say a warning if there are caches which key matches, but which version hash does not match. This suggestion would make it super obvious, so 👍🏽 from me.

https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows is where I tend to go to remember how caches work across branches, as well as the storage limit / eviction, i.e. that's where I go to fundamentally understand how Github's caches work. Maybe not the best page for it either, but it comes closer than the docs for this action imo.

ramblingenzyme avatar Oct 13 '22 03:10 ramblingenzyme

README is updated to add more details about how version is computed #971

bishal-pdMSFT avatar Nov 07 '22 10:11 bishal-pdMSFT

I am a bit underwhelmed by the solution.

kindaro avatar Nov 07 '22 12:11 kindaro

Same here. Why wouldn't you include enough information in the actual reporting? When caching, "here is the actual key I used to save" and when looking, "here is the key I am looking for and didn't find."

deitch avatar Nov 07 '22 12:11 deitch

There is a thinking to make the action logs more verbose to include details about version mismatch. But right now I am not too sure about timeline for that. That's why I closed the issue as the docs update are addressing it to certain extent.

cc @Phantsure if you plan to track this then you can chose to re-open this issue.

bishal-pdMSFT avatar Nov 07 '22 17:11 bishal-pdMSFT

I plan to add more verbose to include details about version mismatch. As version mismatch can occur due to either compression or path. From my understanding details asked so far are:

  • Provide all version hashes present for key in case of mismatch.
    • Feels more like a log to go in debug log. Only seen when running in debug mode.
  • After a mismatch, tell that reasons can be different, i.e. compression or path. Let user figure out the actual reason.
    • Feels more like this would be useful for most of the people.

Phantsure avatar Nov 10 '22 16:11 Phantsure

Have added more details about cache version and a list of other cache with same key but different versions and scopes to better debug: https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/cacheHttpClient.ts#L132 Closing this issue as resolved

Phantsure avatar Dec 23 '22 09:12 Phantsure