Cache causing invalid hash for 01-index
https://github.com/haskell/ghcup-hs/actions/runs/13656361157/job/38176422379?pr=1238#step:5:80
Released cache lock on C:\/Users/runneradmin/AppData/Roaming/stack/pantry/hackage/hackage-security-lock
Verification loop. Errors in order:
Invalid hash for <repo>/01-index.tar.gz
Invalid hash for <repo>/01-index.tar.gz
Invalid hash for <repo>/01-index.tar.gz
Invalid hash for <repo>/01-index.tar.gz
Invalid hash for <repo>/01-index.tar.gz
related to https://github.com/haskell/hackage-security/issues/123
As discussed there, this issue is caused by inconsistent information being cached for dependent files:
- `timestamp.json` contains the hash for `snapshot.json`
- `snapshot.json` contains the hash for `01-index.tar.gz`
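The dependency chain can be sketched as follows (a minimal illustration with in-memory bytes and plain SHA-256, not the actual hackage-security/TUF metadata format):

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# In-memory stand-ins for the three files (real files carry richer metadata).
index_tarball = b"...contents of 01-index.tar.gz..."
snapshot = json.dumps({"01-index.tar.gz": sha256_hex(index_tarball)}).encode()
timestamp = json.dumps({"snapshot.json": sha256_hex(snapshot)}).encode()

def verify_chain(timestamp: bytes, snapshot: bytes, tarball: bytes) -> bool:
    """A client can only trust the tarball if every link in the chain matches."""
    ts = json.loads(timestamp)
    if ts["snapshot.json"] != sha256_hex(snapshot):
        return False  # corresponds to "Invalid hash for snapshot.json"
    sn = json.loads(snapshot)
    return sn["01-index.tar.gz"] == sha256_hex(tarball)

# A stale cached tarball breaks verification even when the JSON files are fresh:
assert verify_chain(timestamp, snapshot, index_tarball)
assert not verify_chain(timestamp, snapshot, b"stale tarball from cache")
```

This is why a cache that serves a fresh `snapshot.json` next to a stale `01-index.tar.gz` produces exactly the "Invalid hash" error above.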
Now, the JSON files have a max-age of 60 seconds, whereas the tarball has one of 300 seconds.
When the files on the origin server are updated, the information cached by Fastly / varnish can get out of sync. It can take up to 60 seconds until timestamp and snapshot are consistent again, and up to 300 seconds until snapshot and tarball are consistent again.
Apparently, there are only 5 retries (I guess there is no exponential back-off, right?), so this can fail pretty frequently (depending on how many updates happen on Hackage throughout the day).
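For comparison, a client-side mitigation would be retries with exponential back-off, so a handful of attempts spans most of the freshness window instead of failing within a few seconds. A sketch (the retry count and base delay are made-up parameters, not what cabal or stack actually use):

```python
import time

def fetch_with_backoff(fetch, retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `fetch`, sleeping 2s, 4s, 8s, 16s between the 5 attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except ValueError:  # stand-in for an "invalid hash" verification failure
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Simulate a cache that only becomes consistent on the 4th request:
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 4:
        raise ValueError("Invalid hash for 01-index.tar.gz")
    return "verified index"

delays = []
result = fetch_with_backoff(flaky_fetch, sleep=delays.append)
assert result == "verified index"
assert delays == [2.0, 4.0, 8.0]  # back-off so far: 14 seconds total
```

With back-off the five attempts cover tens of seconds, which rides out a short inconsistency window; five immediate retries do not.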
This could be solved in different ways... probably the best thing would be to send purge requests to varnish. WDYT?
Maybe there is a better place to discuss this?
This seems the right place, as the cache control for both the json files and the index tarball is set in the server code. I agree it would be better to pick a uniform value for both -- I think 300 would be fine.
This won't fix all our issues, I think, but would help with one at least.
Suggest we change the 1 minute to 5 here: https://github.com/haskell/hackage-server/blob/8815bc9e27ec25506952f6e41b38d5293d5d31d6/src/Distribution/Server/Features/Security.hs#L155
> This seems the right place, as the cache control for both the json files and the index tarball is set in the server code. I agree it would be better to pick a uniform value for both -- I think 300 would be fine.
IMO that would not solve the problem at all; actually, it would make things worse, since you would then hit the problem for timestamp.json and snapshot.json more frequently.
Currently we see the error for 01-index.tar.gz most of the time, it is very rare for snapshot.json.
I was thinking this might not be quite the right place to solve this, because this seems to be a deployment issue, not an issue of how the server works.
I'm confused. That 300 that I'm proposing is the max-age for the json files.
We could also cut the max-age from 300 to 60 for the 01-index if you prefer. Note that both max-age values are set in hackage code and not by varnish at the cache layer.
> I'm confused. That 300 that I'm proposing is the max-age for the json files.
In the worst case this would happen:
- varnish fetches `snapshot.json`
- files are updated on hackage
- varnish fetches `timestamp.json`
If that happens in quick succession, varnish delivers inconsistent information for the time it considers its information fresh. So if you increase max-age to 300, it would consider snapshot.json fresh for (almost) 300 seconds until it re-fetches it and ends up in a consistent state. During that time clients will not be able to verify the snapshot.json file.
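The race can be made concrete with a toy cache model (hypothetical names; only the max-age values are taken from the thread, and real varnish semantics are more involved):

```python
# Toy model: origin updates all files atomically at t=100; varnish cached
# snapshot.json at t=99 (just before) and timestamp.json at t=101 (just after).
MAX_AGE = {"timestamp.json": 60, "snapshot.json": 60, "01-index.tar.gz": 300}

cache = {
    "snapshot.json": {"version": "old", "fetched_at": 99},
    "timestamp.json": {"version": "new", "fetched_at": 101},
}

def served_version(name, now):
    entry = cache[name]
    if now - entry["fetched_at"] >= MAX_AGE[name]:
        return "new"  # entry expired; varnish refetches from the origin
    return entry["version"]

# Between t=101 and t=159, clients see a new timestamp pointing at an old
# snapshot -> "Invalid hash for snapshot.json":
assert served_version("timestamp.json", 120) == "new"
assert served_version("snapshot.json", 120) == "old"
# Once the stale snapshot expires, the pair is consistent again:
assert served_version("snapshot.json", 160) == "new"
```

Raising the JSON max-age to 300 only widens the window in this model: the stale `snapshot.json` would then survive for up to 300 seconds instead of 60.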
> We could also cut the max-age from 300 to 60 for the 01-index if you prefer.
Yes, I think that would mitigate the problem a bit. Assuming that varnish uses conditional requests, this should not increase bandwidth usage significantly, right?
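The bandwidth argument can be sketched with a toy conditional-GET exchange (hypothetical sizes; whether varnish actually revalidates this way depends on its configuration):

```python
# Toy origin: answers 304 Not Modified when the client's ETag still matches,
# so a revalidation costs only headers, not the large tarball body.
INDEX_BODY = b"x" * 1000  # stand-in for 01-index.tar.gz
INDEX_ETAG = "v1"

def origin_get(etag=None):
    """Return (status, etag, body) for a GET with an optional If-None-Match."""
    if etag == INDEX_ETAG:
        return (304, INDEX_ETAG, b"")    # revalidated: no body transferred
    return (200, INDEX_ETAG, INDEX_BODY)  # full response

# First fetch transfers the full body; later revalidations of an unchanged
# file transfer nothing:
status, etag, body = origin_get()
assert (status, len(body)) == (200, 1000)
status, _, body = origin_get(etag=etag)
assert (status, len(body)) == (304, 0)
```

So shortening the tarball's max-age mostly adds cheap 304 round-trips, as long as revalidation is in play.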
On that fix, we can change the uses of maxAgeMinutes in this file https://github.com/haskell/hackage-server/blob/8815bc9e27ec25506952f6e41b38d5293d5d31d6/src/Distribution/Server/Features/Core.hs#L666
I have created a PR here: #1435
Also, it would be nice if somebody could have a look at #1431, which potentially fixes another problem and would be good to have in a deployment as well.
BTW, here is a workflow run which illustrates how annoying this problem is: https://github.com/tweag/rules_haskell/actions/runs/18094346202
Of 52 CI jobs (those that make use of hackage), 26 failed because the cache was in an inconsistent state (invalid hash for 01-index.tar.gz).
Since decreasing the max-age does not really solve the problem, I wish there were a better solution. How about: whenever any of those files changes, send a PURGE request to varnish to make it refetch the current version? That way all the cached files should stay in sync. WDYT?
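The purge-on-update idea can be sketched with a toy cache model (hypothetical names; `purge` here just drops the entries, which is the effect a varnish PURGE has on matching cached objects):

```python
# Toy varnish: a dict from path to cached version.
cache = {
    "timestamp.json": "v1",
    "snapshot.json": "v1",
    "01-index.tar.gz": "v1",
}

def purge(paths):
    """Stand-in for the origin sending an HTTP PURGE to varnish per path."""
    for p in paths:
        cache.pop(p, None)

def serve(path, origin_version):
    # Cache miss -> fetch the current version from the origin and cache it.
    if path not in cache:
        cache[path] = origin_version
    return cache[path]

# Origin updates to v2 and purges all three dependent files in one step:
purge(["timestamp.json", "snapshot.json", "01-index.tar.gz"])
# Every subsequent request now sees a mutually consistent v2 set:
assert all(serve(p, "v2") == "v2"
           for p in ("timestamp.json", "snapshot.json", "01-index.tar.gz"))
```

The key point is that the three files leave the cache together, so no client can observe a new `timestamp.json` next to an old `snapshot.json` or tarball.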