hackage-server

Cache causing invalid hash for 01-index

Open chreekat opened this issue 10 months ago • 9 comments

https://github.com/haskell/ghcup-hs/actions/runs/13656361157/job/38176422379?pr=1238#step:5:80

Released cache lock on C:\/Users/runneradmin/AppData/Roaming/stack/pantry/hackage/hackage-security-lock
Verification loop. Errors in order:
  Invalid hash for <repo>/01-index.tar.gz
  Invalid hash for <repo>/01-index.tar.gz
  Invalid hash for <repo>/01-index.tar.gz
  Invalid hash for <repo>/01-index.tar.gz
  Invalid hash for <repo>/01-index.tar.gz

chreekat • Mar 04 '25 18:03

related to https://github.com/haskell/hackage-security/issues/123

avdv • Aug 19 '25 05:08

As discussed here, this issue is caused by inconsistent information being cached for files that depend on each other:

  • timestamp.json contains the hash for snapshot.json
  • snapshot.json contains the hash for 01-index.tar.gz

Now, the JSON files have a max-age of 60 seconds, whereas the tarball has one of 300 seconds.

When the files on the origin server are updated, the copies cached by Fastly / varnish can get out of sync. It can take up to 60 seconds until timestamp.json and snapshot.json are consistent again, and up to 300 seconds until snapshot.json and the tarball are consistent again.
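To make the dependency chain concrete, here is a minimal Python sketch. It is a toy model and not hackage-security's real format: the actual files are signed JSON documents, and `make_version` and `verify` are hypothetical helpers invented for this illustration. It only shows why a fresh snapshot.json combined with a stale tarball yields the "Invalid hash" error.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_version(index_bytes: bytes) -> dict:
    """Build a mutually consistent set of the three files for one origin state.

    Toy encoding (an assumption): each file just embeds the hash of the
    file it vouches for.
    """
    snapshot = ("snapshot:" + sha256(index_bytes)).encode()
    timestamp = ("timestamp:" + sha256(snapshot)).encode()
    return {
        "timestamp.json": timestamp,
        "snapshot.json": snapshot,
        "01-index.tar.gz": index_bytes,
    }


def verify(files: dict) -> str:
    """Walk the chain top-down and report the first file whose hash is wrong."""
    expected_snapshot = files["timestamp.json"].decode().split(":", 1)[1]
    if sha256(files["snapshot.json"]) != expected_snapshot:
        return "Invalid hash for snapshot.json"
    expected_index = files["snapshot.json"].decode().split(":", 1)[1]
    if sha256(files["01-index.tar.gz"]) != expected_index:
        return "Invalid hash for 01-index.tar.gz"
    return "ok"


old = make_version(b"index v1")
new = make_version(b"index v2")

# JSON files already refreshed (short max-age), tarball still the cached old one.
stale_tarball = {**new, "01-index.tar.gz": old["01-index.tar.gz"]}

# timestamp.json refreshed, snapshot.json still the cached old one.
stale_snapshot = {**old, "timestamp.json": new["timestamp.json"]}
```

Here `verify(stale_tarball)` reports the tarball and `verify(stale_snapshot)` reports snapshot.json, matching the two failure modes described above.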

Apparently, there are only 5 retries (I guess there is no exponential back-off, right?), so verification can fail pretty frequently, depending on how many updates happen on Hackage throughout the day.
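As a rough back-of-the-envelope check on the retry question, the sketch below compares the total time spent retrying with the 300-second window. The retry count of 5 comes from the log above; the base delay and factor are made-up parameters, and `backoff_delays` is a hypothetical helper, not what the client actually does.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delay before each retry under plain exponential back-off (no jitter)."""
    return [base * factor ** attempt for attempt in range(retries)]


def covers_window(delays: List[float], window_seconds: float) -> bool:
    """True if the retries span at least the whole stale-cache window."""
    return sum(delays) >= window_seconds


short = backoff_delays(5)             # 1 s base: [1.0, 2.0, 4.0, 8.0, 16.0]
long = backoff_delays(5, base=10.0)   # 10 s base: 10 + 20 + 40 + 80 + 160 = 310 s
```

With a 1-second base, five retries span only 31 seconds, so even with back-off they would not outlast a 300-second window; a 10-second base would, at the cost of a job waiting over five minutes.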

This could be solved in different ways... probably the best thing would be to send purge requests to varnish. WDYT?

Maybe there is a better place to discuss this?

avdv • Sep 25 '25 15:09

This seems the right place, as the cache control for both the json files and the index tarball is set in the server code. I agree it would be better to pick a uniform value for both -- I think 300 would be fine.

This won't fix all our issues, I think, but would help with one at least.

I suggest we change the 1 minute to 5 here: https://github.com/haskell/hackage-server/blob/8815bc9e27ec25506952f6e41b38d5293d5d31d6/src/Distribution/Server/Features/Security.hs#L155

gbaz • Sep 25 '25 19:09

> This seems the right place, as the cache control for both the json files and the index tarball is set in the server code. I agree it would be better to pick a uniform value for both -- I think 300 would be fine.

IMO that would not solve the problem at all; it would actually make things worse, since you would then hit the problem for timestamp.json and snapshot.json more frequently.

Currently we see the error for 01-index.tar.gz most of the time; it is very rare for snapshot.json.

I was thinking this might not be quite the right place to solve this, because it seems to be a deployment issue rather than an issue with how the server works.

avdv • Sep 25 '25 20:09

I'm confused. That 300 that I'm proposing is the max-age for the json files.

We could also cut the max-age from 300 to 60 for the 01-index if you prefer. Note that both max-age values are set in hackage code and not by varnish at the cache layer.

gbaz • Sep 25 '25 21:09

> I'm confused. That 300 that I'm proposing is the max-age for the json files.

In the worst case this would happen:

  1. varnish fetches snapshot.json
  2. files are updated on hackage
  3. varnish fetches timestamp.json

If that happens in quick succession, varnish delivers inconsistent information for the time it considers its information fresh. So if you increase max-age to 300, it would consider snapshot.json fresh for (almost) 300 seconds until it re-fetches it and ends up in a consistent state. During that time clients will not be able to verify the snapshot.json file.
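The worst case above can be replayed with a small simulation. This is a deliberately simplified model of a single cache node: `TtlCache` and `inconsistent_seconds` are hypothetical names, time advances in whole seconds, and real varnish / Fastly behaviour (multiple edge nodes, `Age` headers, grace periods) is ignored.

```python
class TtlCache:
    """Serves a cached version until it is max_age seconds old, then refetches."""

    def __init__(self, max_age: dict):
        self.max_age = max_age   # file name -> max-age in seconds
        self.entries = {}        # file name -> (version, fetched_at)

    def get(self, name: str, now: int, origin_version: int) -> int:
        entry = self.entries.get(name)
        if entry is None or now - entry[1] >= self.max_age[name]:
            entry = (origin_version, now)   # refetch from the origin
            self.entries[name] = entry
        return entry[0]


def inconsistent_seconds(json_max_age: int, horizon: int = 400) -> int:
    """Count the seconds during which the two JSON files disagree."""
    cache = TtlCache({"snapshot.json": json_max_age,
                      "timestamp.json": json_max_age})
    # Step 1: the cache fetches snapshot.json at t=0 (origin at version 1).
    cache.get("snapshot.json", now=0, origin_version=1)
    # Step 2: the origin is updated at t=1, so it serves version 2 from now on.
    bad = 0
    # Step 3: from t=2 on, clients request both files every second.
    for now in range(2, horizon):
        timestamp = cache.get("timestamp.json", now, origin_version=2)
        snapshot = cache.get("snapshot.json", now, origin_version=2)
        if timestamp != snapshot:
            bad += 1
    return bad
```

In this model `inconsistent_seconds(60)` gives 58 and `inconsistent_seconds(300)` gives 298: the window during which clients cannot verify snapshot.json grows with the max-age.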

> We could also cut the max-age from 300 to 60 for the 01-index if you prefer.

Yes, I think that would mitigate the problem a bit. Assuming that varnish uses conditional requests, this should not increase bandwidth usage significantly, right?
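For reference, this is roughly what a conditional request buys. The sketch assumes the origin sends an ETag and the cache revalidates with `If-None-Match`; whether the deployed varnish / Fastly setup and hackage-server actually do this is exactly the open question. `etag_for` and `respond` are hypothetical helpers.

```python
import hashlib
from typing import Optional, Tuple


def etag_for(body: bytes) -> str:
    """Derive a strong ETag from the content (one possible scheme, an assumption)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, bytes, str]:
    """Origin-side handling of a (possibly conditional) GET.

    Returns (status, payload, etag). When the cache's copy is still current,
    only headers go over the wire (304 with an empty payload).
    """
    tag = etag_for(body)
    if if_none_match == tag:
        return 304, b"", tag
    return 200, body, tag


index = b"x" * 1_000_000                       # stand-in for the tarball
status, payload, tag = respond(index, None)     # first fetch: full body
revalidated = respond(index, tag)               # unchanged: 304, no body
changed = respond(index + b"new entry", tag)    # changed: full body again
```

Under these assumptions, a shorter max-age means more revalidations, but each one transfers the tarball only if it actually changed.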

avdv • Sep 26 '25 06:09

On that fix, we can change the uses of maxAgeMinutes in this file https://github.com/haskell/hackage-server/blob/8815bc9e27ec25506952f6e41b38d5293d5d31d6/src/Distribution/Server/Features/Core.hs#L666

gbaz • Sep 26 '25 17:09

> On that fix, we can change the uses of maxAgeMinutes in this file
>
> hackage-server/src/Distribution/Server/Features/Core.hs, line 666 in 8815bc9: serveLegacyPackagesIndexTarGz :: DynamicPath -> ServerPartE Response

I have created a PR here: #1435

Also, it would be great if somebody could have a look at #1431, which potentially fixes another problem and would be nice to have in a deployment as well.

avdv • Sep 29 '25 09:09

BTW, here is a workflow run which illustrates how annoying this problem is: https://github.com/tweag/rules_haskell/actions/runs/18094346202

Of 52 CI jobs (those that make use of hackage), 26 failed because the cache was in an inconsistent state (invalid hash for 01-index.tar.gz).

Since decreasing the max-age does not really solve the problem, I wish there were a better solution. How about sending a PURGE to varnish whenever any of those files change, so that it refetches the current version? This way all the cached files should be in sync. WDYT?
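A rough sketch of what the purge step could look like, only to make the idea concrete. Everything here is an assumption: varnish handles `PURGE` only if its VCL is configured for it, Fastly has its own authenticated purge API, and the base URL, the path list and `build_purge_requests` are placeholders. The requests are only constructed, not sent.

```python
import urllib.request
from typing import List

# The three interdependent files from this issue (paths are assumed).
CACHED_PATHS = ["/timestamp.json", "/snapshot.json", "/01-index.tar.gz"]


def build_purge_requests(base_url: str,
                         paths: List[str] = CACHED_PATHS) -> List[urllib.request.Request]:
    """Build one PURGE request per cached file, without sending anything."""
    base = base_url.rstrip("/")
    return [urllib.request.Request(base + path, method="PURGE") for path in paths]


# Hypothetical cache endpoint; sending would be urllib.request.urlopen(req).
requests = build_purge_requests("https://cache.example/")
```

Purging all three together after an update is the point: the cache then refetches a set of files that were published together instead of mixing old and new ones.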

avdv • Sep 30 '25 08:09