
Cache entries can grow unbounded

Open casperisfine opened this issue 6 years ago • 6 comments

Initially reported in https://github.com/Shopify/shipit-engine/issues/935 by @josacar

Context

Shipit is a deploy tool; as such, it frequently hits the same endpoint to grab updates, e.g. GET https://api.github.com/v3/repos/my-org/my-repo/commits?branch=master.

Additionally, it uses the GitHub App API for authentication, so its Authorization header changes every hour.

Finally, GitHub's API responses specify the following Vary header: Accept, Authorization, Cookie, X-GitHub-OTP.

Problem

Since the URL never changes, the same cache entry is reused all the time. However, since the Authorization token changes every hour, new entries are regularly prepended until the cache entry becomes huge and starts taking a long time to deserialize, or even exceeds the cache store's size limit (Memcached limits values to 1 MB).
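To make the failure mode concrete, here is a minimal model (not the gem's actual code) of a single cache entry that accumulates one serialized request/response pair per distinct Authorization token; with hourly rotation and a ~10 KB payload, a few days is enough to approach Memcached's 1 MB value limit:

```ruby
require "json"

entry = [] # every response is stored under the one key derived from the URL

72.times do |hour| # three days of hourly token rotation
  request  = { "Authorization" => "Bearer token-#{hour}" }
  response = { status: 200, body: "x" * 10_000 } # ~10 KB payload
  entry.unshift([request, response]) # new pair is prepended, nothing is evicted
end

payload = JSON.dump(entry)
puts "entries: #{entry.size}, serialized size: #{payload.bytesize} bytes"
```

Every read now has to fetch and deserialize the whole array just to find the one pair matching the current token.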

Solutions

I can see several solutions to this, with very different tradeoffs.

Very simple, set a cap on the number of entries per key

Self explanatory, we could simply set a configurable upper limit on the number of entries, and drop the oldest entry whenever we go over.

It's an easy fix. It's not quite ideal, as it might surprise people, but IMHO it's better to be surprised by cache evictions than by cache OOM or corruption (in the case of Memcached truncation).
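A sketch of the cap, using illustrative names rather than the gem's API (the `ENTRY_CAP` constant and `prepend_with_cap` helper are hypothetical):

```ruby
ENTRY_CAP = 25 # hypothetical configurable upper limit

def prepend_with_cap(entry, request, response, cap: ENTRY_CAP)
  entry.unshift([request, response])          # newest pair goes first
  entry.pop(entry.size - cap) if entry.size > cap # drop the oldest pairs
  entry
end

entry = []
30.times { |i| prepend_with_cap(entry, { "Authorization" => "t#{i}" }, { status: 200 }) }
entry.size # => 25, oldest five pairs were evicted
```

Since lookups scan from the front, capping from the back evicts exactly the least recently written pairs.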

Somewhat simple, evict stale entries

When prepending a new entry, we could evict all the stale ones. That's easy if max-age / Expires is set; if it isn't, then we have to rely on heuristics as specified in the RFC.
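A sketch of the easy case, assuming each stored response carries the time it was cached plus a max-age (the hash shape and helper name are illustrative; responses with no max-age are treated as immediately stale here, whereas the real fix would apply the RFC's heuristic freshness):

```ruby
def prepend_and_evict_stale(entry, request, response, now: Time.now)
  entry.unshift([request, response])
  entry.reject! do |_req, res|
    # stale once its freshness lifetime has elapsed
    res[:date] + res.fetch(:max_age, 0) < now
  end
  entry
end

now = Time.now
entry = [
  [{}, { date: now - 7200, max_age: 3600 }], # expired an hour ago
  [{}, { date: now - 60,   max_age: 3600 }], # still fresh
]
prepend_and_evict_stale(entry, {}, { date: now, max_age: 3600 }, now: now)
entry.size # => 2, the expired pair was dropped on write
```

Note this only bounds growth for responses that do expire; a long max-age still lets entries pile up between evictions.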

Harder, and potentially slower, store varying responses under different keys

Since Faraday::HttpCache implicitly relies on the underlying cache store for garbage collection, the idea here would be to store individual responses under different keys, so that rarely accessed ones can be garbage collected.

The very tricky part is that the Vary header is in the response, so it means maintaining some kind of non-trivial index of which Vary values were returned for that URL, and then using that index to select a good candidate for revalidation.

Additionally it means that it will at least double the number of cache lookups, which is not a small downside.
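A rough sketch of the idea, with an in-memory Hash standing in for the real backend; the index key, `response_key` helper, and stored shapes are all made up for illustration:

```ruby
require "digest"

# Derive a per-response key from the URL plus the request's values for the
# headers named by the response's Vary header.
def response_key(url, vary_headers, request_headers)
  varying = vary_headers.sort.map { |h| "#{h}=#{request_headers[h]}" }.join("&")
  "#{url}:#{Digest::SHA1.hexdigest(varying)}"
end

store = {} # stand-in for the real cache backend
url = "https://api.github.com/v3/repos/my-org/my-repo/commits?branch=master"
index_key = "#{url}:vary-index"

# Writing: record the Vary set seen for this URL, store the response
# under its own key so the backend can garbage-collect it independently.
vary    = %w[Accept Authorization]
request = { "Accept" => "application/json", "Authorization" => "Bearer t1" }
store[index_key] = [vary]
store[response_key(url, vary, request)] = { status: 200 }

# Reading: two lookups instead of one -- the index first, then the response.
hit = store.fetch(index_key, [])
           .filter_map { |v| store[response_key(url, v, request)] }
           .first
```

The read path makes the extra-lookup cost visible: every request pays for the index fetch even on a miss, which is the downside noted above.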

cc @rafaelfranca @Edouard-Chin @josacar

casperisfine avatar Aug 29 '19 14:08 casperisfine

I think relying on the cache store is the way to go. Just as we do a hash of the URL, we can hash all the vary headers too, so the cache entry is smaller and serializing or deserializing it will be faster.

josacar avatar Aug 30 '19 07:08 josacar

> I think the same as we do a hash for the URL, we can do it for all the vary headers too

It's not as simple, though: the Vary header is in the response, not the request. So if you do GET /foo, the middleware doesn't know which headers to hash, hence which entry to look for. You need to either look at all (or some) entries to consider historical values of the Vary header, or maintain some kind of index.

byroot avatar Aug 30 '19 08:08 byroot

I'd start with the simple solution. If that doesn't work well, we can increase the complexity.

rafaelfranca avatar Sep 04 '19 04:09 rafaelfranca

Hi, did you end up solving this? I'm curious whether you found a good solution for it.

georgeguimaraes avatar Apr 06 '20 21:04 georgeguimaraes

We didn't implement a solution, but I think "Very simple, set a cap on the number of entries per key" is a good approach.

rafaelfranca avatar Apr 06 '20 21:04 rafaelfranca

We ran into this same problem and went with limiting the number of entries stored per cache key. Take a look at #121 and let me know what you think!

tjwallace avatar Oct 01 '20 20:10 tjwallace