chproxy icon indicating copy to clipboard operation
chproxy copied to clipboard

Redis connection loss resiliency

Open Vince-Chenal opened this issue 2 years ago • 2 comments

Description

Trying to solve this issue: https://github.com/ContentSquare/chproxy/issues/337

Commit by commit:

  1. Stop checking redis connectivity at application startup
  2. Add a background routine that ping redis and maintain its status
  3. Serve requests from cache only when cache is seen as alive
  4. Add a cache_alive metric to enable monitoring cache connectivity
  5. Update rediscache wrong instantiation test according to 1.

Remark: the method serveFromCache already fallback to proxying requests to Clickhouse (after some time...) in case of error so this PR just add a layer in front of this mechanism.

I did not know what tests to add so feel free to tell me

Pull request type

Please check the type of change your PR introduces:

  • [x] Bugfix
  • [ ] Feature
  • [ ] Code style update (formatting, renaming)
  • [ ] Refactoring (no functional changes, no api changes)
  • [ ] Build related changes
  • [ ] Documentation content changes
  • [ ] Other (please describe):

Checklist

  • [x] Linter passes correctly
  • [ ] Add tests which fail without the change (if possible)
  • [x] All tests passing
  • [ ] Extended the README / documentation, if necessary

Does this introduce a breaking change?

  • [ ] Yes
  • [x] No

Further comments

Vince-Chenal avatar Jan 09 '24 15:01 Vince-Chenal

Your Render PR Server URL is https://chproxy-pr-390.onrender.com.

Follow its progress at https://dashboard.render.com/static/srv-cmeme9a1hbls738qmd0g.

render[bot] avatar Jan 09 '24 15:01 render[bot]

So as discussed offline this PR does have an impact on concurrent transactions

As far as I can see the following could happen in the case of a concurrent transaction: TransactionRegistry is used to maintain status of concurrent queries between multiple proxies. Concurrent Transactions are checked in AwaitForConcurrentTransaction, which will return TransactionStatus{State: transactionAbsent} + an error in case of an error on redis.Get.

That means that transaction would fail: as the error would be returned and in the serveFromCache method we would return http.StatusInternalServerError.

However, the PR adds a guard around serveFromCache based on the state of the redis healthcheck, so this should only affect in-flight requests at the time of the healthcheck failure (and probably is acceptable).

It does mean we suddenly lose protection from the Thundering Herd problem (https://www.chproxy.org/configuration/caching/).

As a future improvement we can allow for the cache to fallback to file/in-memory caching + transaction maintenance so we can at least provide a certain level of protection even in case of Redis failure.

We can decide to use a configuration option to allow users to decide between protection from Thundering herd or protection from Redis failure.

One potential area of improvement is too also use the Alive value to short circuit certain operations of the async cache (such as AwaitForConcurrentTransaction, no need to try to reach redis) or even use it as a fallback to a simpler in-memory cache.

Blokje5 avatar Jan 15 '24 14:01 Blokje5