authentik icon indicating copy to clipboard operation
authentik copied to clipboard

root: Improve Redis config and support

Open PKizzle opened this issue 1 year ago • 68 comments

Details

Resolves #5077, resolves #1285, resolves #3979, resolves #5531

External source code

Go

Python

  • Kombu cluster support adapted from https://github.com/celery/kombu/pull/1021
  • Celery cluster support adapted from https://github.com/hbasria/celery-redis-cluster-backend

Changes

New Features

  • Uses one URL for configuration of Redis (shared between Go and Python)
  • The Python URL parser has been created in a way to resemble the Go version
  • By default uses Redis for broker, channel, backend (current state), but also allows for custom options
  • Moves part of the configuration currently under redis category into new dedicated sections -> split into https://github.com/goauthentik/authentik/pull/7097
  • Adds deprecation check to config parser to allow for easy transition between old and new configuration scheme
  • Old Redis configuration env vars are automatically converted to a Redis config URL if no Redis config URL has been configured otherwise they are ignored
  • The new Redis URL allows environment variables to be included in bash style ${VAR}

Breaking Changes

  • None -> Old configurations are still supported however users are advised to switch to the new URL-based configuration

Additional information

  • Newly supported Redis configurations are only available using the new Redis config URL environment variable
  • The Redis config URL requires special characters in parameters to be URL encoded (e.g. username, password)
  • Has been successfully tested using Redis cluster, sentinel and Unix socket connections

Todo

  • [x] ~~Move from RedisChannelLayer to RedisPubSubChannelLayer~~ -> not possible https://github.com/redis/redis-py/issues/2219
  • [ ] Add documentation for Redis URL parameters
  • [ ] Add unittests for Redis middleware

Sentinel

  • [x] Handle ReadOnly error when master is demoted to slave
  • [x] Fix unittests

Cluster

  • [x] Remove all usage of KEYS, MGET and MSET in Django and Celery
  • [x] Fix unittests
  • [ ] Check all connections are correctly closed e.g. in redis_middleware_channels.py

PKizzle avatar Oct 09 '23 15:10 PKizzle

Deploy Preview for authentik-storybook canceled.

Name Link
Latest commit 2d57d4e70615a3896d38f7fb2dc3cae549d32f97
Latest deploy log https://app.netlify.com/sites/authentik-storybook/deploys/663d6f3edec92d000839eb36

netlify[bot] avatar Oct 09 '23 15:10 netlify[bot]

Deploy Preview for authentik ready!

Name Link
Latest commit dd71ad5625c353dae88ca900ad21821a1e0353a2
Latest deploy log https://app.netlify.com/sites/authentik/deploys/65c6cfbc8399db000860127e
Deploy Preview https://deploy-preview-7118--authentik.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Oct 09 '23 15:10 netlify[bot]

Codecov Report

Attention: Patch coverage is 86.84896% with 202 lines in your changes are missing coverage. Please review.

Project coverage is 92.19%. Comparing base (046b8d5) to head (2d57d4e).

Files Patch % Lines
authentik/root/redis_middleware_celery.py 43.01% 53 Missing :warning:
authentik/root/redis_middleware_kombu.py 79.92% 52 Missing :warning:
authentik/root/redis_middleware_pubsub_channels.py 26.22% 45 Missing :warning:
authentik/root/redis_middleware_channels.py 78.64% 22 Missing :warning:
authentik/root/redis_middleware.py 61.11% 14 Missing :warning:
authentik/root/redis_middleware_django.py 87.65% 10 Missing :warning:
authentik/lib/config.py 96.36% 2 Missing :warning:
authentik/lib/utils/parser.py 99.45% 2 Missing :warning:
authentik/events/context_processors/asn.py 50.00% 1 Missing :warning:
authentik/events/context_processors/geoip.py 50.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7118      +/-   ##
==========================================
- Coverage   92.37%   92.19%   -0.18%     
==========================================
  Files         704      713       +9     
  Lines       34401    35930    +1529     
==========================================
+ Hits        31779    33127    +1348     
- Misses       2622     2803     +181     
Flag Coverage Δ
e2e 48.79% <27.99%> (-0.85%) :arrow_down:
integration 25.45% <27.99%> (+0.08%) :arrow_up:
unit 89.71% <86.13%> (-0.12%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 09 '23 15:10 codecov[bot]

Several parts of this PR have been split up into https://github.com/goauthentik/authentik/pull/7117 and https://github.com/goauthentik/authentik/pull/7097.

PKizzle avatar Oct 09 '23 15:10 PKizzle

Testing image available at https://hub.docker.com/repository/docker/thegrandpkizzle/authentik/tags?page=1&ordering=last_updated Please report any issues especially when using a non-standard Redis configuration (sentinel, cluster).

PKizzle avatar Oct 11 '23 12:10 PKizzle

@PKizzle Looks like its failing on a sql migration... https://gist.github.com/kashalls/b0021ee1e9909d055711710b54d7abee

Just double-confirmed it now. I migrated back to prod authentik and it synced up just fine. Went back to the latest container you published and its stuck on pg deadlock.

kashalls avatar Oct 16 '23 19:10 kashalls

@PKizzle Okay so weird turn of events. I left it alone and went for some lunch. It eventually came backup.

However any attempt at logging in returns an error with the request id: 0e2e114a695c4c3fade06a84fae64872

I've updated this gist with a new log attached.

kashalls avatar Oct 16 '23 20:10 kashalls

The errors in the first log file are related to dce913496e861c26062a469aa63cfea8d4ebee16 that I have merged from main into this branch in 3b68053c2d0807e1c22055f69d3a51e04d2096e6. Maybe @BeryJu wants to have a look at it. The migration seems to deadlock itself. @kashalls could you please open a separate issue to keep track of this?

The second log file seems to hint at a failed migration where ALTER TABLE otp_static_staticdevice RENAME TO authentik_stages_authenticator_static_staticdevice; has not been executed successfully.

PKizzle avatar Oct 16 '23 22:10 PKizzle

That error should be fixed in the latest beta since https://github.com/goauthentik/authentik/pull/7185

BeryJu avatar Oct 16 '23 22:10 BeryJu

I have pushed new images that contain the mentioned fix and removed the tags which resulted in the failed migration errors.

PKizzle avatar Oct 16 '23 23:10 PKizzle

@PKizzle Still seeing this issue https://gist.github.com/kashalls/b0021ee1e9909d055711710b54d7abee#file-authentik-log-2 Server did not complete the migration. Looks like worker did complete the migration though. mqBuPv

It seems to be running migrations every single time the containers run.

Rebooting the container did end up bringing it up, but now I am seeing You can't write against a read only replica. (when getting it the host of the redis master sentinel).

Still trying to get it to fully load on the authentik-server... Will post when or if I get it working

kashalls avatar Oct 17 '23 00:10 kashalls

Alright, so after 16 minutes of it cycling itself I got to this point. https://gist.github.com/kashalls/b0021ee1e9909d055711710b54d7abee#file-authentik-log-3

This may just be a me issue at this point, but I am passing the redis sentinel master host in and its trying to write to the read only node.

Edit: Looks like a upstream issue: https://github.com/redis/redis-py/issues/626

kashalls avatar Oct 17 '23 00:10 kashalls

Let's see whether I can patch redis-py to use to correct command

PKizzle avatar Oct 17 '23 09:10 PKizzle

@kashalls I have updated the images with an updated master discovery implementation for Redis sentinel

PKizzle avatar Oct 17 '23 15:10 PKizzle

I've been trying to test 2023.8.3-abc2229 but I keep getting deadlocks.

Operations to perform:
  Target specific migration: 0009_auto_20190420_0723, from authentik_stages_authenticator_totp
Running migrations:
  Applying authentik_stages_authenticator_totp.0009_auto_20190420_0723... FAKED
2023-10-17 18:58:47 [info     ] Migration finished applying    migration=otp_merge.py
2023-10-17 18:58:47 [info     ] applying django migrations
2023-10-17 18:58:47 [info     ] waiting to acquire database lock
Traceback (most recent call last):
  File "/lifecycle/migrate.py", line 103, in <module>
    wait_for_lock()
  File "/lifecycle/migrate.py", line 57, in wait_for_lock

kashalls avatar Oct 17 '23 19:10 kashalls

I have been testing the images locally using an empty database and have not had any migration issues.

Since it tries to apply the same migration over and over again could you please directly check in your DB whether the operation has been performed successfully? What happens if you run the SQL statement manually?

Edit: In the django_migrations table the authentik_stages_authenticator_totp.0009_auto_20190420_0723 migration might be marked as faked (marked as having happened although it was never applied). You could try to remove the entry and test again. Please note that I am no Django DB migration expert and can not help with any data loss caused by any of the actions above.

PKizzle avatar Oct 17 '23 19:10 PKizzle

So I dropped the database for giggles (I have backups), and I am still seeing that its trying to write to the read-only replica.. https://gist.github.com/kashalls/b0021ee1e9909d055711710b54d7abee#file-2023-8-3-abc2229-log

Checking db3, shows that there are keys for celery tasks so it is getting the master some of the times.

kashalls avatar Oct 17 '23 19:10 kashalls

So you can not find any keys with the "authentik_channels_" prefix?

PKizzle avatar Oct 17 '23 21:10 PKizzle

So you can not find any keys with the "authentik_channels_" prefix?

I missed it when I was scanning the keys, I do see authentik_channels_:group:group_outpost_uuid.

kashalls avatar Oct 17 '23 22:10 kashalls

I try to replicate the problem locally. Your setup is one Redis master + read only replica + sentinel, correct?

PKizzle avatar Oct 17 '23 22:10 PKizzle

I try to replicate the problem locally. Your setup is one Redis master + read only replica + sentinel, correct?

Running redis sentinel, 3 nodes each with 1 redis + 1 sentinel. Master gets elected by sentinel. This is the redis chart I am running: https://github.com/bitnami/charts/tree/main/bitnami/redis

Here is my configuration as code: Authentik

  • externalsecret.yaml holds my env details
  • helmrelease holds the chart and configuration

Redis for redis chart configuration details

kashalls avatar Oct 17 '23 22:10 kashalls

@PKizzle So I did a hard reboot of the entire cluster ( was preforming a os upgrade ). Flushed the cache and everything "seems" to be working fine. Later tonight I will mess with it to see if its stable

kashalls avatar Oct 17 '23 22:10 kashalls

This is the problem I have with the Python Redis implementation: It works under ideal circumstances but as soon as i.e. the cluster is still initializing nothing works as expected. I wonder how other projects can rely on redis-py in its current state.

PKizzle avatar Oct 17 '23 23:10 PKizzle

This is the problem I have with the Python Redis implementation: It works under ideal circumstances but as soon as i.e. the cluster is still initializing nothing works as expected. I wonder how other projects can rely on redis-py in its current state.

When I was rebooting, I noticed that authentik waited a little bit for the redis instances to come up and configure themselves. Question is does it support changing masters?

kashalls avatar Oct 17 '23 23:10 kashalls

In theory it should. However, considering the frankenstein code other projects are using as wrapper around redis-py I can not tell whether it works reliably. Please try and report any issues you encounter.

PKizzle avatar Oct 17 '23 23:10 PKizzle

In theory it should. However, considering the frankenstein code other projects are using as wrapper around redis-py I can not tell whether it works reliably. Please try and report any issues you encounter.

~~I tried really hard to get it to crash on my end. I think because I had my master timeout so low that it failed over quickly and I only had one or two errors about connecting and it reconnected.~~

I take it back. Its throwing the You can't write against a read only replica again after node-2 pulled master from node-0. I am starting to think that the redis maintainers don't know how their projects function considering that I can't use RedisInsight-v2 to visually look at the keys (it fails when connecting to sentinel, but cli works just fine?)

kashalls avatar Oct 18 '23 01:10 kashalls

Could you create another gist with the log file when changing masters?

PKizzle avatar Oct 18 '23 04:10 PKizzle

Could you create another gist with the log file when changing masters?

@PKizzle I can't get you one from the server, struggling to recreate it. Here's what shows up on the worker side: https://gist.github.com/kashalls/0f75b0446e933deed675f8f231be089e#file-authelia-worker-abc2229

kashalls avatar Oct 18 '23 05:10 kashalls

@PKizzle Was there any more progress on this?

kashalls avatar Oct 21 '23 23:10 kashalls

 @kashalls Fixing the ReadOnly error requires manually checking each existing connection for a master change as it is not automatically handled by redis-py. As a quick fix I am now creating new connections for each command send to Redis.

The suggestion in https://github.com/redis/redis-py/issues/688 to handle the ReadOnly error outside of redis-py shows how much basic functionality is still missing in the project. The follow-up issue https://github.com/redis/redis-py/issues/935 was not even answered.

PKizzle avatar Oct 22 '23 16:10 PKizzle