Race condition during OAuth flow with read-only PostgreSQL replication

Open jorijn opened this issue 1 year ago • 0 comments

Steps to reproduce the problem

I've set up a read-only DB replica to distribute the load between two DB instances. The replication lag between master and replica is about 50-500 milliseconds.

When authenticating a mobile app, or any OAuth-enabled app for that matter, the authorization call and the follow-up call that fetches the token happen too quickly for the read-only replica to catch up. The replica is still unaware of the newly issued token when the second call arrives, which breaks the OAuth flow.

Users were reporting issues with logging in on their mobile phones, and when I dug deeper using a web debugging proxy I noticed that the subsequent request for fetching the issued OAuth token failed. When I manually replayed the request for fetching the token, it succeeded because replication had completed by that point.
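
To make the timing concrete, here is a toy model of the race in plain Ruby. It is only a sketch: the `Primary` and `LaggingReplica` classes, the `grant:abc` key and the `token-123` value are made up for illustration and have nothing to do with Mastodon's actual code. Timestamps are passed in explicitly so the result does not depend on real sleeping.

```ruby
# Toy model of asynchronous replication: a write is visible on the primary
# immediately, but on the replica only after `lag` seconds have passed.
class LaggingReplica
  def initialize(lag:)
    @lag = lag
    @rows = {} # key => [value, time at which the replica can see it]
  end

  # Called by the primary for every write it accepts.
  def replicate(key, value, now)
    @rows[key] = [value, now + @lag]
  end

  # Returns the value only if replication has caught up by `now`, else nil.
  def read(key, now)
    value, visible_at = @rows[key]
    value if visible_at && now >= visible_at
  end
end

class Primary
  def initialize(replica)
    @replica = replica
    @rows = {}
  end

  def write(key, value, now)
    @rows[key] = value
    @replica.replicate(key, value, now)
  end

  def read(key)
    @rows[key]
  end
end

replica = LaggingReplica.new(lag: 0.5) # upper end of the 50-500 ms lag
primary = Primary.new(replica)

t0 = 0.0
primary.write("grant:abc", "token-123", t0) # authorization step writes the grant

first_try = replica.read("grant:abc", t0 + 0.02) # app follows up 20 ms later
replayed  = replica.read("grant:abc", t0 + 1.0)  # manual replay one second later

puts "first try: #{first_try.inspect}"
puts "replayed:  #{replayed.inspect}"
```

The first read misses because it lands inside the lag window, while the replayed read finds the row, which matches what the debugging proxy showed.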

Expected behaviour

POST /oauth/token should return the token immediately

Actual behaviour

POST /oauth/token appears to query through the configured replica connection, where the token isn't replicated yet

Detailed description

My database.yml:

default: &default
  adapter: postgresql
  pool: <%= ENV["DB_POOL"] || ENV['MAX_THREADS'] || 5 %>
  timeout: 5000
  encoding: unicode
  sslmode: <%= ENV['DB_SSLMODE'] || "prefer" %>

development:
  <<: *default
  database: <%= ENV['DB_NAME'] || 'mastodon_development' %>
  username: <%= ENV['DB_USER'] %>
  password: <%= (ENV['DB_PASS'] || '').to_json %>
  host: <%= ENV['DB_HOST'] %>
  port: <%= ENV['DB_PORT'] %>

# Warning: The database defined as "test" will be erased and
# re-generated from your development database when you run "rake".
# Do not set this db to the same as development or production.
test:
  <<: *default
  database: <%= ENV['DB_NAME'] || 'mastodon' %>_test<%= ENV['TEST_ENV_NUMBER'] %>
  username: <%= ENV['DB_USER'] %>
  password: <%= (ENV['DB_PASS'] || '').to_json %>
  host: <%= ENV['DB_HOST'] %>
  port: <%= ENV['DB_PORT'] %>

production:
  <<: *default
  adapter: postgresql_makara
  prepared_statements: false
  makara:
    id: postgres
    sticky: true
    connections:
      - role: master
        blacklist_duration: 0
        url: postgresql://<%= ENV['DB_USER'] || 'mastodon' %>:<%= (ENV['DB_PASS'] || '') %>@<%= ENV['DB_HOST'] || 'localhost' %>:<%= ENV['DB_PORT'] || 5432 %>/<%= ENV['DB_NAME'] || 'mastodon_production' %>?sslmode=require
      - role: slave
        url: postgresql://<%= ENV['DB_USER'] || 'mastodon' %>:<%= (ENV['DB_PASS'] || '') %>@<%= ENV['DB_HOST_RO'] || 'localhost' %>:<%= ENV['DB_PORT'] || 5432 %>/<%= ENV['DB_NAME'] || 'mastodon_production' %>?sslmode=require

I've worked around the issue by redirecting all calls to /oauth/* and /api/v1/accounts/verify_credentials to a separate web container that isn't configured with a slave connection.
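
The same routing rule could in principle live inside the application instead of in front of it. The following is a rough sketch, not something taken from Mastodon: `primary_only?`, `PrimaryPinning` and the `pin` callable are hypothetical names. The middleware only relies on the Rack `call(env)` convention, and what `pin` would actually do (forcing the current request onto the primary connection) depends on the database proxy in use and is deliberately left open.

```ruby
# Paths from the workaround above that should never read from the replica.
PRIMARY_ONLY_PATTERNS = [
  %r{\A/oauth/},
  %r{\A/api/v1/accounts/verify_credentials\z}
].freeze

def primary_only?(path)
  PRIMARY_ONLY_PATTERNS.any? { |pattern| pattern.match?(path) }
end

# Rack-style middleware: calls `pin` before handing time-sensitive requests
# to the rest of the stack, and passes every other request through untouched.
class PrimaryPinning
  def initialize(app, pin:)
    @app = app
    @pin = pin # hypothetical hook that forces this request onto the primary
  end

  def call(env)
    path = env["PATH_INFO"].to_s
    @pin.call(path) if primary_only?(path)
    @app.call(env)
  end
end

# Usage with a stand-in app and a pin hook that only records what it saw.
pinned = []
app = ->(_env) { [200, {}, ["ok"]] }
stack = PrimaryPinning.new(app, pin: ->(path) { pinned << path })

stack.call({ "PATH_INFO" => "/oauth/token" })
stack.call({ "PATH_INFO" => "/api/v1/timelines/home" })
puts pinned.inspect
```

Like the container-based workaround, this only covers the listed paths, so any other request made right after the token is issued can still hit a stale replica.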

However, if the user is quick at tapping buttons, there's still a chance that the timeline can't be fetched either, since from the replica's point of view the token isn't valid yet.

Specifications

Image:         tootsuite/mastodon:v3.5.3
Image ID:      docker.io/tootsuite/mastodon@sha256:bd1d81cd1948cb29e9fd640c902d377fc84fe2567589aca0c290b760d57b6a2b

How would I go about solving this? Would this need fixing in the application, like steering certain time-sensitive calls to the primary DB? A 50-500ms replication lag appears healthy, but it does cause some problems and I don't think I can reduce the lag.
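
One shape such an application-level fix could take is a fallback read: look on the replica first and retry on the primary only when the row is missing. This is a sketch under assumptions, not how Mastodon or its OAuth library handles token lookups; `find_with_primary_fallback` is a hypothetical helper, and plain hashes stand in for the two database connections.

```ruby
# Looks a key up on the replica and retries on the primary only on a miss.
# Returns nil when neither side knows the key.
def find_with_primary_fallback(key, replica:, primary:)
  replica.fetch(key) { primary[key] }
end

primary_rows = { "grant:abc" => "token-123" } # already written on the primary
replica_rows = {}                             # replication has not caught up yet

found   = find_with_primary_fallback("grant:abc", replica: replica_rows, primary: primary_rows)
unknown = find_with_primary_fallback("grant:zzz", replica: replica_rows, primary: primary_rows)

puts found.inspect
puts unknown.inspect
```

The trade-off is that every lookup for a token that really does not exist costs one extra query on the primary, so this would only make sense for a small set of time-sensitive reads.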

jorijn, Nov 12 '22 07:11