Autovacuum takes hours on PostgreSQL
Preflight checklist
- [X] I could not find a solution in the existing issues, docs, nor discussions.
- [X] I agree to follow this project's Code of Conduct.
- [X] I have read and am following this repository's Contribution Guidelines.
- [ ] This issue affects my Ory Cloud project.
- [ ] I have joined the Ory Community Slack.
- [ ] I am signed up to the Ory Security Patch Newsletter.
Describe the bug
We have a refresh token table with millions of tokens, and the bigger it grows, the longer autovacuum takes (many hours).
Recently it has gotten worse: PostgreSQL runs a dozen autovacuums per day, each taking more than one hour. As a result, the database is slowing down and writing heavily to disk.
Looking at pg_stat_all_tables, we were surprised to see that, since the beginning, this table has accumulated 20 million inserts but 500 million updates!
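For reference, the churn and autovacuum pressure can be inspected with a query like the following (a sketch using only stock pg_stat_all_tables columns):

```sql
-- Cumulative churn and vacuum activity for the refresh token table
SELECT n_tup_ins,        -- rows inserted since the stats were last reset
       n_tup_upd,        -- rows updated (each update leaves a dead tuple behind)
       n_dead_tup,       -- dead tuples currently waiting to be vacuumed
       last_autovacuum,  -- when autovacuum last processed this table
       autovacuum_count  -- how many autovacuum runs the table has seen
FROM pg_stat_all_tables
WHERE relname = 'hydra_oauth2_refresh';
```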
After a long analysis, we found that the following query updates more rows than needed:
```sql
UPDATE hydra_oauth2_refresh SET active=false WHERE request_id=?
```
To detect malicious reuse of a refresh token, Hydra keeps previous refresh tokens and sets their status to inactive.
However, if a user accumulates, for example, 20 refresh tokens RT1, RT2, ..., RT20 (only RT20 being active) and then refreshes the token to obtain RT21, all 20 previous tokens are matched by the update query, including the refresh tokens that are already inactive!
With MySQL, only rows whose values actually change cause writes to disk, so the overhead is mainly CPU. With PostgreSQL, however, a new version of the row is created even when its values did not change, and the old versions must be garbage-collected later by the autovacuum process.
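This is easy to verify on a toy table (a standalone sketch, not Hydra's schema): in PostgreSQL, even an UPDATE that sets a column to its current value writes a new row version.

```sql
CREATE TABLE t (id int PRIMARY KEY, active boolean);
INSERT INTO t VALUES (1, false);

SELECT xmin, ctid FROM t;                  -- note the row's creating transaction and location
UPDATE t SET active = false WHERE id = 1;  -- the value does not change...
SELECT xmin, ctid FROM t;                  -- ...yet xmin/ctid differ: a dead tuple was left behind
```

PostgreSQL ships `suppress_redundant_updates_trigger()` for exactly this situation, but a more selective WHERE clause (see below) avoids the dead tuples at the source.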
Thus, the number of dead rows grows with every refresh: obtaining the n-th token rewrites the n-1 previous ones, so the total follows the arithmetic series 1 + 2 + 3 + ... + (n-1) = n(n-1)/2.
Reproducing the bug
This issue still arises on the v2 branch. Here are the stats after each token refresh:
```sql
SELECT n_tup_ins, n_tup_upd FROM pg_stat_all_tables WHERE relname = 'hydra_oauth2_refresh';
```
| n_tup_ins | n_tup_upd |
|-----------|-----------|
| 1         | 0         |
| 2         | 1         |
| 3         | 3         |
| 4         | 6         |
| 5         | 10        |
| 6         | 15        |
| 7         | 21        |
The cumulative number of updates follows the arithmetic series 0 + 1 + 2 + ... + (n-1) = n(n-1)/2, since each new token rewrites all of its predecessors.
Relevant log output
No response
Relevant configuration
No response
Version
1.10.x / v2.x
On which operating system are you observing this issue?
No response
In which environment are you deploying?
No response
Additional Context
To fix this bug, a more selective query is needed that only deactivates the tokens that are still active:
```sql
UPDATE hydra_oauth2_refresh SET active=false WHERE request_id=? AND active=true
```
Great find! That makes a lot of sense; we probably need to add an index for this too. Would you be willing to supply a PR for this? If so, please make it against the v2.x branch, as we no longer accept SQL changes to master!
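For illustration, a partial index along these lines would serve that query (a hypothetical sketch; the index name is made up and the actual migration may differ):

```sql
-- Hypothetical partial index: only still-active tokens are indexed, so
-- UPDATE ... WHERE request_id = ? AND active = true can locate its target
-- rows without wading through the inactive history.
CREATE INDEX CONCURRENTLY hydra_oauth2_refresh_request_id_active_idx
    ON hydra_oauth2_refresh (request_id)
    WHERE active;
```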
@aeneasr do you know if this fix has been implemented for v2.x? If not, we would like to contribute it =D
We're beginning to see the same symptoms.
And could we remove inactive rows to reduce the problem in the meantime? (I don't know if there are any cascading issues with doing that.)
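If deleting is viable, a batched cleanup might look roughly like this (an untested sketch; the requested_at column and the 30-day window are assumptions, and foreign keys plus the reuse-detection window should be checked first):

```sql
-- Hypothetical cleanup: drop tokens that have been inactive long enough
-- that replay detection no longer needs them. Verify FK constraints and
-- Hydra's expectations before running anything like this in production.
DELETE FROM hydra_oauth2_refresh
WHERE active = false
  AND requested_at < now() - interval '30 days';
```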