Autovacuum takes hours on PostgreSQL
Preflight checklist
- [X] I could not find a solution in the existing issues, docs, nor discussions.
- [X] I agree to follow this project's Code of Conduct.
- [X] I have read and am following this repository's Contribution Guidelines.
- [ ] This issue affects my Ory Cloud project.
- [ ] I have joined the Ory Community Slack.
- [ ] I am signed up to the Ory Security Patch Newsletter.
Describe the bug
We have a refresh token table with millions of tokens, and the bigger it grows, the longer autovacuum takes (many hours).
Recently it has gotten worse: PostgreSQL runs a dozen autovacuums per day, each taking more than one hour. As a result, the database is slowing down and writing heavily to disk.
Looking at pg_stat_all_tables, we were surprised to see that, since the beginning, this table has accumulated 20 million inserts but 500 million updates!
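For reference, the churn and autovacuum pressure can be inspected with a query like the following (a sketch using only stock pg_stat_all_tables columns):

```sql
-- Cumulative churn and vacuum activity for the refresh token table
SELECT n_tup_ins,        -- rows inserted since the stats were last reset
       n_tup_upd,        -- rows updated (each update leaves a dead tuple behind)
       n_dead_tup,       -- dead tuples currently waiting to be vacuumed
       last_autovacuum,  -- when autovacuum last processed this table
       autovacuum_count  -- how many autovacuum runs the table has seen
FROM pg_stat_all_tables
WHERE relname = 'hydra_oauth2_refresh';
```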
After a long analysis, we found that the following query updates more rows than needed:
```sql
UPDATE hydra_oauth2_refresh SET active=false WHERE request_id=?
```
To detect malicious reuse of a refresh token, Hydra keeps previous refresh tokens and sets their status to inactive.
However, if a user accumulates, for example, 20 refresh tokens RT1, RT2, ..., RT20 (only RT20 being active) and then refreshes the token to obtain RT21, all 20 previous tokens are matched by the update query, including the refresh tokens that are already inactive!
With MySQL, only rows whose values actually change cause writes to disk, so the overhead is mainly CPU. With PostgreSQL, however, a new version of the row is created even when its values did not change, and the old versions must be garbage-collected later by the autovacuum process.
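This is easy to verify on a toy table (a standalone sketch, not Hydra's schema): in PostgreSQL, even an UPDATE that sets a column to its current value writes a new row version.

```sql
CREATE TABLE t (id int PRIMARY KEY, active boolean);
INSERT INTO t VALUES (1, false);

SELECT xmin, ctid FROM t;                  -- note the row's creating transaction and location
UPDATE t SET active = false WHERE id = 1;  -- the value does not change...
SELECT xmin, ctid FROM t;                  -- ...yet xmin/ctid differ: a dead tuple was left behind
```

PostgreSQL ships `suppress_redundant_updates_trigger()` for exactly this situation, but a more selective WHERE clause (see below) avoids the dead tuples at the source.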
Thus, the number of dead rows grows with every refresh: obtaining the n-th token rewrites the n-1 previous ones, so the total follows the arithmetic series 1 + 2 + 3 + ... + (n-1) = n(n-1)/2.
Reproducing the bug
This issue still arises on the v2 branch. Here are the stats after each token refresh:
```sql
SELECT n_tup_ins, n_tup_upd FROM pg_stat_all_tables WHERE relname = 'hydra_oauth2_refresh';
```
| n_tup_ins | n_tup_upd |
|-----------|-----------|
| 1         | 0         |
| 2         | 1         |
| 3         | 3         |
| 4         | 6         |
| 5         | 10        |
| 6         | 15        |
| 7         | 21        |
The cumulative number of updates follows the arithmetic series 0 + 1 + 2 + ... + (n-1) = n(n-1)/2, since each new token rewrites all of its predecessors.
Relevant log output
No response
Relevant configuration
No response
Version
1.10.x / v2.x
On which operating system are you observing this issue?
No response
In which environment are you deploying?
No response
Additional Context
To fix this bug, a more selective query is needed that only deactivates the tokens that are still active:
```sql
UPDATE hydra_oauth2_refresh SET active=false WHERE request_id=? AND active=true
```
Great find! That makes a lot of sense; we probably need to add an index for this too. Would you be willing to supply a PR for this? If so, please make it against the v2.x branch, as we no longer accept SQL changes to master!
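For illustration, a partial index along these lines would serve that query (a hypothetical sketch; the index name is made up and the actual migration may differ):

```sql
-- Hypothetical partial index: only still-active tokens are indexed, so
-- UPDATE ... WHERE request_id = ? AND active = true can locate its target
-- rows without wading through the inactive history.
CREATE INDEX CONCURRENTLY hydra_oauth2_refresh_request_id_active_idx
    ON hydra_oauth2_refresh (request_id)
    WHERE active;
```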
@aeneasr do you know if this fix has been implemented for v2.x? If not, we would like to contribute it =D
We're beginning to see the same symptoms.
And could we remove inactive rows to reduce the problem in the meantime? (I don't know if there are any cascading issues with doing that.)
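If deleting is viable, a batched cleanup might look roughly like this (an untested sketch; the requested_at column and the 30-day window are assumptions, and foreign keys plus the reuse-detection window should be checked first):

```sql
-- Hypothetical cleanup: drop tokens that have been inactive long enough
-- that replay detection no longer needs them. Verify FK constraints and
-- Hydra's expectations before running anything like this in production.
DELETE FROM hydra_oauth2_refresh
WHERE active = false
  AND requested_at < now() - interval '30 days';
```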