activerecord-safer_migrations
activerecord-safer_migrations copied to clipboard
Retrying migrations
Genuine question: have you considered automatically retrying migrations if they fail because of a lock timeout?
I'm asking this because we (@doctolib) faced a similar issue as the one described in Zero-downtime Postgres migrations - the hard parts. We lowered PG's lock timeout but our migrations sometimes hit this timeout which makes our deployment process fail. And we typically want to make this process as reliable as possible.
Not sure at all this "retry" idea is a good one (I can see several downsides) but at least, it sounds feasible under certain conditions (for instance, the migration would have to be idempotent—which is the case if it's wrapped in a transaction).
Hi @matthieuprat, sorry for the mega slow response!
We're currently looking at the data for why schema migrations fail to apply at GoCardless - in particular when they run into the lock or statement timeout. Our initial impression is that adding some sort of retry mechanism would help reduce the number of failures that need manual retry.
Anything we implement around retries will live on a branch or in an alpha release until we're happy that they're valuable.
We'll keep this ticket updated with what we find out.
Thank you for your reply!
We ended up rolling our own solution that we open-sourced last week: safe-pg-migrations. It's not ready for prime time yet but it implements a retry mechanism for statements that needs to acquire an ACCESS EXCLUSIVE
lock. It also logs blocking statements if a blocked statement fails with a lock timeout.
From our experience, apart from long transactions, lengthly vacuum freeze
on huge tables are causing lock timeouts.
+1 for this feature. On really busy OLTP systems it's extremely likely to hit lock_timeout
, but just as likely that within 5 minutes of brief retries you will indeed get the lock you need. In our env we have several tools that use this exact pattern for anything requiring ACCESS EXCLUSIVE
.