activerecord-safer_migrations icon indicating copy to clipboard operation
activerecord-safer_migrations copied to clipboard

Retrying migrations

Open matthieuprat opened this issue 6 years ago • 3 comments

Genuine question: have you considered automatically retrying migrations if they fail because of a lock timeout?

I'm asking this because we (@doctolib) faced a similar issue as the one described in Zero-downtime Postgres migrations - the hard parts. We lowered PG's lock timeout but our migrations sometimes hit this timeout which makes our deployment process fail. And we typically want to make this process as reliable as possible.

Not sure at all this "retry" idea is a good one (I can see several downsides) but at least, it sounds feasible under certain conditions (for instance, the migration would have to be idempotent—which is the case if it's wrapped in a transaction).

matthieuprat avatar Jun 13 '18 09:06 matthieuprat

Hi @matthieuprat, sorry for the mega slow response!

We're currently looking at the data for why schema migrations fail to apply at GoCardless - in particular when they run into the lock or statement timeout. Our initial impression is that adding some sort of retry mechanism would help reduce the number of failures that need manual retry.

Anything we implement around retries will live on a branch or in an alpha release until we're happy that they're valuable.

We'll keep this ticket updated with what we find out.

Sinjo avatar Oct 03 '18 15:10 Sinjo

Thank you for your reply!

We ended up rolling our own solution that we open-sourced last week: safe-pg-migrations. It's not ready for prime time yet but it implements a retry mechanism for statements that needs to acquire an ACCESS EXCLUSIVE lock. It also logs blocking statements if a blocked statement fails with a lock timeout.

From our experience, apart from long transactions, lengthly vacuum freeze on huge tables are causing lock timeouts.

matthieuprat avatar Oct 15 '18 08:10 matthieuprat

+1 for this feature. On really busy OLTP systems it's extremely likely to hit lock_timeout, but just as likely that within 5 minutes of brief retries you will indeed get the lock you need. In our env we have several tools that use this exact pattern for anything requiring ACCESS EXCLUSIVE.

jfinzel avatar Jan 08 '19 17:01 jfinzel