
Advisory locks not released on failed migrations

Open Runi-c opened this issue 3 years ago • 9 comments

Bug description

I'm developing against an Amazon RDS PostgreSQL instance.

I've noticed that whenever a migration fails for an expected reason (e.g. I need to edit the migration to handle a newly required column), I'm unable to run subsequent migrations due to advisory lock timeouts. Each attempt opens a new connection, and those connections stay open for an unspecified period (10-15 minutes?). Development can't continue during that time without taking some other action, such as restarting the database instance.

[Screenshot: pg_stat_activity showing each migration attempt as a new connection waiting on the advisory lock]

You can see in the above image that each attempt at migrating opens a new connection waiting on the advisory lock. I was unable to complete the migration to resume development during that period. I made several more attempts just to confirm the issue and demonstrate the problem.

How to reproduce

  1. Set up an Amazon RDS PostgreSQL instance (small ones are free), though this problem should be reproducible in other environments with PostgreSQL as well.
  2. Initialize with some basic schema.
  3. Insert some data.
  4. Make a change to the schema that requires manual intervention (e.g. add required column).
  5. Try running prisma migrate dev, have it fail.
  6. Try running prisma migrate dev --create-only as suggested; it should time out waiting on the advisory lock.
  7. All subsequent migrations will fail. The advisory lock is held for an extended period, and each command adds a new open connection waiting on a lock. Development cannot resume until the database server finally drops the connections.

Expected behavior

Prisma Migrate should explicitly release the locks to prevent this interruption to the workflow.

Prisma information

N/A

Environment & setup

  • OS: Windows 10
  • Database: PostgreSQL
  • Node.js version: v16.14.2

Prisma Version

prisma                  : 3.12.0
@prisma/client          : 3.12.0
Current platform        : windows
Query Engine (Node-API) : libquery-engine 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\query_engine-windows.dll.node)
Migration Engine        : migration-engine-cli 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\migration-engine-windows.exe)
Introspection Engine    : introspection-core 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\introspection-engine-windows.exe)
Format Binary           : prisma-fmt 22b822189f46ef0dc5c5b503368d1bee01213980 (at node_modules\@prisma\engines\prisma-fmt-windows.exe)
Default Engines Hash    : 22b822189f46ef0dc5c5b503368d1bee01213980
Studio                  : 0.459.0

Runi-c avatar Apr 25 '22 22:04 Runi-c

This happened to me when a task running a long migration was killed by ECS for failing health checks -- any new deployments timed out trying to acquire a postgres advisory lock (SELECT pg_advisory_lock(72707369)). Elapsed: 10000ms. See https://pris.ly/d/migrate-advisory-locking for details.

shawnjones253 avatar Aug 29 '23 04:08 shawnjones253

@shawnjones253 and others:

  • What's your PostgreSQL version here?
  • Are you sure that you are only running 1 instance of migrate on that database at that moment?
  • Is the migration a long-running migration? How much time does this migration take to complete?

Our docs on the advisory locking mechanism (this also applies to migrate dev):

This safeguard ensures that multiple commands cannot run at the same time - for example, if you merge two pull requests in quick succession. Advisory locking has a 10 second timeout (not configurable) and uses the default advisory locking mechanism available in the underlying provider.
Source: https://www.prisma.io/docs/concepts/components/prisma-migrate/migrate-development-production#advisory-locking

Usually a timeout happens because you already have migrate running on that database, and you started migrate again on the same database. To protect your database from concurrent execution, we use a lock.
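The session-level lock that migrate takes can be reproduced by hand to see the blocking behavior (a sketch; 72707369 is the key Prisma uses, as quoted in the error message earlier in this thread):

```sql
-- Session 1: take the session-level advisory lock Prisma Migrate uses
SELECT pg_advisory_lock(72707369);

-- Session 2: a non-blocking attempt returns false while session 1 holds it
SELECT pg_try_advisory_lock(72707369);

-- Session 1: release it explicitly; otherwise the lock is held until the
-- session (connection) ends, which is the failure mode described here
SELECT pg_advisory_unlock(72707369);
```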

Jolg42 avatar Aug 29 '23 08:08 Jolg42

Some related work on advisory locking happened recently that might help, but it comes with a warning: in https://github.com/prisma/prisma-engines/pull/4171 I added a new environment variable to disable advisory locking, which was needed for people using a Percona-XtraDB-Cluster or MariaDB Galera Cluster.

⚠️ In this case here, only use this environment variable if you are sure that you won't start migrate concurrently on your database. You must ensure that only 1 instance of migrate can run at any time when using migrate dev/deploy/resolve.

It can be used like (docs PR):

PRISMA_SCHEMA_DISABLE_ADVISORY_LOCK=1

Note: This is included in 5.3.0-dev.13 if you want to try it out. It will be released in the official 5.3.0 version on Tuesday, September 12th.

Jolg42 avatar Aug 29 '23 08:08 Jolg42

@Jolg42

  • postgresql 14.2
  • see below
  • yes, it ended up being a much longer migration than expected (hours). For context it was one update statement that was estimated to touch 12M rows with a join.

I believe I understand better what was happening now:

We run migrate when our container starts like: npm migrate && npm serve.

Because the first container to try the migration was killed for taking too long to respond to health checks, I assumed the DB calls started by the migration were also aborted and that subsequent containers would be able to attempt the migration. Upon further investigation I noticed the statement was still running in Postgres (by querying pg_stat_activity). So the session from the first container was still alive, and that's why the advisory lock hadn't been released. It had been running for about 30 minutes when I killed it, which released the lock and aborted the big update statement.
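A query along these lines surfaces the situation described above - the statement kept running even after the container was killed (a sketch; the runtime threshold is illustrative):

```sql
-- Long-running active statements that may still hold the advisory lock
SELECT pid, usename, state,
       now() - query_start AS runtime,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```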

shawnjones253 avatar Aug 30 '23 17:08 shawnjones253

@Roryl-c is restarting the RDS instance the only way to unlock it again after a failed migration? That'd be quite frustrating, especially in a production CI workflow, if any migration fails for whatever reason.

Did you resort to disabling advisory locking going forward?

malkhuzayyim avatar Oct 04 '23 19:10 malkhuzayyim

Same problem here. What was your solution?

peterHakio avatar Feb 09 '24 11:02 peterHakio

I am facing a similar issue on a managed instance from DigitalOcean. Is there any way to fix this without restarting the database, since it's in production?

brookmg avatar Jun 07 '24 13:06 brookmg

If you want more data on the lock, you could try this query:

SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa
    ON pl.pid = psa.pid WHERE objid = 72707369;

Example from a local db for me (after manually running SELECT pg_advisory_lock(72707369);)

Key columns from the result (one row; remaining columns omitted):

  locktype:         advisory
  objid:            72707369
  mode:             ExclusiveLock
  granted:          t
  pid:              13716
  usename:          prisma
  application_name: TablePlus
  client_addr:      192.168.65.1
  state:            active
  backend_type:     client backend
  query:            SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE objid = 72707369;

Docs: https://www.postgresql.org/docs/9.4/view-pg-locks.html

You can see in the code that 72707369 is what we use here https://github.com/prisma/prisma-engines/blob/9ee8c029573a3ce47bd5f103da9feaf8b6f62c53/schema-engine/connectors/sql-schema-connector/src/flavour/postgres.rs#L121-L126

-- This would get a lock (interesting to try out locally)
-- SELECT pg_advisory_lock(72707369);

-- Then to release the lock, see example from
-- https://stackoverflow.com/a/55197736/1345244

-- Get the pid
SELECT pid FROM pg_locks WHERE objid = 72707369;

-- Cancel (takes a while)
SELECT pg_cancel_backend(__pid_from_previous_query__);

-- Terminate
SELECT pg_terminate_backend(__pid_from_previous_query__);
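The lookup and the terminate can also be combined into one statement; use with care, since this terminates every backend associated with that lock key:

```sql
-- Terminate all backends holding (or queued on) the Prisma advisory lock
SELECT pg_terminate_backend(pid)
FROM pg_locks
WHERE locktype = 'advisory' AND objid = 72707369;
```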

Jolg42 avatar Jun 07 '24 14:06 Jolg42

This has been happening with the DigitalOcean App Platform since February 2024. Randomly, when new deploys come in, it sometimes locks the application by failing to acquire an advisory lock after 10s.

I suspect it's due to multiple instances of the application trying to acquire a lock concurrently, but I'm not quite sure how to confirm that.

I've been working around it by removing the locks and fixing the migrations, so I don't have to restart the database.

When that happens, what I have to do in order to fix it is this:

First, I check if there are advisory locks stuck:

SELECT 
	*
FROM pg_stat_activity 
where 
	query ilike '%72707369%' 
	and state = 'active'
	and wait_event_type = 'Lock'
	and wait_event = 'advisory';

If there are any, I kill all prisma advisory locks:

SELECT 
	pg_terminate_backend(pid)
FROM pg_stat_activity 
where 
	query ilike '%72707369%' 
	and state = 'active'
	and wait_event_type = 'Lock'
	and wait_event = 'advisory';

Then I look at the _prisma_migrations table to see which migration is marked as failed. At this step, I need to manually check whether the migration was actually executed - I've seen occasions where the migration succeeded but was marked as failed in the database. For instance, imagine the migration is 202408270000_add_column_y and its content is

ALTER TABLE table_x ADD COLUMN     "column_y" VARCHAR;

At that point, I'll try to run select column_y from table_x to confirm if the column exists or not. If the column doesn't exist, I'll delete the row for the migration 202408270000_add_column_y from _prisma_migrations.
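For the _prisma_migrations check described above, a query like this lists migrations that started but never finished (column names per Prisma's migrations table):

```sql
-- Migrations that began but were never marked as applied
SELECT migration_name, started_at, finished_at, logs
FROM "_prisma_migrations"
WHERE finished_at IS NULL
ORDER BY started_at DESC;
```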

The last step is to make sure that the migration code is not stuck running. In this case, I'll see if there are any queries running that are doing an Alter table, because the migration was doing that.

SELECT
    *
FROM pg_stat_activity
where
    state = 'active'
    and wait_event_type = 'Lock'
    and query like '%alter%';

If I find the migration code stuck there, I will kill that query.

SELECT
    pg_terminate_backend(pid)
FROM pg_stat_activity
where
    state = 'active'
    and wait_event_type = 'Lock'
    and query like '%alter%';

After that, you should be able to try to run the migrations again.
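Before retrying, it's worth confirming that nothing still holds or waits on the lock key:

```sql
-- Should return zero rows once the advisory lock is fully released
SELECT pid, granted
FROM pg_locks
WHERE locktype = 'advisory' AND objid = 72707369;
```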

I think that if you remove concurrency from the equation it should work fine, so maybe move the migration step to your CI/CD pipeline, as recommended in the docs.

Lipemenezes avatar Aug 27 '24 19:08 Lipemenezes

Same problem here, also running into this when multiple services concurrently try to get the lock and there is a failure.

rgmvisser avatar Jan 30 '25 14:01 rgmvisser