citus icon indicating copy to clipboard operation
citus copied to clipboard

`alter_table_set_access_method` AccessExclusiveLock

Open jprudent opened this issue 1 year ago • 1 comments

Hello,

When I call alter_table_set_access_method there is some period of time where there is an AccessExclusiveLock set on the parent table.

This lock can be set for a while because I'm working on very large 0.3T bytes partitions.

# SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    l.locktype,
    l.mode,
    l.granted
FROM
    pg_locks l
JOIN
    pg_class c ON l.relation = c.oid
JOIN
    pg_namespace n ON c.relnamespace = n.oid
WHERE
    c.relname = 'calls';
 schema_name | table_name | locktype |        mode         | granted 
-------------+------------+----------+---------------------+---------
 v1          | calls      | relation | AccessShareLock     | t
 v1          | calls      | relation | AccessExclusiveLock | t
 v1          | calls      | relation | AccessShareLock     | f

Even table metadata are not accessible: \dt+ v1.calls will just wait for the lock release.

From PG documentation:

Only an ACCESS EXCLUSIVE lock blocks a SELECT (without FOR UPDATE/SHARE) statement.

This is problematic in my usecase because PG cannot serve data to end user.

I forgot to mention that I use citus on single node, I'm mainly interested on columnar storage.

Here is a few suggestions, not sure what they imply:

  • Add a parameter to choose less drastic lock mechanism that permits read. I may have read inconsistencies for a short period, but I would be fine with it I think.
  • Reduce the scope of the lock (maybe it's mandatory for the partition swapping ?). My application can guarantee it won't write in old partition. If so that's a bug on my side, I can take the blame.

PS: I found this bug, somehow related https://github.com/citusdata/citus/issues/7177 (I think the deadlock comes from the same lock I describe here)

Thanks

jprudent avatar Jun 20 '24 12:06 jprudent

After investigating further ...

It looks like this is not Citus that explicitely set the lock.

This is pseudo SQL of the what does Citus when a table is converted

begin;

-- create the target partition with new storage without any index, constraints, ...
create table new_partition;
-- copy the data
insert into new_partition select * from partition;
-- 🔒 the lock is set here 
alter table parent detach partition
drop table partition;
alter table new_partition rename to partition;
-- ⌚ this can be very long:
alter table partition create index, constraints;
alter table parent attach partition;
-- 🔓 the lock is released here
commit;

The lock on the parent table is granted automatically by PG when we detach the partition (which is ok); and it's released once the tx ends.

If we could exclude the creation of index, constraints, ... from the time the lock is granted, that would greatly reduce the time my application freeze to a few seconds, just to do:

alter table parent detach partition;
drop table partition;
alter table new_partition rename to partition;
alter table parent attach partition;

Is it possible to create the index and constraints just after the copy of the data, (and just before the detach that creates the lock) ? That would considerably decrease the locking time.

jprudent avatar Jun 21 '24 09:06 jprudent

We implemented what I described.

It's quite naive because we create the partition using the LIKE shortcut (which is probably not adapted to more circumvent models). Also, there is an obvious safety trade off.

So for that guy:

-- This function won't lock tables unnecessarily. So it's a bit unsafe to use it:
-- Caller must ensure that partition_table is not changed during the execution of this function.
-- Any alteration to partition_table during the execution of this function will result in data loss.
-- This is a necessary evil to privilege uptime over consistency.
CREATE OR REPLACE FUNCTION common.my_alter_table_set_access(schema text, partition_table text, access_method text) RETURNS void AS
$$
DECLARE
    temp_partition_table    text;
    current_partition_access_method   text;
    fq_temp_partition_table text;
    fq_partition_table      text;
    parent_table            text;
    fq_parent_table         text;
    partition_from_value    text;
    partition_to_value    text;
BEGIN

    EXECUTE format(
            'SELECT pg_am.amname FROM pg_class JOIN pg_am ON pg_class.relam = pg_am.oid WHERE pg_class.relname = %L and pg_class.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = %L)',
            partition_table, schema) INTO current_partition_access_method;
    IF current_partition_access_method = access_method THEN
        RAISE EXCEPTION 'The table % already uses the % access method', partition_table, access_method;
    END IF;

    EXECUTE format(
            'SELECT parent.relname FROM pg_inherits JOIN pg_class parent ON parent.oid = inhparent JOIN pg_class child ON child.oid = inhrelid WHERE child.relname = %L and child.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = %L)',
            partition_table, schema) INTO parent_table;
    IF parent_table IS NULL THEN
        RAISE EXCEPTION 'Parent table of % not found', partition_table;
    END IF;

    temp_partition_table := partition_table || '_copy';
    fq_temp_partition_table := format('%I.%I', schema, temp_partition_table);
    fq_partition_table := format('%I.%I', schema, partition_table);
    fq_parent_table := format('%I.%I', schema, parent_table);

    EXECUTE format('select from_value, to_value from time_partitions where partition = %L::regclass', fq_partition_table) INTO partition_from_value, partition_to_value;

    EXECUTE format('CREATE TABLE %s (LIKE %s INCLUDING ALL) USING %s', fq_temp_partition_table, fq_partition_table, access_method);
    EXECUTE format('INSERT INTO %s SELECT * FROM %s', fq_temp_partition_table, fq_partition_table);

    -- An exclusive lock is required from here
    EXECUTE format('ALTER TABLE %s DETACH PARTITION %s', fq_parent_table, fq_partition_table);
    EXECUTE format('DROP TABLE %s', fq_partition_table);
    EXECUTE format('ALTER TABLE %s RENAME TO %s', fq_temp_partition_table, partition_table);
    -- Exclusive lock is released here
    EXECUTE format('ALTER TABLE %s ATTACH PARTITION %s FOR VALUES FROM (%L) TO (%L)', fq_parent_table, fq_partition_table, partition_from_value, partition_to_value);
END;
$$ LANGUAGE plpgsql;

jprudent avatar Jul 10 '24 09:07 jprudent