Rebalance bug with local tables on citus 10.2

Open gledis69 opened this issue 8 months ago • 2 comments

I can't seem to reproduce the issue, but these are my observations:

Local table A with a shard A_1234 on the coordinator. Reference table B has a fk to the local table. Distributed table C has a fk to the reference table.
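A minimal sketch of the setup described above, in case it helps someone attempt a reproduction. The column names, the `c_to_b_fk` constraint name, and the host names are assumptions made for illustration; only the table roles and the `B_to_A_fk` constraint come from the report. It also assumes the coordinator has to be registered in the metadata before a reference table can reference a local table.

```sql
-- Assumption: register the coordinator so that local tables can take part
-- in foreign keys with reference tables. Host and port are placeholders.
SELECT citus_set_coordinator_host('coordinator-host', 5432);

-- Unquoted identifiers fold to lower case, so A, B, C become a, b, c.
CREATE TABLE a (id bigint PRIMARY KEY);
CREATE TABLE b (id bigint PRIMARY KEY, a_id bigint);
CREATE TABLE c (id bigint PRIMARY KEY, b_id bigint);

-- B is a reference table with a foreign key to the local table A.
SELECT create_reference_table('b');
ALTER TABLE b ADD CONSTRAINT b_to_a_fk
    FOREIGN KEY (a_id) REFERENCES a (id);

-- C is a distributed table with a foreign key to the reference table B.
SELECT create_distributed_table('c', 'id');
ALTER TABLE c ADD CONSTRAINT c_to_b_fk
    FOREIGN KEY (b_id) REFERENCES b (id);

-- Add a new worker (placeholder host), inspect the plan, then rebalance.
SELECT citus_add_node('new-worker-host', 5432);
SELECT * FROM get_rebalance_table_shards_plan();
SELECT rebalance_table_shards();
```

This may well not trigger the failure, since the original report could not be reproduced either; it only mirrors the described table layout.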

When a new node is added and a rebalance is triggered, the operation fails with "Table A_1234 not found". The error is received from a connection on the newly added node.

Looking at the logs of the new node, these things stand out:

LOG: logical replication table synchronization worker for subscription "...", table "A_3456" has started.

LOG: logical replication table synchronization worker for subscription "...", table "A_3456" has finished.

SELECT worker_apply_inter_shard_ddl_command(7890, 'public', 1234, 'public', 'ALTER TABLE public.B ADD CONSTRAINT B_to_A_fk FOREIGN KEY (A_id) REFERENCES public.A(id) NOT VALID')

The rebalance attempts to create the fk between the reference table shard on the new node and the local table shard on the coordinator (A_1234). This naturally fails, because A_1234 only exists on the coordinator. Also, a logical replication subscription is created for some A_3456 shard. This shard does not seem to exist anywhere else in the cluster.

The A table and its shard are not involved in the rebalance plan generated by get_rebalance_table_shards_plan, and yet it seems like the rebalance is trying to copy some shard A_3456 (??and succeeds??), then tries to create the constraint between the B_7890 shard and A_1234.
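For anyone checking the same thing, one way to see which shards and placements the metadata actually records for the local table is to query the Citus catalogs on the coordinator. This is only a sketch: it assumes the table is named `a` in the `public` schema, as in the description above.

```sql
-- List every shard of table a known to the metadata, with its placement.
-- A shard that appears in the new node's logs but not in this result
-- is not tracked by the coordinator's metadata.
SELECT s.shardid,
       p.nodename,
       p.nodeport,
       p.shardstate
FROM pg_dist_shard AS s
LEFT JOIN pg_dist_shard_placement AS p USING (shardid)
WHERE s.logicalrelid = 'public.a'::regclass
ORDER BY s.shardid;
```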

gledis69 avatar Dec 30 '23 17:12 gledis69

The rebalancer has been heavily refactored and improved since 10.2, so it might be worth trying this out with a newer Citus version.

JelteF avatar Dec 30 '23 18:12 JelteF

@JelteF do you think upgrading to 11 would be sufficient? We want to minimize the amount of change on their workload.

hanefi avatar Jan 11 '24 11:01 hanefi