
An unexpected error when re-adding a node

Open duerwuyi opened this issue 11 months ago • 1 comments

How To Reproduce

Citus version: Citus 12.1.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

The setup needs a master database, a worker database, and a manager database. I use docker-compose to reproduce the error. It can be reproduced on my computer step by step:

1. Create the database on the Master node

Master node (db to connect: postgres): drop database testdb with (force);
Master node (db to connect: postgres): create database testdb;
Master node (db to connect: testdb): create extension citus;

2. Create the database on the worker node and add it to the Master node

Worker node (db to connect: postgres): drop database testdb with (force);
Worker node (db to connect: postgres): create database testdb;
Worker node (db to connect: testdb): create extension citus;
Master node (db to connect: testdb): select * from citus_add_node('citus-worker-1', 5432); # replace with your worker IP and port

3. Redo step 2

Worker node (db to connect: postgres): drop database testdb with (force);
Worker node (db to connect: postgres): create database testdb;
Worker node (db to connect: testdb): create extension citus;
Note (Master node, db to connect: testdb): SELECT master_get_active_worker_nodes(); returns null this time. When I open a new connection to execute citus_add_node, it returns successfully.
Master node (db to connect: testdb): select * from citus_add_node('citus-worker-1', 5432); # replace with your worker IP and port

4. Error

When running SELECT master_get_active_worker_nodes(); on the Master node, I correctly get 'citus-worker-1', 5432. Then I tried to use psql to load a schema into the Master node:
psql -w -h localhost -U postgres -p 5432 testdb < /tmp/pg_schema_bk.sql 1> /dev/null
Although I got

ERROR:  This is an internal Citus function can only be used in a distributed transaction
CONTEXT:  while executing command on citus-worker-1:5432

but I still found that the schema was loaded correctly. But when I tried to distribute a table by executing SELECT create_distributed_table('t0', 'vkey'); I got the same error:

ERROR:  This is an internal Citus function can only be used in a distributed transaction
CONTEXT:  while executing command on citus-worker-1:5432

I consider it an unexpected error because the "redo" step ends with a successfully executed query that adds the node to the Master. If Citus does not support this sequence of operations, it should raise the error earlier; otherwise, the final query to distribute the table should succeed.

duerwuyi avatar Jan 03 '25 08:01 duerwuyi

Have you tried calling citus_remove_node() before dropping the database on the worker? It seems the cluster gets into an inconsistent state with the suggested steps, which are not the proper procedure for updating worker nodes. Do you have more info to share with us?
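
One way to try that ordering is sketched below. This is an untested illustration, not a confirmed fix: the `run_sql` and `readd_worker` helpers, the `DRY_RUN` switch, and the environment variable names are all hypothetical, and the script assumes it runs somewhere that can reach both nodes by the same host names and port the coordinator uses (for example, inside the docker-compose network). By default it only prints each statement with its target node and database so the sequence can be reviewed first. The last statement reads `pg_dist_node` so the resulting metadata can be shared here.

```shell
#!/bin/sh
# Sketch only: detach the worker from the coordinator's metadata BEFORE
# dropping its database, then re-create the database and register it again.
# Host names, the postgres user, and DRY_RUN are assumptions for illustration.

COORD_HOST="${COORD_HOST:-localhost}"         # coordinator ("Master node")
WORKER_HOST="${WORKER_HOST:-citus-worker-1}"  # worker name known to the coordinator
PORT="${PORT:-5432}"                          # assumed identical on both nodes
DB="${DB:-testdb}"
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print statements, do not execute

# run_sql HOST DATABASE SQL: execute one statement, or print it in dry-run mode.
run_sql() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s/%s: %s\n' "$1" "$2" "$3"
  else
    psql -w -v ON_ERROR_STOP=1 -h "$1" -p "$PORT" -U postgres -d "$2" -c "$3"
  fi
}

readd_worker() {
  # 1. Coordinator: remove the worker from the cluster metadata first.
  run_sql "$COORD_HOST" "$DB" "SELECT citus_remove_node('$WORKER_HOST', $PORT);" || return 1

  # 2. Worker: drop and re-create the database and the extension.
  run_sql "$WORKER_HOST" postgres "DROP DATABASE IF EXISTS $DB WITH (FORCE);" || return 1
  run_sql "$WORKER_HOST" postgres "CREATE DATABASE $DB;" || return 1
  run_sql "$WORKER_HOST" "$DB" "CREATE EXTENSION citus;" || return 1

  # 3. Coordinator: register the fresh worker, then inspect the metadata.
  run_sql "$COORD_HOST" "$DB" "SELECT * FROM citus_add_node('$WORKER_HOST', $PORT);" || return 1
  run_sql "$COORD_HOST" "$DB" "SELECT nodename, nodeport, isactive FROM pg_dist_node;" || return 1
}

readd_worker
```

Setting `DRY_RUN=0` makes the script call psql for real. Note that `citus_remove_node` may refuse to run if the worker still holds shard placements for distributed tables, in which case that error is itself useful information to report.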

ihalatci avatar Sep 24 '25 12:09 ihalatci