An unexpected error when re-adding a node
How To Reproduce
Citus version: Citus 12.1.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
The setup needs a master database, a worker database, and a manager database. I use docker-compose to reproduce the error.
The error can be reproduced on my computer with the following steps:
1. Create the database on the Master node
Master node(db to connect: postgres):
drop database testdb with (force);
Master node(db to connect: postgres):
create database testdb;
Master node(db to connect: testdb):
create extension citus;
2. Create the database on the Worker node and add the worker to the Master node
Worker node(db to connect: postgres):
drop database testdb with (force);
Worker node(db to connect: postgres):
create database testdb;
Worker node(db to connect: testdb):
create extension citus;
Master node(db to connect: testdb):
select * from citus_add_node('citus-worker-1', 5432); -- replace with your worker IP and port
3. Redo step 2.
Worker node(db to connect: postgres):
drop database testdb with (force);
Worker node(db to connect: postgres):
create database testdb;
Worker node(db to connect: testdb):
create extension citus;
Master node(db to connect: testdb):
Note: at this point, SELECT master_get_active_worker_nodes(); returns null.
This time, when I open a fresh connection and execute citus_add_node, it returns successfully:
select * from citus_add_node('citus-worker-1', 5432); -- replace with your worker IP and port
4. Error
When I run
SELECT master_get_active_worker_nodes(); on the Master node,
I correctly get 'citus-worker-1', 5432.
Then I tried to use psql to load a schema into the Master node:
psql -w -h localhost -U postgres -p 5432 testdb < /tmp/pg_schema_bk.sql 1> /dev/null
This produced the following error:
ERROR: This is an internal Citus function can only be used in a distributed transaction
CONTEXT: while executing command on citus-worker-1:5432
Despite the error, I found that the schema was loaded correctly.
However, when I tried to distribute a table by executing SELECT create_distributed_table('t0', 'vkey');
I got the same error:
ERROR: This is an internal Citus function can only be used in a distributed transaction
CONTEXT: while executing command on citus-worker-1:5432
I consider this an unexpected error because the "redo" step ends with a query that successfully adds the node to the Master. If Citus does not support this sequence of operations, it should raise the error earlier; otherwise, the final query that distributes the table should succeed.
Have you tried calling citus_remove_node() before dropping the database from the worker? It seems the cluster gets into an inconsistent state with the steps above, which are not the proper procedure for updating worker nodes. Do you have any more information to share with us?
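
For illustration, the sequence suggested in this reply might look roughly like the sketch below. It reuses the host name, port, and database name from the report. It assumes the worker holds no shard placements at the time it is removed (as in the reproduction, where no table has been distributed yet). Whether this ordering actually avoids the error has not been verified here. The final query against pg_dist_node is only a suggested way to inspect the node metadata and is not part of the original steps.

```sql
-- Sketch only: names and port are copied from the report above.

-- Master node (db to connect: testdb):
-- remove the worker from the Citus metadata before touching its database.
-- Assumption: the worker has no shard placements at this point.
SELECT citus_remove_node('citus-worker-1', 5432);

-- Worker node (db to connect: postgres):
-- recreate the database now that the coordinator no longer references it.
DROP DATABASE testdb WITH (FORCE);
CREATE DATABASE testdb;

-- Worker node (db to connect: testdb):
CREATE EXTENSION citus;

-- Master node (db to connect: testdb):
-- re-add the worker, then inspect how the coordinator sees it.
SELECT * FROM citus_add_node('citus-worker-1', 5432);
SELECT nodename, nodeport, isactive, hasmetadata, metadatasynced
FROM pg_dist_node;
```

If the error still appears with this ordering, the pg_dist_node output from the Master node would be useful information to attach to the report.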