Cluster blocks queries when one node is down

Open iamruhua opened this issue 3 years ago • 1 comments

I followed the instructions to create a Citus cluster with 5 workers and set the replication factor to 3 (using docker-compose -p citus up and scaling the workers to 5). I then created a distributed table, test_table, as described in the official online documentation, and inserted 100k rows. I verified that the shards were distributed using SELECT * FROM pg_dist_shard_placement, and successfully ran the select * query from the master server.

Then I wanted to test high availability, so I shut down one worker (worker1) and ran the same query, select * from test_table.

The expected result is 100000 rows returned immediately, since high availability is natively built in. The actual result is that the query blocks until it times out, unless I restart the "failed" node while the query is still blocked.

I was expecting the cluster to automatically redirect the query to other available worker nodes that hold the same shards as worker1. Is there anything else I have to do beyond using the official Docker images and instructions?
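
The expectation above can be illustrated with a small model. This is only a sketch, not Citus's placement or routing code: the round-robin layout, the shard count of 32, and the helper names place_shards and live_placements are assumptions made for illustration. It shows that with 5 workers and a replication factor of 3, every shard still has at least two live replicas after one worker is lost, so in principle the data remains reachable.

```python
def place_shards(num_shards, workers, replication_factor):
    """Assign each shard to `replication_factor` distinct workers, round-robin.

    Hypothetical model of a placement policy, for illustration only.
    """
    placements = {}
    for shard in range(num_shards):
        placements[shard] = [
            workers[(shard + i) % len(workers)]
            for i in range(replication_factor)
        ]
    return placements


def live_placements(placements, down):
    """Return, per shard, the workers that still hold a replica once `down` fail."""
    return {
        shard: [w for w in nodes if w not in down]
        for shard, nodes in placements.items()
    }


# 5 workers and replication factor 3, as in the setup described above.
# The shard count of 32 is an assumed value.
workers = [f"worker{i}" for i in range(1, 6)]
placements = place_shards(num_shards=32, workers=workers, replication_factor=3)

# Simulate shutting down worker1.
survivors = live_placements(placements, down={"worker1"})
unreachable = [shard for shard, nodes in survivors.items() if not nodes]

print(len(unreachable))  # 0: every shard still has a live replica
```

Whether a query is actually rerouted to those surviving replicas, and how long the coordinator waits on the dead node first, depends on Citus itself and is not captured by this model.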

Aug 16 '22 03:08 iamruhua

The documentation has sections on "Coordinator Node Failures" and "Worker Node Failures". It seems I need to create standby nodes for the workers and the coordinator.

Is the coordinator node the same as the Docker master node?
As I understand the documentation, Citus 11 provides sharding, replication, and "query from any node". Am I wrong?
Why do we still need a hot standby, other than for resource reservation?

Aug 16 '22 03:08 iamruhua