
Citus 12.1: stopping rebalance_table_shards(drain_only := true) causes duplicate shards and needs manual intervention

Open seppelucas opened this issue 2 months ago • 3 comments

Context: I am removing nodes from the cluster by first draining them. The nodes are marked with shouldhaveshards=false, and I am logged in on one of the coordinators.
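
For reference, the nodes being removed were flagged roughly like this (a sketch; the address is illustrative):

-- tell the rebalancer not to place shards on this worker
SELECT citus_set_node_property('10.1.1.33', 5432, 'shouldhaveshards', false);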

After starting the drain I wanted to update citus.max_background_task_executors_per_node, so I canceled the drain. On restart I got the following error:

ERROR:  shard move failed as the orphaned shard public.<table>_120737 leftover from the previous move could not be cleaned up

When checking the shards I see:

 table_name | shardid |   shard_name   | citus_table_type | colocation_id | nodename  | nodeport | shard_size
------------+---------+----------------+------------------+---------------+-----------+----------+------------
 <table>    |  120737 | <table>_120737 | distributed      |             2 | 10.1.1.33 |     5432 |  735838208
 <table>    |  120737 | <table>_120737 | distributed      |             2 | 10.1.1.75 |     5432 |  735838208

33 and 75 are the primary and replica of one worker node.
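
The duplicate placement above can be listed with a query along these lines against the citus_shards view (a sketch, using the shard id from the error):

SELECT table_name, shardid, nodename, nodeport, shard_size
FROM citus_shards
WHERE shardid = 120737;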

citus=# select * from pg_dist_cleanup;
 record_id | operation_id | object_type |      object_name      | node_group_id | policy_type
-----------+--------------+-------------+-----------------------+---------------+-------------
    782755 |         8714 |           1 | public.<table>_127393 |             2 |           1
    782756 |         8714 |           1 | public.<table>_127649 |             2 |           1
    782764 |         8714 |           1 | public.<table>_129697 |             2 |           1
(3 rows)

citus=# SELECT * FROM rebalance_table_shards(drain_only := true);
NOTICE:  Moving shard 120737 from 10.1.1.33:5432 to 10.1.1.87:5432 ...
ERROR:  shard move failed as the orphaned shard public.activities_127393 leftover from the previous move could not be cleaned up
CONTEXT:  while executing command on localhost:5432


citus=# select * from pg_dist_cleanup;
 record_id | operation_id | object_type |      object_name      | node_group_id | policy_type
-----------+--------------+-------------+-----------------------+---------------+-------------
    782755 |         8714 |           1 | public.<table>_127393 |             2 |           1
    782756 |         8714 |           1 | public.<table>_127649 |             2 |           1
(2 rows)

citus=# delete from pg_dist_cleanup;
DELETE 2

citus=# SELECT * FROM rebalance_table_shards(drain_only := true);
NOTICE:  Moving shard 120737 from 10.1.1.33:5432 to 10.1.1.87:5432 ...

87 is the primary of a worker node that is not disabled. I restarted the rebalance by cleaning up pg_dist_cleanup, as referenced in issue #7333.


Questions:

Is this expected behavior? Would the cleanup have made it possible to restart the rebalance if I had given it more time, or is manual intervention required?

seppelucas avatar Oct 09 '25 08:10 seppelucas

I don't think this is expected; it seems to be caused by the error you shared. Could you check whether there are any more detailed logs around this error that explain why the shard couldn't be cleaned up?

ERROR: shard move failed as the orphaned shard public.<table>_120737 leftover from the previous move could not be cleaned up

onurctirtir avatar Oct 09 '25 11:10 onurctirtir

Hello @onurctirtir, thanks for your response.

I see the following logs from 10.1.1.33:

2025-10-09 10:47:52 UTC [3981040]: [8-1] 68e7919b.3cbef0 0 citus postgres citus_internal gpid=10001189197 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:52 UTC [3981043]: [6-1] 68e7919b.3cbef3 0 citus postgres citus_internal gpid=10001189197 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:52 UTC [3981044]: [3-1] 68e7919b.3cbef4 0 citus postgres citus_internal gpid=10001189197 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:52 UTC [3981045]: [3-1] 68e7919b.3cbef5 0 citus postgres citus_internal gpid=10001189197 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:52 UTC [3981046]: [3-1] 68e7919b.3cbef6 0 citus postgres citus_internal gpid=10001189197 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:53 UTC [3980972]: [7-1] 68e79163.3cbeac 129058287 citus customer citus_internal gpid=10001196595 10.x.x.x ERROR:  duplicate key value violates unique constraint "unique_<>_137377"
2025-10-09 10:47:54 UTC [3981055]: [19-1] 68e7919e.3cbeff 129058295 citus customer citus_internal gpid=10001196997 10.x.x.x ERROR:  duplicate key value violates unique constraint "unique_<>_137377"
2025-10-09 10:47:54 UTC [3980611]: [7-1] 68e79068.3cbd43 129058223 citus customer citus_internal gpid=10001163758 10.x.x.x ERROR:  duplicate key value violates unique constraint "unique_<>_137377"
2025-10-09 10:47:55 UTC [3980906]: [22-1] 68e79134.3cbe6a 129058297 citus customer citus_internal gpid=10001164832 10.x.x.x ERROR:  deadlock detected
2025-10-09 10:47:56 UTC [3979911]: [16-1] 68e78f9a.3cba87 129058278 citus customer citus_internal gpid=10001163811 10.x.x.x ERROR:  canceling statement due to user request
2025-10-09 10:47:57 UTC [3980042]: [38-1] 68e78fb1.3cbb0a 129058170 citus customer citus_internal gpid=10001189605 10.x.x.x ERROR:  duplicate key value violates unique constraint "unique_<>_137377"
2025-10-09 10:47:58 UTC [3978586]: [42-1] 68e78e59.3cb55a 129058171 citus customer citus_internal gpid=10001163866 10.x.x.x ERROR:  deadlock detected

The duplicate key ERROR was already occurring before the rebalance/drain, but it could be related.

I am still having issues with draining the nodes. The drain hangs on moving the single shard 120737. When looking at pg_stat_activity, it is waiting on a client read:

SELECT pid,
       state,
       wait_event_type,
       wait_event,
       query_start,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE application_name LIKE 'citus%';
  pid   | state  | wait_event_type | wait_event |         query_start          |   running_for   |                               query
--------+--------+-----------------+------------+------------------------------+-----------------+--------------------------------------------------------------------
 155448 | active | Client          | ClientRead | 2025-10-09 08:32:20.47286+00 | 01:45:42.204921 | SELECT pg_catalog.citus_move_shard_placement(120737,346,3,'auto')

The shard is 700 MB and does not seem to progress. This is a production setup with serious hardware and networking. I have bumped citus.max_background_task_executors to 64 and citus.max_background_task_executors_per_node to 8.
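
For reference, the executor settings were bumped roughly like this on the coordinator (a sketch; applied via ALTER SYSTEM followed by a config reload):

ALTER SYSTEM SET citus.max_background_task_executors = 64;
ALTER SYSTEM SET citus.max_background_task_executors_per_node = 8;
SELECT pg_reload_conf();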

I have tried draining with citus_drain_node, but it hangs on the same shard move as the rebalance with drain_only. I have also tried block_writes, but that causes impact for the customer, so it is currently not an option.
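
The two approaches I tried look roughly like this (a sketch; the node address is illustrative):

-- drain a single worker
SELECT citus_drain_node('10.1.1.33', 5432);

-- same drain via the rebalancer, forcing blocking transfers
SELECT * FROM rebalance_table_shards(drain_only := true, shard_transfer_mode := 'block_writes');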

I am trying to drain the last 3 nodes:


  nodename  | nodeport | shard_count
------------+----------+-------------
 10.1.1.86  |     5432 |        4186
 10.1.1.41  |     5432 |        4004
 10.1.1.62  |     5432 |        3822
 10.1.1.239 |     5432 |        3822
 10.1.1.189 |     5432 |        3731
 10.1.1.87  |     5432 |        3458
 10.1.1.33  |     5432 |          91
 10.1.1.68  |     5432 |          91
 10.1.1.38  |     5432 |          91
(9 rows)
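
The shard counts above were gathered with a query along these lines against the citus_shards view (a sketch):

SELECT nodename, nodeport, count(*) AS shard_count
FROM citus_shards
GROUP BY nodename, nodeport
ORDER BY shard_count DESC;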

seppelucas avatar Oct 09 '25 11:10 seppelucas

The draining of .33 suddenly finished. We traced the pid on the coordinator and saw it was just waiting on external connections, so we let it do its thing, and then suddenly the 91 shards were moved. However, there were no further NOTICE logs in the terminal for any other shards.

seppelucas avatar Oct 09 '25 14:10 seppelucas