
rebalance never starts

senscarlos opened this issue 2 years ago · 12 comments

With citus 11.3 I've added a node and triggered a rebalance. The rebalance has been scheduled correctly but never starts running despite having 1 runnable task (and 10 blocked ones).

I'm using the docker image citusdata/citus:11.3 on all nodes. The connection between the nodes works (primary is at 10.132.0.2):

SELECT * FROM citus_get_active_worker_nodes();
 node_name  | node_port
------------+-----------
 10.132.0.4 |      5432
 10.132.0.5 |      5432
(2 rows)

Command history:

staging=# SELECT * from citus_add_node('10.132.0.5', 5432);
 citus_add_node
----------------
             10
(1 row)

Time: 623.522 ms
staging=# SELECT citus_rebalance_start();
NOTICE:  Scheduled 10 moves as job 1
DETAIL:  Rebalance scheduled as background job
HINT:  To monitor progress, run: SELECT * FROM citus_rebalance_status();
 citus_rebalance_start
-----------------------
                     1
(1 row)

Time: 26.101 ms
staging=# SELECT * FROM citus_rebalance_status();
 job_id |   state   | job_type  |           description           | started_at | finished_at |                              details
--------+-----------+-----------+---------------------------------+------------+-------------+--------------------------------------------------------------------
      1 | scheduled | rebalance | Rebalance all colocation groups |            |             | {"tasks": [], "task_state_counts": {"blocked": 10, "runnable": 1}}
(1 row)

Time: 3.200 ms
staging=# SELECT pg_terminate_backend(pg_stat_activity.pid)
FROM pg_stat_activity
WHERE pg_stat_activity.datname = 'staging'
  AND pid <> pg_backend_pid();
 pg_terminate_backend
----------------------
 t
 t
 t
 t
 t
 t
 t
 t
(8 rows)
staging=# SELECT get_rebalance_table_shards_plan();
               get_rebalance_table_shards_plan
-------------------------------------------------------------
 (sensor_datapoint,102183,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102182,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102185,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102184,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102187,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102186,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102189,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102188,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102191,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102190,0,10.132.0.2,5432,10.132.0.5,5432)
(10 rows)

Time: 4.475 ms
staging=# SELECT * from pg_dist_node;
 nodeid | groupid |  nodename  | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 10.132.0.2 |     5432 | default  | t           | t        | primary  | default     | t              | t
      6 |       5 | 10.132.0.4 |     5432 | default  | t           | t        | primary  | default     | t              | t
     10 |       9 | 10.132.0.5 |     5432 | default  | t           | t        | primary  | default     | t              | t
staging=# ALTER SYSTEM SET citus.max_background_task_executors_per_node = 2;
ALTER SYSTEM
Time: 9.613 ms
staging=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

Time: 1.585 ms
staging=# SELECT * FROM citus_rebalance_status() \gx
-[ RECORD 1 ]-------------------------------------------------------------------
job_id      | 1
state       | scheduled
job_type    | rebalance
description | Rebalance all colocation groups
started_at  |
finished_at |
details     | {"tasks": [], "task_state_counts": {"blocked": 10, "runnable": 1}}

Time: 3.033 ms

I've been waiting for a long time and nothing changes.

senscarlos avatar Aug 03 '23 14:08 senscarlos

Did you check the progress reporting? https://www.citusdata.com/updates/v11-2/#rebalance_lsn

onderkalaci avatar Aug 03 '23 17:08 onderkalaci

Did you check the progress reporting? https://www.citusdata.com/updates/v11-2/#rebalance_lsn

Nothing there:

staging=# SELECT * FROM get_rebalance_progress();
 sessionid | table_name | shardid | shard_size | sourcename | sourceport | targetname | targetport | progress | source_shard_size | target_shard_size | operation_type | source_lsn | target_lsn | status
-----------+------------+---------+------------+------------+------------+------------+------------+----------+-------------------+-------------------+----------------+------------+------------+--------
(0 rows)

Time: 0.507 ms

senscarlos avatar Aug 04 '23 05:08 senscarlos

staging=# SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'staging' AND pid <> pg_backend_pid(); pg_terminate_backend

btw, terminating the backends also terminates the background rebalance tasks/jobs as they are also running in the same database.

Normally, if a rebalance task fails, Citus retries. But maybe for terminate, Citus does not restart?

Also, maybe check the logs to see if something is preventing the rebalancer from starting or progressing? There might be ERRORs.
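
For reference, one way to see what the scheduled tasks are waiting on (and whether an error was recorded) is to query the background job catalogs directly; a sketch assuming the pg_dist_background_job and pg_dist_background_task catalogs that back the background rebalancer (Citus 11.1+):

-- overall job state
SELECT job_id, state, job_type, started_at, finished_at
FROM pg_dist_background_job;

-- per-task status, retry count, and last recorded message (if any)
SELECT task_id, status, retry_count, message
FROM pg_dist_background_task
ORDER BY task_id;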

onderkalaci avatar Aug 04 '23 05:08 onderkalaci

staging=# SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'staging' AND pid <> pg_backend_pid(); pg_terminate_backend

btw, terminating the backends also terminates the background rebalance tasks/jobs as they are also running in the same database.

Normally, if a rebalance task fails, Citus retries. But maybe for terminate, Citus does not restart?

Also, maybe check the logs to see if something is preventing the rebalancer from starting or progressing? There might be ERRORs.

First of all, thanks for your help. Here are some suspicious logs:

2023-08-04 05:52:07.924 UTC [315203] WARNING:  unable to start background worker for background task execution
2023-08-04 05:52:07.924 UTC [315203] DETAIL:  Current number of task executors: 0/4
2023-08-04 05:52:07.924 UTC [315203] CONTEXT:  Citus Background Task Queue Monitor: staging

senscarlos avatar Aug 04 '23 05:08 senscarlos

unable to start background worker for background task execution

Found the error message defined here: https://webcache.googleusercontent.com/search?q=cache:qHGF3I05MvAJ:https://github.com/citusdata/citus/blob/master/src/backend/distributed/utils/maintenanced.c&cd=9&hl=es&ct=clnk&gl=es

senscarlos avatar Aug 04 '23 05:08 senscarlos

Can you try increasing max_worker_processes in your postgresql.conf and restart postgres? The error suggests that it cannot start any. If that indeed helps, we should add a hint to that error to suggest that approach.
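
For concreteness, a minimal sketch of that change (the value 16 is only illustrative; max_worker_processes is a restart-only setting, so pg_reload_conf() alone will not apply it):

SHOW max_worker_processes;                      -- PostgreSQL's default is 8
ALTER SYSTEM SET max_worker_processes = 16;     -- leave headroom for Citus' background task executors
-- restart PostgreSQL afterwards for the new value to take effect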

JelteF avatar Aug 04 '23 06:08 JelteF

Can you try increasing max_worker_processes in your postgresql.conf and restart postgres? The error suggests that it cannot start any. If that indeed helps, we should add a hint to that error to suggest that approach.

That worked! Now there's an error in the rebalance status output:

ERROR:  ERROR:  logical decoding requires wal_level >= logical
CONTEXT:  while executing command on 10.132.0.2:5432
while executing command on localhost:5432
Citus Background Task Queue Executor: staging/postgres for (1/1)

Fixed by adding the following lines to the postgresql.conf of all nodes:

# include enough information in WAL for logical decoding
wal_level = logical

# we need at least one replication slot, but leave headroom
max_replication_slots = 10

# enable Citus' logical decoding improvements for dist tables
citus.enable_change_data_capture = 'on'

As explained in https://docs.citusdata.com/en/stable/develop/integrations.html#logical-replication-of-distributed-tables-to-postgresql-tables
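
For reference, a rough ALTER SYSTEM equivalent of the snippet above, run on every node (both wal_level and max_replication_slots only take effect after a restart, so pg_reload_conf() alone is not enough):

ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET citus.enable_change_data_capture = 'on';
-- restart each node so wal_level and max_replication_slots take effect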

I would not have expected to have to make these configuration changes manually, given that I'm using the official Docker image for Citus.

Thanks a lot for your help and the amazing project!

senscarlos avatar Aug 04 '23 07:08 senscarlos

Okay. Seems like two fixes are in order then on our side:

  1. Add a HINT to the error suggesting an increase of max_worker_processes.
  2. Configure settings in our docker image so that the rebalancer (and probably CDC) works by default.

JelteF avatar Aug 04 '23 07:08 JelteF

ccnsapp=# SELECT * FROM get_rebalance_progress();
 sessionid |             table_name             | shardid | shard_size | sourcename | sourceport | targetname  | targetport | progress | source_shard_size | target_shard_size | operation_type |  source_lsn  | target_lsn |   status
-----------+------------------------------------+---------+------------+------------+------------+-------------+------------+----------+-------------------+-------------------+----------------+--------------+------------+------------
    129334 | companies                          |  102026 |     212992 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |            212992 |              8192 | move           | 8FE/60D49CD0 |            | Setting Up
    129334 | agents                             |  102058 |   36380672 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |          36380672 |              8192 | move           | 8FE/60D49CD0 |            | Setting Up
    129334 | assets                             |  102090 |  196091904 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |         196091904 |              8192 | move           | 8FE/60D49CD0 |            | Setting Up
    129334 | deprecation_status                 |  102122 |      32768 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |             32768 |                 0 | move           | 8FE/60D49CD0 |            | Setting Up
    129334 | windows_specifics                  |  102154 |      32768 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |             32768 |                 0 | move           | 8FE/60D49CD0 |            | Setting Up
    129334 | mac_specifics                      |  102186 |      40960 | 10.104.0.5 |       5432 | 10.104.0.10 |       5432 |        1 |             40960 |              8192 | move           | 8FE/60D49CD0 |            | Setting Up

It says "Setting Up"... When will it actually start, given that the rebalance is active and in progress? Why is it taking so long?

shankarmn94 avatar Jan 25 '24 09:01 shankarmn94

@shankarmn94 did you take a look at the postgres logs for some hints?

Also did you try increasing max_worker_processes, because that helped for @senscarlos

JelteF avatar Jan 25 '24 12:01 JelteF

@shankarmn94 did you take a look at the postgres logs for some hints?

Also did you try increasing max_worker_processes, because that helped for @senscarlos

@JelteF @onderkalaci

Yes, I increased it to 4 and initiated the rebalance again, with the same result. It turned out to be a problem with one node: it was failing to transfer shards because of a network issue. SELECT queries on the existing shards work now.

We forcefully removed those shard IDs from the cluster and removed the worker node that was causing the issue.

Now, when we try to create distributed tables:

ccnsapp=# SELECT create_reference_table('service_ports');
ERROR:  table "service_ports" is already distributed
ccnsapp=# SELECT create_distributed_table('asset_stats', 'tenantid', colocate_with => 'companies');
ERROR:  reference table "service_ports" does not have a shard

How can we resolve this?

shankarmn94 avatar Jan 29 '24 08:01 shankarmn94

We forcefully removed those shard IDs from the cluster

That's really not recommended. Manually removing shards is bound to cause all sorts of issues, including data loss.

If there is no important data in the service_ports table, then I think the easiest way to "solve" your current problem is by dropping the service_ports table.
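
If you do go that route, a minimal sketch, assuming the data really is disposable (the column list below is just a placeholder, not your real schema):

-- only if the data in service_ports is disposable
DROP TABLE service_ports;

-- recreate it with the real schema, then turn it back into a reference table
CREATE TABLE service_ports (port int PRIMARY KEY, service_name text);
SELECT create_reference_table('service_ports');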

JelteF avatar Jan 30 '24 15:01 JelteF