
Intermittent failures of unstable citus regression tests

Open · saygoodbyye opened this issue 1 month ago · 1 comment

Hello! On our test system, Citus tests are run daily on a large number of operating systems. The virtual machines are created on a hypervisor with 4GB of RAM and 2 cores, so they can slow down significantly under heavy concurrent load. This let us identify several unstable tests that fail frequently under these conditions, but less frequently on faster machines.

The following failures can be reproduced by simply running the test in a loop.

The experiments were conducted on PostgreSQL REL_17_STABLE with the Citus release-13.0 branch.

CFLAGS='-Og' ./configure --enable-tap-tests --enable-debug --with-openssl --with-icu --with-lz4 --with-libxml --quiet --prefix=$PGPREFIX
  1. multi_extension
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi-1 || break; done
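The loop above lets the next iteration overwrite the failing run's diffs. A hedged sketch of a small wrapper that stops on the first failure and copies the evidence aside (the `run_until_failure` helper name is ours, not part of the Citus tooling; the diffs path is the one from this report):

```shell
# Sketch of a loop wrapper (our own helper, not part of Citus): run a
# command repeatedly, stop at the first failure, and copy the regression
# diffs aside so later runs cannot overwrite them.
run_until_failure() {
    cmd=$1; diffs=$2; max=${3:-100000}
    for i in $(seq 1 "$max"); do
        echo "ITERATION $i"
        if ! $cmd; then
            # preserve the evidence from the failing iteration
            [ -f "$diffs" ] && cp "$diffs" "failure-$i.diffs"
            return 1
        fi
    done
}
# e.g.: run_until_failure "make check-multi-1" \
#           /home/test/pg-tests/tmp/citus/check-multi-1/regression.diffs
```

The per-iteration copies make it possible to compare several independent failures of the same test afterwards.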
-->> not ok 1     - multi_extension                          3908 ms
-->> # The differences that caused some tests to fail can be viewed in the file "/home/test/pg-tests/tmp/citus/check-multi-1/regression.diffs".
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_extension.out /home/test/pg-tests/tmp/citus/check-multi-1/results/multi_extension.out
--- /home/test/pg-tests/tmp/citus/check-multi-1/expected/multi_extension.out.modified	2025-08-06 12:47:03.012315772 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi-1/results/multi_extension.out.modified	2025-08-06 12:47:03.042316252 +0300
@@ -1618,21 +1618,21 @@
 -- check that maintenance daemon gets (re-)started for the right user
 DROP EXTENSION citus;
 CREATE USER testuser SUPERUSER;
 SET ROLE testuser;
 CREATE EXTENSION citus;
 SELECT datname, current_database(),
     usename, (SELECT extowner::regrole::text FROM pg_extension WHERE extname = 'citus')
 FROM test.maintenance_worker();
   datname   | current_database |  usename  | extowner 
 ------------+------------------+-----------+----------
- regression | regression       | testuser | testuser
+ regression | regression       | testuser2 | testuser
 (1 row)
 
 -- and recreate as the right owner
 RESET ROLE;
 DROP EXTENSION citus;
 CREATE EXTENSION citus;
 -- Check that maintenance daemon can also be started in another database
 CREATE DATABASE another;
 NOTICE:  Citus partially supports CREATE DATABASE for distributed databases
 DETAIL:  Citus does not propagate CREATE DATABASE command to workers
  2. single_node_enterprise
for i in `seq 100000`; do echo "ITERATION $i"; make check-enterprise || break; done

FAILED on iterations 389 and 689. Sometimes it fails much sooner because of the concurrent test jobs.

To reproduce it faster, we can repeat the test in enterprise_schedule like below:

test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
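The repeated lines above can be generated rather than typed. A minimal sketch (the scratch file name `repeated_schedule_lines.txt` is ours; append its contents to `src/test/regress/enterprise_schedule` in your checkout):

```shell
# Generate N repeated schedule entries instead of editing the file by hand.
# repeated_schedule_lines.txt is a scratch name; append its contents to
# src/test/regress/enterprise_schedule in your Citus checkout.
for i in $(seq 10); do
    echo "test: single_node_enterprise"
done > repeated_schedule_lines.txt
```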
diff -dU10 -w /home/test/citus/src/test/regress/expected/single_node_enterprise.out /home/test/citus/src/test/regress/results/single_node_enterprise.out
--- /home/test/citus/src/test/regress/expected/single_node_enterprise.out.modified      2025-11-18 20:24:59.934533920 +0300
+++ /home/test/citus/src/test/regress/results/single_node_enterprise.out.modified       2025-11-18 20:24:59.942534037 +0300
@@ -465,28 +465,30 @@
 NOTICE:  issuing /*{"cId":22,"tId":"101"}*/INSERT INTO single_node_ent.test_90730501 (x, y) VALUES (101, 100)
        INSERT INTO test(x,y) VALUES (102,100);
 NOTICE:  issuing /*{"cId":22,"tId":"102"}*/INSERT INTO single_node_ent.test_90730502 (x, y) VALUES (102, 100)
        -- followed by a multi-shard command
        SELECT count(*) FROM test;
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730501 test WHERE true
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730502 test WHERE true
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730503 test WHERE true
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731504 test WHERE true
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731505 test WHERE true
+NOTICE:  issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 511, '2025-11-18 09:24:59.815208-08');
 NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731506 test WHERE true
  count
 -------
     53
 (1 row)

 ROLLBACK;
 NOTICE:  issuing ROLLBACK
+NOTICE:  issuing ROLLBACK
 -- should fail as only read access is allowed
 SET ROLE read_access_single_node;
 INSERT INTO test VALUES (1, 1, (95, 'citus9.5')::new_type);
 ERROR:  permission denied for table test
 SET ROLE postgres;
 \c
 SET search_path TO single_node_ent;
 -- Cleanup
 RESET citus.log_remote_commands;
 SET client_min_messages TO WARNING;
  3. multi_explain
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi || break; done
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_explain.out /home/test/pg-tests/tmp/citus/check-multi/results/multi_explain.out
--- /home/test/pg-tests/tmp/citus/check-multi/expected/multi_explain.out.modified	2025-08-04 21:57:59.663048132 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi/results/multi_explain.out.modified	2025-08-04 21:57:59.701049106 +0300
@@ -1106,40 +1106,40 @@
 	SET l_suppkey = 12
 	FROM orders_hash_part
 	WHERE orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey;
 Custom Scan (Citus Adaptive)
   Task Count: 4
   Tasks Shown: One of 4
   ->  Task
         Node: host=localhost port=57637 dbname=regression
         ->  Update on lineitem_hash_part_360041 lineitem_hash_part
               ->  Hash Join
-                    Hash Cond: (lineitem_hash_part.l_orderkey = orders_hash_part.o_orderkey)
-                    ->  Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
-                    ->  Hash
+                    Hash Cond: (orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey)
                     ->  Seq Scan on orders_hash_part_360045 orders_hash_part
+                    ->  Hash
+                          ->  Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
 -- Test delete with subquery
 EXPLAIN (COSTS FALSE)
 	DELETE FROM lineitem_hash_part
 	USING orders_hash_part
 	WHERE orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey;
 Custom Scan (Citus Adaptive)
   Task Count: 4
   Tasks Shown: One of 4
   ->  Task
         Node: host=localhost port=57637 dbname=regression
         ->  Delete on lineitem_hash_part_360041 lineitem_hash_part
               ->  Hash Join
-                    Hash Cond: (lineitem_hash_part.l_orderkey = orders_hash_part.o_orderkey)
-                    ->  Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
-                    ->  Hash
+                    Hash Cond: (orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey)
                     ->  Seq Scan on orders_hash_part_360045 orders_hash_part
+                    ->  Hash
+                          ->  Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
 -- Test track tracker
 EXPLAIN (COSTS FALSE)
 	SELECT avg(l_linenumber) FROM lineitem WHERE l_orderkey > 9030;
 Aggregate
   ->  Custom Scan (Citus Adaptive)
         Task Count: 2
         Tasks Shown: One of 2
         ->  Task
               Node: host=localhost port=57637 dbname=regression
               ->  Aggregate
  4. multi_cluster_management
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi || break; done

Fails very often. When I was testing this failure, it failed on the 1st iteration.

diff -dU10 -w /home/test/citus/src/test/regress/expected/multi_cluster_management.out /home/test/citus/src/test/regress/results/multi_cluster_management.out
--- /home/test/citus/src/test/regress/expected/multi_cluster_management.out.modified    2025-11-19 07:37:47.889673630 +0300
+++ /home/test/citus/src/test/regress/results/multi_cluster_management.out.modified     2025-11-19 07:37:47.905673868 +0300
@@ -84,20 +84,21 @@
         1
 (1 row)

 SELECT citus_disable_node('localhost', :worker_2_port);
  citus_disable_node
 --------------------

 (1 row)

 SELECT public.wait_until_metadata_sync(60000);
+WARNING:  waiting for metadata sync timed out
  wait_until_metadata_sync
 --------------------------

 (1 row)

 SELECT master_get_active_worker_nodes();
  master_get_active_worker_nodes
 --------------------------------
  (localhost,57637)
 (1 row)
  5. prepared_statements_1 and ch_benchmarks_3
for i in `seq 100000`; do echo "ITERATION $i"; make check-arbitrary-configs || break; done
diff -dU10 -w /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/prepared_statements_1.out /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/prepared_statements_1.out
--- /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/prepared_statements_1.out.modified	2025-08-08 10:59:21.098350547 +0300
+++ /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/prepared_statements_1.out.modified	2025-08-08 10:59:21.103350557 +0300
@@ -2,25 +2,22 @@
 PREPARE repartition_prepared(int) AS
 	SELECT
 		count(*)
 	FROM
 		repartition_prepared_test t1
 			JOIN
 		repartition_prepared_test t2
 			USING (b)
 		WHERE t1.a = $1;
 EXECUTE repartition_prepared (1);
- count
----------------------------------------------------------------------
-   100
-(1 row)
-
+ERROR:  cannot connect to localhost:10246 to fetch intermediate results
+CONTEXT:  while executing command on localhost:10245
 BEGIN;
 	-- CREATE TABLE ... AS EXECUTE prepared_statement tests
 	CREATE TEMP TABLE repartition_prepared_tmp AS EXECUTE repartition_prepared(1);
 	SELECT count(*) from repartition_prepared_tmp;
  count 
 -------
      1
 (1 row)
 
 ROLLBACK;
diff -dU10 -w /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/ch_benchmarks_3.out /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/ch_benchmarks_3.out
--- /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/ch_benchmarks_3.out.modified	2025-08-08 10:59:30.743369547 +0300
+++ /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/ch_benchmarks_3.out.modified	2025-08-08 10:59:30.748369557 +0300
@@ -1,26 +1,21 @@
 SET search_path to "ch benchmarks";
 -- Subquery + repartition is supported when it is an IN query where the subquery
 -- returns unique results (because it's converted to an INNER JOIN)
 select  s_i_id
     from stock, order_line
     where
         s_i_id in (select i_id from item)
         AND s_i_id = ol_i_id
     order by s_i_id;
- s_i_id
----------------------------------------------------------------------
-      1
-      2
-      3
-(3 rows)
-
+ERROR:  cannot connect to localhost:10247 to fetch intermediate results
+CONTEXT:  while executing command on localhost:10247
 select   su_name, su_address
 from     supplier, nation
 where    su_suppkey in
         (select  mod(s_i_id * s_w_id, 10000)
         from     stock, order_line
         where    s_i_id in
                 (select i_id
                  from item
                  where i_data like 'ab%')
              and ol_i_id=s_i_id
  6. multi_mx_node_metadata
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi-mx || break; done
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_mx_node_metadata.out /home/test/pg-tests/tmp/citus/check-multi-mx/results/multi_mx_node_metadata.out
--- /home/test/pg-tests/tmp/citus/check-multi-mx/expected/multi_mx_node_metadata.out.modified	2025-08-05 09:02:39.131772001 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi-mx/results/multi_mx_node_metadata.out.modified	2025-08-05 09:02:39.146772062 +0300
@@ -284,26 +284,23 @@
  
 (1 row)
 
 select wait_until_metadata_sync(30000);
  wait_until_metadata_sync 
 --------------------------
  
 (1 row)
 
 select pid as pid_after_error from pg_stat_activity where application_name like 'Citus Met%' \gset
+no rows returned for \gset
 select :pid_before_error != :pid_after_error AS metadata_sync_restarted;
- metadata_sync_restarted
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR:  syntax error at or near ":" at character 17
 SELECT trigger_metadata_sync();
  trigger_metadata_sync 
 -----------------------
  
 (1 row)
 
 SELECT wait_until_metadata_sync(30000);
  wait_until_metadata_sync 
 --------------------------
  7. multi_metadata_sync
for i in `seq 100000`; do echo "ITERATION $i"; make check-full || break; done

See attached regression.diffs

  8. CitusThreeWorkersManyShardsClusterConfig. This test fails because the OOM killer kills a process during the run; this occurs on a virtual machine with 4GB of RAM.
python3.10 -m pipenv run make check-arbitrary-configs CONFIGS=CitusThreeWorkersManyShardsClusterConfig
CitusThreeWorkersManyShardsClusterConfig: FAIL: see ./tmp_citus_test/dataCitusThreeWorkersManyShardsClusterConfig/run.out runtime: 288.97670102119446 seconds

test@ubuntu-2204:~/work/citus/src/test/regress/tmp_citus_test/dataCitusThreeWorkersManyShardsClusterConfig$ cat regression.out 
# using postmaster on Unix socket, port 10215
# parallel group (3 tests):  prepared_statements_3 prepared_statements_2 prepared_statements_1
ok 1         + prepared_statements_1                   27788 ms
ok 2         + prepared_statements_2                    7391 ms
ok 3         + prepared_statements_3                     551 ms
ok 4         - prepared_statements_4                   35234 ms
# parallel group (3 tests):  ch_benchmarks_1 ch_benchmarks_3 ch_benchmarks_2
ok 5         + ch_benchmarks_1                         32910 ms
ok 6         + ch_benchmarks_2                         52325 ms
ok 7         + ch_benchmarks_3                         51747 ms
# parallel group (3 tests):  ch_benchmarks_5 ch_benchmarks_4 ch_benchmarks_6
ok 8         + ch_benchmarks_4                         17959 ms
ok 9         + ch_benchmarks_5                          1103 ms
ok 10        + ch_benchmarks_6                         18617 ms
# parallel group (2 tests):  intermediate_result_pruning_queries_2 intermediate_result_pruning_queries_1
ok 11        + intermediate_result_pruning_queries_1     1897 ms
ok 12        + intermediate_result_pruning_queries_2     1033 ms
# parallel group (2 tests):  dropped_columns_1 distributed_planning
ok 13        + dropped_columns_1                        1949 ms
ok 14        + distributed_planning                    38467 ms
# parallel group (3 tests):  nested_execution arbitrary_configs_recurring_outer_join local_dist_join
not ok 15    + local_dist_join                         18210 ms
# (test process exited with exit code 2)
ok 16        + nested_execution                          718 ms
ok 17        + arbitrary_configs_recurring_outer_join     1628 ms
# parallel group (2 tests):  citus_run_command connectivity_checks
not ok 18    + connectivity_checks                      1982 ms
# (test process exited with exit code 2)
not ok 19    + citus_run_command                        1978 ms
# (test process exited with exit code 2)
not ok 20    - schemas                                     2 ms
# (test process exited with exit code 2)
not ok 21    - views                                       7 ms
# (test process exited with exit code 2)
not ok 22    - sequences                                   2 ms
# (test process exited with exit code 2)
not ok 23    - functions                                   2 ms
# (test process exited with exit code 2)
not ok 24    - arbitrary_configs_truncate                  2 ms
# (test process exited with exit code 2)
not ok 25    - arbitrary_configs_truncate_cascade          2 ms
# (test process exited with exit code 2)
not ok 26    - arbitrary_configs_truncate_partition        2 ms
# (test process exited with exit code 2)
ok 27        - arbitrary_configs_alter_table_add_constraint_without_name       11 ms
ok 28        - merge_arbitrary                           935 ms
ok 29        - arbitrary_configs_router                  684 ms
1..29
# 10 of 29 tests failed.
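When an OOM kill is suspected, the kernel log confirms it. A sketch of a small check (the `oom_kills` helper name is ours, and the grep pattern is an assumption: the exact "Out of memory: Killed process" wording varies across kernel versions):

```shell
# Count OOM-killer events in a saved kernel log (helper name is ours).
# The pattern matches the common "Out of memory: Killed process" message;
# its exact wording varies across kernel versions.
oom_kills() {
    grep -c -i -E 'out of memory: killed process|oom-kill' "$1"
}
# Usage: dmesg > kernel.log; oom_kills kernel.log
```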

Best regards,
Egor Chindyaskin
Postgres Professional: https://postgrespro.com/

saygoodbyye · Nov 19 '25 08:11

Could you please share the list of operating systems you are testing against, and whether these failures correlate with any particular OS? Do you clean up the database between test iterations?

ihalatci · Nov 20 '25 11:11