Intermittent failures of unstable Citus regression tests
Hello! On our test system, Citus tests are run daily on a large number of operating systems. The virtual machines are created on a hypervisor with 4 GB of RAM and 2 cores, so they can slow down considerably under heavy concurrent load. This let us identify several unstable tests that fail frequently under these conditions but far less often on faster machines.
The following failures can be reproduced simply by running the affected test in a loop.
The experiments were conducted on PostgreSQL REL_17_STABLE with the Citus release-13.0 branch.
CFLAGS='-Og' ./configure --enable-tap-tests --enable-debug --with-openssl --with-icu --with-lz4 --with-libxml --quiet --prefix=$PGPREFIX
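Each failing run overwrites regression.diffs, so the loops below can also be extended to snapshot the diff on every failure instead of stopping at the first one (a sketch; the diffs path differs per make target and is shown in each failure report below):
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi-1 || cp /home/test/pg-tests/tmp/citus/check-multi-1/regression.diffs regression.diffs.$i; done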
- multi_extension
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi-1 || break; done
-->> not ok 1 - multi_extension 3908 ms
-->> # The differences that caused some tests to fail can be viewed in the file "/home/test/pg-tests/tmp/citus/check-multi-1/regression.diffs".
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_extension.out /home/test/pg-tests/tmp/citus/check-multi-1/results/multi_extension.out
--- /home/test/pg-tests/tmp/citus/check-multi-1/expected/multi_extension.out.modified 2025-08-06 12:47:03.012315772 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi-1/results/multi_extension.out.modified 2025-08-06 12:47:03.042316252 +0300
@@ -1618,21 +1618,21 @@
-- check that maintenance daemon gets (re-)started for the right user
DROP EXTENSION citus;
CREATE USER testuser SUPERUSER;
SET ROLE testuser;
CREATE EXTENSION citus;
SELECT datname, current_database(),
usename, (SELECT extowner::regrole::text FROM pg_extension WHERE extname = 'citus')
FROM test.maintenance_worker();
datname | current_database | usename | extowner
------------+------------------+-----------+----------
- regression | regression | testuser | testuser
+ regression | regression | testuser2 | testuser
(1 row)
-- and recreate as the right owner
RESET ROLE;
DROP EXTENSION citus;
CREATE EXTENSION citus;
-- Check that maintenance daemon can also be started in another database
CREATE DATABASE another;
NOTICE: Citus partially supports CREATE DATABASE for distributed databases
DETAIL: Citus does not propagate CREATE DATABASE command to workers
- single_node_enterprise
for i in `seq 100000`; do echo "ITERATION $i"; make check-enterprise || break; done
Failed on iterations 389 and 689. Failures can come much sooner when concurrent test jobs are running.
To reproduce this faster, the single_node_enterprise test can be duplicated in enterprise_schedule as below (a one-liner that does this is sketched after the list):
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
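The extra entries can be appended in one step, nine copies on top of the existing one (a sketch; run from the regress directory containing enterprise_schedule):
for i in `seq 9`; do echo "test: single_node_enterprise" >> enterprise_schedule; done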
diff -dU10 -w /home/test/citus/src/test/regress/expected/single_node_enterprise.out /home/test/citus/src/test/regress/results/single_node_enterprise.out
--- /home/test/citus/src/test/regress/expected/single_node_enterprise.out.modified 2025-11-18 20:24:59.934533920 +0300
+++ /home/test/citus/src/test/regress/results/single_node_enterprise.out.modified 2025-11-18 20:24:59.942534037 +0300
@@ -465,28 +465,30 @@
NOTICE: issuing /*{"cId":22,"tId":"101"}*/INSERT INTO single_node_ent.test_90730501 (x, y) VALUES (101, 100)
INSERT INTO test(x,y) VALUES (102,100);
NOTICE: issuing /*{"cId":22,"tId":"102"}*/INSERT INTO single_node_ent.test_90730502 (x, y) VALUES (102, 100)
-- followed by a multi-shard command
SELECT count(*) FROM test;
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90730501 test WHERE true
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90730502 test WHERE true
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90730503 test WHERE true
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90731504 test WHERE true
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90731505 test WHERE true
+NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 511, '2025-11-18 09:24:59.815208-08');
NOTICE: issuing SELECT count(*) AS count FROM single_node_ent.test_90731506 test WHERE true
count
-------
53
(1 row)
ROLLBACK;
NOTICE: issuing ROLLBACK
+NOTICE: issuing ROLLBACK
-- should fail as only read access is allowed
SET ROLE read_access_single_node;
INSERT INTO test VALUES (1, 1, (95, 'citus9.5')::new_type);
ERROR: permission denied for table test
SET ROLE postgres;
\c
SET search_path TO single_node_ent;
-- Cleanup
RESET citus.log_remote_commands;
SET client_min_messages TO WARNING;
- multi_explain
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi || break; done
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_explain.out /home/test/pg-tests/tmp/citus/check-multi/results/multi_explain.out
--- /home/test/pg-tests/tmp/citus/check-multi/expected/multi_explain.out.modified 2025-08-04 21:57:59.663048132 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi/results/multi_explain.out.modified 2025-08-04 21:57:59.701049106 +0300
@@ -1106,40 +1106,40 @@
SET l_suppkey = 12
FROM orders_hash_part
WHERE orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey;
Custom Scan (Citus Adaptive)
Task Count: 4
Tasks Shown: One of 4
-> Task
Node: host=localhost port=57637 dbname=regression
-> Update on lineitem_hash_part_360041 lineitem_hash_part
-> Hash Join
- Hash Cond: (lineitem_hash_part.l_orderkey = orders_hash_part.o_orderkey)
- -> Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
- -> Hash
+ Hash Cond: (orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey)
-> Seq Scan on orders_hash_part_360045 orders_hash_part
+ -> Hash
+ -> Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
-- Test delete with subquery
EXPLAIN (COSTS FALSE)
DELETE FROM lineitem_hash_part
USING orders_hash_part
WHERE orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey;
Custom Scan (Citus Adaptive)
Task Count: 4
Tasks Shown: One of 4
-> Task
Node: host=localhost port=57637 dbname=regression
-> Delete on lineitem_hash_part_360041 lineitem_hash_part
-> Hash Join
- Hash Cond: (lineitem_hash_part.l_orderkey = orders_hash_part.o_orderkey)
- -> Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
- -> Hash
+ Hash Cond: (orders_hash_part.o_orderkey = lineitem_hash_part.l_orderkey)
-> Seq Scan on orders_hash_part_360045 orders_hash_part
+ -> Hash
+ -> Seq Scan on lineitem_hash_part_360041 lineitem_hash_part
-- Test track tracker
EXPLAIN (COSTS FALSE)
SELECT avg(l_linenumber) FROM lineitem WHERE l_orderkey > 9030;
Aggregate
-> Custom Scan (Citus Adaptive)
Task Count: 2
Tasks Shown: One of 2
-> Task
Node: host=localhost port=57637 dbname=regression
-> Aggregate
- multi_cluster_management
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi || break; done
Fails very often; during my testing it failed on the first iteration.
diff -dU10 -w /home/test/citus/src/test/regress/expected/multi_cluster_management.out /home/test/citus/src/test/regress/results/multi_cluster_management.out
--- /home/test/citus/src/test/regress/expected/multi_cluster_management.out.modified 2025-11-19 07:37:47.889673630 +0300
+++ /home/test/citus/src/test/regress/results/multi_cluster_management.out.modified 2025-11-19 07:37:47.905673868 +0300
@@ -84,20 +84,21 @@
1
(1 row)
SELECT citus_disable_node('localhost', :worker_2_port);
citus_disable_node
--------------------
(1 row)
SELECT public.wait_until_metadata_sync(60000);
+WARNING: waiting for metadata sync timed out
wait_until_metadata_sync
--------------------------
(1 row)
SELECT master_get_active_worker_nodes();
master_get_active_worker_nodes
--------------------------------
(localhost,57637)
(1 row)
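The "waiting for metadata sync timed out" warning above points to a pure timing issue on the slow machine. One way to check this is to raise the helper's timeout in the test (a sketch, assuming the argument is a timeout in milliseconds, as the 60000/30000 calls in these tests suggest):
SELECT public.wait_until_metadata_sync(300000);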
- prepared_statements_1 and ch_benchmarks_3
for i in `seq 100000`; do echo "ITERATION $i"; make check-arbitrary-configs || break; done
diff -dU10 -w /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/prepared_statements_1.out /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/prepared_statements_1.out
--- /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/prepared_statements_1.out.modified 2025-08-08 10:59:21.098350547 +0300
+++ /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/prepared_statements_1.out.modified 2025-08-08 10:59:21.103350557 +0300
@@ -2,25 +2,22 @@
PREPARE repartition_prepared(int) AS
SELECT
count(*)
FROM
repartition_prepared_test t1
JOIN
repartition_prepared_test t2
USING (b)
WHERE t1.a = $1;
EXECUTE repartition_prepared (1);
- count
----------------------------------------------------------------------
- 100
-(1 row)
-
+ERROR: cannot connect to localhost:10246 to fetch intermediate results
+CONTEXT: while executing command on localhost:10245
BEGIN;
-- CREATE TABLE ... AS EXECUTE prepared_statement tests
CREATE TEMP TABLE repartition_prepared_tmp AS EXECUTE repartition_prepared(1);
SELECT count(*) from repartition_prepared_tmp;
count
-------
1
(1 row)
ROLLBACK;
diff -dU10 -w /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/ch_benchmarks_3.out /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/ch_benchmarks_3.out
--- /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/expected/ch_benchmarks_3.out.modified 2025-08-08 10:59:30.743369547 +0300
+++ /home/test/pg-tests/tmp/citus/check-arbitrary-configs/tmp_citus_test/dataCitusNonMxClusterConfig/results/ch_benchmarks_3.out.modified 2025-08-08 10:59:30.748369557 +0300
@@ -1,26 +1,21 @@
SET search_path to "ch benchmarks";
-- Subquery + repartition is supported when it is an IN query where the subquery
-- returns unique results (because it's converted to an INNER JOIN)
select s_i_id
from stock, order_line
where
s_i_id in (select i_id from item)
AND s_i_id = ol_i_id
order by s_i_id;
- s_i_id
----------------------------------------------------------------------
- 1
- 2
- 3
-(3 rows)
-
+ERROR: cannot connect to localhost:10247 to fetch intermediate results
+CONTEXT: while executing command on localhost:10247
select su_name, su_address
from supplier, nation
where su_suppkey in
(select mod(s_i_id * s_w_id, 10000)
from stock, order_line
where s_i_id in
(select i_id
from item
where i_data like 'ab%')
and ol_i_id=s_i_id
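Both diffs above are "cannot connect ... to fetch intermediate results" failures from the same configuration, so the run can be narrowed with the same CONFIGS knob used later in this report (a sketch; the configuration name CitusNonMxClusterConfig is inferred from the dataCitusNonMxClusterConfig result paths above):
for i in `seq 100000`; do echo "ITERATION $i"; make check-arbitrary-configs CONFIGS=CitusNonMxClusterConfig || break; done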
- multi_mx_node_metadata
for i in `seq 100000`; do echo "ITERATION $i"; make check-multi-mx || break; done
diff -dU10 -w /opt/pgpro/ent-17/src/citus-test/regress/expected/multi_mx_node_metadata.out /home/test/pg-tests/tmp/citus/check-multi-mx/results/multi_mx_node_metadata.out
--- /home/test/pg-tests/tmp/citus/check-multi-mx/expected/multi_mx_node_metadata.out.modified 2025-08-05 09:02:39.131772001 +0300
+++ /home/test/pg-tests/tmp/citus/check-multi-mx/results/multi_mx_node_metadata.out.modified 2025-08-05 09:02:39.146772062 +0300
@@ -284,26 +284,23 @@
(1 row)
select wait_until_metadata_sync(30000);
wait_until_metadata_sync
--------------------------
(1 row)
select pid as pid_after_error from pg_stat_activity where application_name like 'Citus Met%' \gset
+no rows returned for \gset
select :pid_before_error != :pid_after_error AS metadata_sync_restarted;
- metadata_sync_restarted
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR: syntax error at or near ":" at character 17
SELECT trigger_metadata_sync();
trigger_metadata_sync
-----------------------
(1 row)
SELECT wait_until_metadata_sync(30000);
wait_until_metadata_sync
--------------------------
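The failure mode above is that the pg_stat_activity lookup returned no rows (the maintenance daemon was not visible at that instant), so \gset left :pid_after_error unset and the next statement no longer parsed. A form of the query that always returns exactly one row would avoid the cascade (a sketch, not a proposed fix for the test itself):
select coalesce((select pid from pg_stat_activity where application_name like 'Citus Met%' limit 1), 0) as pid_after_error \gset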
- multi_metadata_sync
for i in `seq 100000`; do echo "ITERATION $i"; make check-full || break; done
See attached regression.diffs
- CitusThreeWorkersManyShardsClusterConfig
This configuration fails because the OOM killer terminates a test process on a virtual machine with 4 GB of RAM (a kernel-log check is sketched after the test output below).
python3.10 -m pipenv run make check-arbitrary-configs CONFIGS=CitusThreeWorkersManyShardsClusterConfig
CitusThreeWorkersManyShardsClusterConfig: FAIL: see ./tmp_citus_test/dataCitusThreeWorkersManyShardsClusterConfig/run.out runtime: 288.97670102119446 seconds
test@ubuntu-2204:~/work/citus/src/test/regress/tmp_citus_test/dataCitusThreeWorkersManyShardsClusterConfig$ cat regression.out
# using postmaster on Unix socket, port 10215
# parallel group (3 tests): prepared_statements_3 prepared_statements_2 prepared_statements_1
ok 1 + prepared_statements_1 27788 ms
ok 2 + prepared_statements_2 7391 ms
ok 3 + prepared_statements_3 551 ms
ok 4 - prepared_statements_4 35234 ms
# parallel group (3 tests): ch_benchmarks_1 ch_benchmarks_3 ch_benchmarks_2
ok 5 + ch_benchmarks_1 32910 ms
ok 6 + ch_benchmarks_2 52325 ms
ok 7 + ch_benchmarks_3 51747 ms
# parallel group (3 tests): ch_benchmarks_5 ch_benchmarks_4 ch_benchmarks_6
ok 8 + ch_benchmarks_4 17959 ms
ok 9 + ch_benchmarks_5 1103 ms
ok 10 + ch_benchmarks_6 18617 ms
# parallel group (2 tests): intermediate_result_pruning_queries_2 intermediate_result_pruning_queries_1
ok 11 + intermediate_result_pruning_queries_1 1897 ms
ok 12 + intermediate_result_pruning_queries_2 1033 ms
# parallel group (2 tests): dropped_columns_1 distributed_planning
ok 13 + dropped_columns_1 1949 ms
ok 14 + distributed_planning 38467 ms
# parallel group (3 tests): nested_execution arbitrary_configs_recurring_outer_join local_dist_join
not ok 15 + local_dist_join 18210 ms
# (test process exited with exit code 2)
ok 16 + nested_execution 718 ms
ok 17 + arbitrary_configs_recurring_outer_join 1628 ms
# parallel group (2 tests): citus_run_command connectivity_checks
not ok 18 + connectivity_checks 1982 ms
# (test process exited with exit code 2)
not ok 19 + citus_run_command 1978 ms
# (test process exited with exit code 2)
not ok 20 - schemas 2 ms
# (test process exited with exit code 2)
not ok 21 - views 7 ms
# (test process exited with exit code 2)
not ok 22 - sequences 2 ms
# (test process exited with exit code 2)
not ok 23 - functions 2 ms
# (test process exited with exit code 2)
not ok 24 - arbitrary_configs_truncate 2 ms
# (test process exited with exit code 2)
not ok 25 - arbitrary_configs_truncate_cascade 2 ms
# (test process exited with exit code 2)
not ok 26 - arbitrary_configs_truncate_partition 2 ms
# (test process exited with exit code 2)
ok 27 - arbitrary_configs_alter_table_add_constraint_without_name 11 ms
ok 28 - merge_arbitrary 935 ms
ok 29 - arbitrary_configs_router 684 ms
1..29
# 10 of 29 tests failed.
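That the processes behind the "exit code 2" failures above were indeed killed by the OOM killer can be confirmed from the kernel log (a sketch; message wording varies across kernel versions):
dmesg -T | grep -iE 'out of memory|killed process'
# or on systemd-based systems:
journalctl -k | grep -i 'killed process'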
Best regards,
Egor Chindyaskin
Postgres Professional: https://postgrespro.com/
Could you please share the list of operating systems you are testing against, and whether these failures correlate with any particular OS? Do you clean up the database between test iterations?