pglogical
add_table test fails intermittently on REL2_0_STABLE branch
Running make regresscheck on the tip of the REL2_0_STABLE branch, I see intermittent failures (maybe 20-40% of the time) in the add_table test:
Makefile:112: recipe for target 'regresscheck' failed
*** /tmp/regresscheck.EbhBXf/expected/add_table.out 2017-09-25 17:22:57.620728173 +0000
--- /tmp/regresscheck.EbhBXf/./regression_output/results/add_table.out 2017-09-25 17:23:53.884723352 +0000
***************
*** 217,223 ****
SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status FROM pglogical.local_sync_status ORDER BY 2,3,4;
sync_kind | sync_subid | sync_nspname | sync_relname | sync_status
-----------+------------+-------------------+--------------------+-------------
! d | 3848008564 | public | test_nosync | r
d | 3848008564 | public | test_publicschema | r
d | 3848008564 | strange.schema-IS | test_strangeschema | r
f | 3848008564 | | | r
--- 217,223 ----
SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status FROM pglogical.local_sync_status ORDER BY 2,3,4;
sync_kind | sync_subid | sync_nspname | sync_relname | sync_status
-----------+------------+-------------------+--------------------+-------------
! d | 3848008564 | public | test_nosync | i
d | 3848008564 | public | test_publicschema | r
d | 3848008564 | strange.schema-IS | test_strangeschema | r
f | 3848008564 | | | r
***************
*** 226,234 ****
SELECT * FROM public.test_nosync;
id | data
----+------
! 1 | a
! 2 | b
! (2 rows)
DELETE FROM public.test_publicschema WHERE id > 1;
SELECT * FROM public.test_publicschema;
--- 226,232 ----
SELECT * FROM public.test_nosync;
id | data
----+------
! (0 rows)
DELETE FROM public.test_publicschema WHERE id > 1;
SELECT * FROM public.test_publicschema;
***************
*** 258,264 ****
sync_kind | sync_subid | sync_nspname | sync_relname | sync_status
-----------+------------+-------------------+--------------------+-------------
d | 3848008564 | public | test_nosync | r
! d | 3848008564 | public | test_publicschema | r
d | 3848008564 | strange.schema-IS | test_strangeschema | r
f | 3848008564 | | | r
(4 rows)
--- 256,262 ----
sync_kind | sync_subid | sync_nspname | sync_relname | sync_status
-----------+------------+-------------------+--------------------+-------------
d | 3848008564 | public | test_nosync | r
! d | 3848008564 | public | test_publicschema | i
d | 3848008564 | strange.schema-IS | test_strangeschema | r
f | 3848008564 | | | r
(4 rows)
***************
*** 266,283 ****
SELECT * FROM public.test_publicschema;
data | id
------+----
! a | 1
! b | 2
! c | 3
! d | 4
! (4 rows)
\x
SELECT * FROM pglogical.show_subscription_table('test_subscription', 'test_publicschema');
-[ RECORD 1 ]--------------
nspname | public
relname | test_publicschema
! status | synchronized
\x
BEGIN;
--- 264,277 ----
SELECT * FROM public.test_publicschema;
data | id
------+----
! (0 rows)
\x
SELECT * FROM pglogical.show_subscription_table('test_subscription', 'test_publicschema');
-[ RECORD 1 ]--------------
nspname | public
relname | test_publicschema
! status | sync_init
\x
BEGIN;
======================================================================
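For anyone trying to reproduce this, the state the test is asserting can be inspected directly on the subscriber. A minimal sketch using the same catalog and function the test itself queries ('i' appears to be the initial-sync state and 'r' the ready state, matching the sync_init/synchronized output in the diff above):

-- Per-table sync state on the subscriber; the test expects 'r' (ready),
-- while the failing runs still show 'i' (initial sync not finished).
SELECT sync_kind, sync_nspname, sync_relname, sync_status
  FROM pglogical.local_sync_status
 ORDER BY sync_nspname, sync_relname;

-- The same information for a single table, in readable form:
SELECT * FROM pglogical.show_subscription_table('test_subscription',
                                                'test_publicschema');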
We improved the tests' stability in 2.1 (which is part of the REL2_0_STABLE branch); please check if it helped.
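A quick way to confirm which pglogical version a given build is actually running (the extension exposes this via SQL):

SELECT pglogical.pglogical_version();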
It still fails (I was going to file a new issue, but was then presented with this one as a possibly already existing one); see (I hope it is publicly available): https://pgdgbuild.dus.dg-i.net/job/pglogical-binaries/62/architecture=amd64,distribution=sid/console
### 9.6 ###
[...]
test add_table ... FAILED
[...]
*** /tmp/regresscheck.G64AXz/expected/add_table.out 2019-01-22 14:42:12.789458056 +0000
--- /tmp/regresscheck.G64AXz/./regression_output/results/add_table.out 2019-01-22 14:49:40.145445534 +0000
***************
*** 138,154 ****
(1 row)
SELECT pglogical.wait_for_table_sync_complete('test_subscription', '"strange.schema-IS".test_strangeschema');
! wait_for_table_sync_complete
! ------------------------------
!
! (1 row)
!
RESET statement_timeout;
SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
sync_kind | sync_subid | sync_nspname | sync_relname | ?column?
-----------+------------+-------------------+--------------------+----------
d | 3848008564 | public | test_publicschema | t
! d | 3848008564 | strange.schema-IS | test_strangeschema | t
f | 3848008564 | | | t
(3 rows)
--- 138,150 ----
(1 row)
SELECT pglogical.wait_for_table_sync_complete('test_subscription', '"strange.schema-IS".test_strangeschema');
! ERROR: canceling statement due to statement timeout
RESET statement_timeout;
SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
sync_kind | sync_subid | sync_nspname | sync_relname | ?column?
-----------+------------+-------------------+--------------------+----------
d | 3848008564 | public | test_publicschema | t
! d | 3848008564 | strange.schema-IS | test_strangeschema | f
f | 3848008564 | | | t
(3 rows)
***************
*** 167,173 ****
SELECT count(1) FROM pg_replication_slots;
count
-------
! 1
(1 row)
INSERT INTO public.test_publicschema VALUES(3, 'c');
--- 163,169 ----
SELECT count(1) FROM pg_replication_slots;
count
-------
! 2
(1 row)
INSERT INTO public.test_publicschema VALUES(3, 'c');
======================================================================
Same for v10 and v11.
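The pg_replication_slots hunk in that diff (count 2 instead of 1) also suggests the temporary per-table sync slot had not been dropped yet when the test counted, i.e. table sync was still in flight. A sketch for watching this on the provider while the test runs (standard catalog view, nothing pglogical-specific):

-- An extra, still-active slot here while add_table runs would indicate
-- that a table sync worker has not finished and released its slot yet.
SELECT slot_name, slot_type, active FROM pg_replication_slots;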
Hrm, the missing
! wait_for_table_sync_complete
! ------------------------------
!
! (1 row)
!
led me to check postmaster.log, and yeah, there are some statement timeouts firing around there. So my theory is that #194 is related: due to the increased runtime of pglogical.wait_for_table_sync_complete(), we run into statement timeouts and the tests fail (often, though not always).
After I bumped the timeout to 60s, it passed fine.
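For the record, a sketch of that change, assuming the statement_timeout wrapper the test already uses around the sync wait (the exact original timeout value may differ per branch):

-- Bumped timeout around the sync wait in the add_table test:
SET statement_timeout = '60s';
SELECT pglogical.wait_for_table_sync_complete('test_subscription',
       '"strange.schema-IS".test_strangeschema');
RESET statement_timeout;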
Hrm, even with the timeouts bumped, I get failures in add_table like this one (see https://pgdgbuild.dus.dg-i.net/job/pglogical-binaries/64/architecture=ppc64el,distribution=buster/consoleFull):
*** /tmp/regresscheck.4coyNi/expected/add_table.out 2019-01-22 16:43:10.936896865 +0000
--- /tmp/regresscheck.4coyNi/./regression_output/results/add_table.out 2019-01-22 16:51:12.617606001 +0000
***************
*** 369,376 ****
SELECT * FROM "strange.schema-IS".test_diff_repset;
id | data
----+------
! 3 | data
! 4 | data
(2 rows)
\c :provider_dsn
--- 369,376 ----
SELECT * FROM "strange.schema-IS".test_diff_repset;
id | data
----+------
! 1 |
! 2 |
(2 rows)
\c :provider_dsn
***************
*** 392,399 ****
SELECT * FROM "strange.schema-IS".test_diff_repset;
id | data
----+------
! 3 | data
! 4 | data
5 |
6 |
(4 rows)
--- 392,399 ----
SELECT * FROM "strange.schema-IS".test_diff_repset;
id | data
----+------
! 1 |
! 2 |
5 |
6 |
(4 rows)
We are still seeing intermittent failures in the apt.postgresql.org buildfarm during wait_for_table_sync_complete(), mostly in add_table, even after cranking the timeouts up to 120s, but they are not reproducible on demand.
E.g. in the last run, stretch failed on amd64, i386 and ppc64el, but it ran fine on stretch/amd64 in the run before. Likewise, bionic/i386 failed in the second-to-last run but was fine in the last one.
This is starting to seriously impede packaging of pglogical-related extensions on apt.postgresql.org, so it would be good if somebody (@PJMODOS? @ringerc?) could take a look. Maybe it is due to the Debian patches we had to apply? Those can be found at https://salsa.debian.org/postgresql/pglogical/tree/master/debian/patches
I am working on a patchset which, as a side effect, 'allows' reproducing the issue... So be aware that this issue is being looked at and will hopefully get fixed in the next few days. Thank you for the report.
@klando great! Thanks for letting us know.
After pulling my hair out and fighting with that beast, I have not really been able to fix it. I have a vague idea of where the problem is, but no real solution yet.
What does the trick is increasing log verbosity and adding some elog() calls in relevant places. This is interesting to know, but not a fix. AFAIK the problem is only really visible during regression testing, and it is very probably related to race conditions between processes: making one process a bit slower reduces the occurrence of the failure (and it also brings the overall regression run down to 4 minutes, instead of the very long duration it can have today).
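A sketch of the logging half of that (these are standard PostgreSQL GUCs; the elog() calls themselves are C-side additions in the pglogical workers and are not shown here):

-- Crank up server log verbosity while trying to reproduce:
ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();
-- ...and revert afterwards:
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();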
Any update on an official solution for this?
Any update on an official solution for this? I am having the same problem.
Fwiw, for the Debian package I now pushed a patch that removes add_table from the regression tests as it's still failing very often. (And it is horribly slow.)
To be fair, the "horribly slow" part is mostly Debian's fault for jacking up the timeouts in the perhaps misguided hope that it might help.
True, but didn't the whole trouble start a few years ago when some minor PG update changed something, and some regression tests suddenly got a lot slower than before, without any clear indication of what had happened? The increased timeouts just double the already long runtime.