pglogical add_table test fails intermittently on REL2_0

Running make regresscheck on the tip of REL2_0_STABLE branch I see intermittent (maybe 20-40% of the time) failures in the add_table test:

Makefile:112: recipe for target 'regresscheck' failed
*** /tmp/regresscheck.EbhBXf/expected/add_table.out     2017-09-25 17:22:57.620728173 +0000
--- /tmp/regresscheck.EbhBXf/./regression_output/results/add_table.out  2017-09-25 17:23:53.884723352 +0000
***************
*** 217,223 ****
  SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status FROM pglogical.local_sync_status ORDER BY 2,3,4;
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status 
  -----------+------------+-------------------+--------------------+-------------
!  d         | 3848008564 | public            | test_nosync        | r
   d         | 3848008564 | public            | test_publicschema  | r
   d         | 3848008564 | strange.schema-IS | test_strangeschema | r
   f         | 3848008564 |                   |                    | r
--- 217,223 ----
  SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status FROM pglogical.local_sync_status ORDER BY 2,3,4;
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status 
  -----------+------------+-------------------+--------------------+-------------
!  d         | 3848008564 | public            | test_nosync        | i
   d         | 3848008564 | public            | test_publicschema  | r
   d         | 3848008564 | strange.schema-IS | test_strangeschema | r
   f         | 3848008564 |                   |                    | r
***************
*** 226,234 ****
  SELECT * FROM public.test_nosync;
   id | data 
  ----+------
!   1 | a
!   2 | b
! (2 rows) 
  
  DELETE FROM public.test_publicschema WHERE id > 1;
  SELECT * FROM public.test_publicschema;
--- 226,232 ----
  SELECT * FROM public.test_nosync;
   id | data 
  ----+------
! (0 rows) 
  
  DELETE FROM public.test_publicschema WHERE id > 1;
  SELECT * FROM public.test_publicschema;
***************
*** 258,264 ****
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status 
  -----------+------------+-------------------+--------------------+-------------
   d         | 3848008564 | public            | test_nosync        | r      
!  d         | 3848008564 | public            | test_publicschema  | r
   d         | 3848008564 | strange.schema-IS | test_strangeschema | r
   f         | 3848008564 |                   |                    | r      
  (4 rows) 
--- 256,262 ----
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status 
  -----------+------------+-------------------+--------------------+-------------
   d         | 3848008564 | public            | test_nosync        | r
!  d         | 3848008564 | public            | test_publicschema  | i
   d         | 3848008564 | strange.schema-IS | test_strangeschema | r
   f         | 3848008564 |                   |                    | r
  (4 rows)
***************
*** 266,283 ****
  SELECT * FROM public.test_publicschema;
   data | id
  ------+----
!  a    |  1
!  b    |  2
!  c    |  3
!  d    |  4
! (4 rows)

  \x
  SELECT * FROM pglogical.show_subscription_table('test_subscription', 'test_publicschema');
  -[ RECORD 1 ]--------------
  nspname | public
  relname | test_publicschema
! status  | synchronized

  \x
  BEGIN;
--- 264,277 ----
  SELECT * FROM public.test_publicschema;
   data | id
  ------+----
! (0 rows)

  \x
  SELECT * FROM pglogical.show_subscription_table('test_subscription', 'test_publicschema');
  -[ RECORD 1 ]--------------
  nspname | public
  relname | test_publicschema
! status  | sync_init

  \x
  BEGIN;

======================================================================

Sep 25 '17 17:09 mbanck

We improved the tests stability in 2.1 (which is part of REL2_0_STABLE branch), please check if it helped.

Oct 09 '17 08:10 PJMODOS

It still fails (I was going to file a new issue, but then was presented with this one as a possible already existing one), see (I hope it is publically available): https://pgdgbuild.dus.dg-i.net/job/pglogical-binaries/62/architecture=amd64,distribution=sid/console

### 9.6 ###
[...]
test add_table                ... FAILED
[...]
*** /tmp/regresscheck.G64AXz/expected/add_table.out     2019-01-22 14:42:12.789458056 +0000
--- /tmp/regresscheck.G64AXz/./regression_output/results/add_table.out  2019-01-22 14:49:40.145445534 +0000
***************
*** 138,154 ****
  (1 row)  
  
  SELECT pglogical.wait_for_table_sync_complete('test_subscription', '"strange.schema-IS".test_strangeschema');
!  wait_for_table_sync_complete 
! ------------------------------
!  
! (1 row)
! 
  RESET statement_timeout;
  SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | ?column? 
  -----------+------------+-------------------+--------------------+----------
   d         | 3848008564 | public            | test_publicschema  | t
!  d         | 3848008564 | strange.schema-IS | test_strangeschema | t
   f         | 3848008564 |                   |                    | t      
  (3 rows) 
  
--- 138,150 ----
  (1 row)  
  
  SELECT pglogical.wait_for_table_sync_complete('test_subscription', '"strange.schema-IS".test_strangeschema');
! ERROR:  canceling statement due to statement timeout
  RESET statement_timeout;
  SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
   sync_kind | sync_subid |   sync_nspname    |    sync_relname    | ?column? 
  -----------+------------+-------------------+--------------------+----------
   d         | 3848008564 | public            | test_publicschema  | t
!  d         | 3848008564 | strange.schema-IS | test_strangeschema | f
   f         | 3848008564 |                   |                    | t      
  (3 rows) 

***************
*** 167,173 ****
  SELECT count(1) FROM pg_replication_slots;
   count    
  -------
!      1
  (1 row)
   
  INSERT INTO public.test_publicschema VALUES(3, 'c');
--- 163,169 ----
  SELECT count(1) FROM pg_replication_slots;
   count    
  -------
!      2
  (1 row)
   
  INSERT INTO public.test_publicschema VALUES(3, 'c');

======================================================================

Same for v10 and v11.

Jan 22 '19 15:01 mbanck

Hrm, the missing

!  wait_for_table_sync_complete 
! ------------------------------
!  
! (1 row)
!

lead me to check the postmaster.log and yeah, there's some statement timeouts firing around there. So my theory is that #194 is related to this, as in, due to increased runtime of pglogical.wait_for_table_sync_complete() we run into statement timeouts and the tests (sometimes, but often) fail.

After I bumped the timeout to 60s, it passed fine.

Jan 22 '19 16:01 mbanck

Hrm, even with the timeouts bumped I get failures in add_table like (see https://pgdgbuild.dus.dg-i.net/job/pglogical-binaries/64/architecture=ppc64el,distribution=buster/consoleFull):

*** /tmp/regresscheck.4coyNi/expected/add_table.out     2019-01-22 16:43:10.936896865 +0000
--- /tmp/regresscheck.4coyNi/./regression_output/results/add_table.out  2019-01-22 16:51:12.617606001 +0000
***************
*** 369,376 ****
  SELECT * FROM "strange.schema-IS".test_diff_repset;
   id | data 
  ----+------
!   3 | data 
!   4 | data 
  (2 rows) 
  
  \c :provider_dsn
--- 369,376 ----
  SELECT * FROM "strange.schema-IS".test_diff_repset;
   id | data 
  ----+------
!   1 | 
!   2 | 
  (2 rows) 
  
  \c :provider_dsn
***************
*** 392,399 ****
  SELECT * FROM "strange.schema-IS".test_diff_repset;
   id | data 
  ----+------
!   3 | data 
!   4 | data 
    5 | 
    6 | 
  (4 rows) 
--- 392,399 ----
  SELECT * FROM "strange.schema-IS".test_diff_repset;
   id | data 
  ----+------
!   1 | 
!   2 | 
    5 | 
    6 | 
  (4 rows)

Jan 22 '19 20:01 mbanck

We are still seeing intermittent failures in the apt.postgresql.org buildfarm during wait_for_table_sync_complete() mostly in add_table even if we crank up the timeouts to 120s, but they are not reproducible.

E.g. in the last run, stretch failed on amd64, i386 and ppc64el, but it ran fine on stretch/amd64 in the run before. Likewise, bionic/i386 failed in the second last run but was fine in the last run.

This starts to seriously impede packaging pglogical-related extensions on apt.postgresql.org, so it would be good if somebody (@PJMODOS ? @ringerc ?) could take a look. Maybe it is due to the Debian patches we had to apply? Those can be found in https://salsa.debian.org/postgresql/pglogical/tree/master/debian/patches

Feb 07 '19 08:02 mbanck

I am working a patchset which 'allow' to reproduce the issue, by a side effect... So be aware that this issue is considered, and hopefully get fixed in the next days. Thank you for the report.

Feb 07 '19 10:02 klando

@klando great! Thanks for letting us know.

Feb 07 '19 10:02 mbanck

After pulling my hair and fighting with that beast, I have not been really able to fix it. I have a vague idea of where the problem is but not real solutions yet.

What does the trick is to increase log verbosity and add some elog() calls in relevant places. This is interesting to know but not a fix. AFAIK the problem is only very visible during regression testing and it is very probably related to some race conditions between process, making one a bit slower reduce the occurrence of the failure... (and it reduces the overall regression testing to 4 minutes, instead of the very large duration it can have today).

Mar 14 '19 10:03 klando

Any update on an official solution for this?

Aug 21 '20 18:08 timnlupo

Any update on an official solution for this? I am having the same problem

Sep 17 '20 00:09 Yuxin0610

Fwiw, for the Debian package I now pushed a patch that removes add_table from the regression tests as it's still failing very often. (And it is horribly slow.)

Oct 10 '20 16:10 df7cb

to be fair, the horribly slow is mostly Debian's fault for jacking up the timeouts in the maybe misguided hope that that might help

Oct 10 '20 16:10 mbanck

True, but wasn't the whole trouble starting a few years ago with some minor PG update changing something and then some regression tests were a lot slower than before, without any clear indication what happened? The increased timeouts just double the already long runtime.

Oct 10 '20 16:10 df7cb

pglogical
pglogical copied to clipboard

add_table test fails intermittently on REL2_0_STABLE branch

pglogical pglogical copied to clipboard

add_table test fails intermittently on REL2_0_STABLE branch

pglogical
pglogical copied to clipboard