
Downed replication during pg_repack unrecoverable

Open jfinzel opened this issue 8 years ago • 7 comments

I have a strange issue that appeared today.

We are replicating from 9.4.10 (pglogical 1.2.1) to 9.6.2 (pglogical 1.2.2). It is extremely invasive for us to restart our provider cluster, so we are trying to avoid that at all costs.

This is what happened. We were running pg_repack on some of our big tables on the provider. We have done this before without any pglogical issues, but replication went down suddenly around 7:30pm last night, while one big table was being repacked. This morning, one of our replication streams showed status "replicating", yet the replication delay kept growing; it wasn't consuming any changes from the provider at all.
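
For reference, the lag can be seen from both ends; a rough sketch (illustrative queries, not the exact ones we ran):

-- On the subscriber: what pglogical thinks the subscription is doing
SELECT subscription_name, status FROM pglogical.show_subscription_status();

-- On the 9.4 provider: how far behind the walsender for the slot is, in bytes
SELECT application_name, state,
       pg_xlog_location_diff(pg_current_xlog_location(), flush_location) AS flush_lag_bytes
  FROM pg_stat_replication;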

I found this in the logs on the subscriber since the issue started:

2017-06-19 17:52:06.039 CDT,,,53661,,592af564.d19d,3,,2017-05-28 11:05:56 CDT,9/0,0,ERROR,XX000,"epoll_ctl() failed: Invalid argument",,,,,,,,,"pglogical apply 16400:3747574610"

There is very little other helpful information in the logs on either the provider or the subscriber.

With our WAL volume growing, we were forced to break replication to free up space. To do this, I ran pglogical.drop_subscription on the subscriber, but had to manually run pg_drop_replication_slot on the provider.
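
Roughly, that was something like this (names taken from the provider logs further down; the slot can only be dropped once the walsender using it has gone away):

-- On the subscriber: remove the subscription
SELECT pglogical.drop_subscription('foo_sub_sub_g');

-- On the provider: the slot was left behind, so drop it by hand
SELECT pg_drop_replication_slot('pgl_sub_provider5f142af_foo_subdf5f735');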

Now, I cannot get replication working again between the systems no matter what I do. The subscriber is not connecting to the provider to create the replication slot: pglogical.create_subscription appears to succeed, but no replication slot is created and the apply workers never start replicating (a quick way to see this is sketched after the list below). I have tried:

  • Recreating subscription several times
  • Creating slot manually on provider and adding to pg_replication_origin on subscriber
  • Recreating node on provider (including creating node with different name)
  • Restarting subscriber cluster
  • Replicating using a new replication set name
  • Replicating the same data to a different subscriber. I WAS able to get this working, however it still doesn't tell me what is wrong. This different subscriber did successfully create its replication slot on the provider.
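
To be concrete, pglogical.show_subscription_status() on the subscriber keeps reporting the subscription, while on the provider no slot for it ever shows up; roughly (an illustrative check):

-- On the provider: no pgl_* slot ever appears for the recreated subscription
SELECT slot_name, plugin, active, restart_lsn
  FROM pg_replication_slots
 WHERE slot_name LIKE 'pgl_%';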

Any suggestions much appreciated. I can't give much more detail because this isn't something I can reproduce. I can't seem to get replication working no matter what I do.

jfinzel avatar Jun 21 '17 02:06 jfinzel

Please save the full logs. The epoll_ctl() failed error just indicates a broken socket, most likely terminated by the other end, so look for corresponding output there.

Did you by any chance capture core files (gcore) of an apply worker and output plugin before ending replication?

If you run with log_min_messages = debug1, do you see anything more interesting? You can change log_min_messages without a restart, it only needs a soft reload.
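
For example (either edit postgresql.conf and reload, or on 9.4+ something along these lines):

-- Raise logging detail without a restart; log_min_messages only needs a reload
ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();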

When trying to recreate the subscription are you using a new, blank downstream database? What's in the logs?

ringerc avatar Jun 21 '17 02:06 ringerc

From a quick look at pg_repack, the way it does table rewrites is almost guaranteed to break logical decoding unless there are zero unconsumed changes for the given table, because it does not build the rewrite-mapping information for logical decoding that a standard heap rewrite in Postgres does.

I am afraid there is nothing we can do about that inside pglogical; it's an issue with pg_repack.
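
If you want a rough check on the provider for whether a slot still has unconsumed WAL before repacking (a sketch only; restart_lsn just approximates how far behind consumption is):

-- Approximate WAL still retained for (and not yet consumed by) each logical slot
SELECT slot_name,
       pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS retained_wal_bytes
  FROM pg_replication_slots
 WHERE slot_type = 'logical';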

PJMODOS avatar Jun 21 '17 06:06 PJMODOS

Thank you both so much for the feedback.

@ringerc I definitely didn't capture core files. I will keep that in mind for the future.

I found this in the provider logs at the same time I had the other error above - this again occurred while the big table was being repacked:

2017-06-19 17:52:06.039 CDT,"foouser","db",26684,"0.0.0.0:0",592af564.683c,5,"idle",2017-05-28 11:05:56 CDT,62/0,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:06.039 CDT,"foouser","db",26684,"0.0.0.0:0",592af564.683c,6,"idle",2017-05-28 11:05:56 CDT,,0,LOG,00000,"disconnection: session time: 534:46:09.387 user=foouser database=db host=0.0.0.0 port=0",,,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:11.127 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,2,"authentication",2017-06-19 17:52:11 CDT,62/54722775,0,LOG,00000,"replication connection authorized: user=foouser SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on)",,,,,,,,,""
2017-06-19 17:52:11.134 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,3,"idle",2017-06-19 17:52:11 CDT,62/0,0,LOG,00000,"starting logical decoding for slot ""pgl_sub_provider5f142af_foo_subdf5f735""","streaming transactions committing after 506/B3B1B5D0, reading WAL from 4FA/F71C0688",,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:11.136 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,4,"idle",2017-06-19 17:52:11 CDT,62/0,0,LOG,00000,"logical decoding found consistent point at 4FA/F71C0688","There are no running transactions.",,,,,,,,"foo_sub_sub_g"

Even though the workers restarted, it looks like from this point on no data was being replicated to subscriber.

To answer your question, no, it is not a blank downstream. I am trying to keep all of the data already replicated for the tables whose replication broke, and fix up the deltas. Also, there are 3 other WORKING replication streams in the same database. For those reasons, I couldn't start from scratch (as I mentioned above, I WAS able to get replication working to a fresh, blank downstream database, but I couldn't get it working on the subscriber of interest). However, see below.

So last night, while playing with this, I turned logging up to debug1 on both the provider and the subscriber and reloaded the configs, and when I recreated the subscription it just started working again. I can't explain why. I don't think I did anything significant, although up to that point I hadn't done a config reload on the provider.
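
For anyone hitting the same thing, the recreate is roughly of this shape (the DSN is the one from the logs below; the replication set is a placeholder, and synchronize_data := false is one way to keep data that is already in place rather than re-copying it):

-- On the subscriber (placeholder replication set; adjust to your own)
SELECT pglogical.create_subscription(
    subscription_name := 'foo_sub_sub_b',
    provider_dsn := 'host=10.7.198.200 dbname=db user=foouser',
    replication_sets := ARRAY['default'],
    synchronize_structure := false,
    synchronize_data := false
);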

At this point I'm not sure if this will be of interest, but here are the logs from when I did this and it actually worked.

Provider:

2017-06-19 17:52:06.039 CDT,"foouser","db",26684,"0.0.0.0:0",592af564.683c,5,"idle",2017-05-28 11:05:56 CDT,62/0,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:06.039 CDT,"foouser","db",26684,"0.0.0.0:0",592af564.683c,6,"idle",2017-05-28 11:05:56 CDT,,0,LOG,00000,"disconnection: session time: 534:46:09.387 user=foouser database=db host=0.0.0.0 port=0",,,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:11.127 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,2,"authentication",2017-06-19 17:52:11 CDT,62/54722775,0,LOG,00000,"replication connection authorized: user=foouser SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on)",,,,,,,,,""
2017-06-19 17:52:11.134 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,3,"idle",2017-06-19 17:52:11 CDT,62/0,0,LOG,00000,"starting logical decoding for slot ""pgl_sub_provider5f142af_foo_subdf5f735""","streaming transactions committing after 506/B3B1B5D0, reading WAL from 4FA/F71C0688",,,,,,,,"foo_sub_sub_g"
2017-06-19 17:52:11.136 CDT,"foouser","db",74564,"0.0.0.0:0",5948559b.12344,4,"idle",2017-06-19 17:52:11 CDT,62/0,0,LOG,00000,"logical decoding found consistent point at 4FA/F71C0688","There are no running transactions.",,,,,,,,"foo_sub_sub_g"

Subscriber:

2017-06-20 22:38:11.061 CDT,,,21246,,59497637.52fe,16,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"registering background worker ""pglogical apply 16400:2708683776""",,,,,,,,,""
2017-06-20 22:38:11.062 CDT,,,21246,,59497637.52fe,17,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"starting background worker process ""pglogical apply 16400:2708683776""",,,,,,,,,""
2017-06-20 22:38:11.067 CDT,,,43054,,5949ea23.a82e,1,,2017-06-20 22:38:11 CDT,10/0,0,LOG,00000,"starting apply for subscription foo_sub_sub_b",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.068 CDT,,,43054,,5949ea23.a82e,2,,2017-06-20 22:38:11 CDT,10/0,0,INFO,00000,"initializing subscriber foo_sub_sub_b",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.109 CDT,,,43054,,5949ea23.a82e,3,,2017-06-20 22:38:11 CDT,10/0,0,INFO,00000,"synchronizing data",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.143 CDT,,,43054,,5949ea23.a82e,4,,2017-06-20 22:38:11 CDT,10/0,0,INFO,00000,"finished synchronization of subscriber foo_sub_sub_b, ready to enter normal replication",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.143 CDT,,,43054,,5949ea23.a82e,5,,2017-06-20 22:38:11 CDT,10/0,0,DEBUG,00000,"connecting to provider provider:cluster:db, dsn host=10.7.198.200 dbname=db port=0 user=foouser",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.967 CDT,,,43054,,5949ea23.a82e,6,,2017-06-20 22:38:11 CDT,10/0,0,DEBUG,00000,"upstream Pg version is 9.4.9",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:11.967 CDT,,,43054,,5949ea23.a82e,7,,2017-06-20 22:38:11 CDT,10/0,0,DEBUG,00000,"changeset origin forwarding enabled: f",,,,,,,,,"pglogical apply 16400:2708683776"
2017-06-20 22:38:22.692 CDT,,,43325,,5949ea2e.a93d,1,,2017-06-20 22:38:22 CDT,11/0,0,DEBUG,00000,"autovacuum: processing database ""postgres""",,,,,,,,,""
2017-06-20 22:38:37.718 CDT,,,43483,,5949ea3d.a9db,1,,2017-06-20 22:38:37 CDT,11/0,0,DEBUG,00000,"autovacuum: processing database ""db""",,,,,,,,,""
2017-06-20 22:38:48.287 CDT,,,21258,,59497637.530a,862,,2017-06-20 14:23:35 CDT,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2017-06-20 22:38:48.288 CDT,,,21258,,59497637.530a,863,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2017-06-20 22:38:48.342 CDT,,,21258,,59497637.530a,864,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"transaction log switch forced (archive_timeout=300)",,,,,,,,,""
2017-06-20 22:38:48.749 CDT,,,21262,,59497637.530e,7,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"archived transaction log file ""000000010000013D000000BE""",,,,,,,,,""
2017-06-20 22:38:52.430 CDT,,,21246,,59497637.52fe,18,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"registering background worker ""pglogical manager 12409""",,,,,,,,,""
2017-06-20 22:38:52.430 CDT,,,21246,,59497637.52fe,19,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"starting background worker process ""pglogical manager 12409""",,,,,,,,,""
2017-06-20 22:38:52.434 CDT,,,43877,,5949ea4c.ab65,1,,2017-06-20 22:38:52 CDT,11/0,0,LOG,00000,"manager worker [43877] at slot 5 generation 1 detaching cleanly",,,,,,,,,"pglogical manager 12409"
2017-06-20 22:38:52.435 CDT,,,21246,,59497637.52fe,20,,2017-06-20 14:23:35 CDT,,0,DEBUG,00000,"worker process: pglogical manager 12409 (PID 43877) exited with exit code 0",,,,,,,,,""

jfinzel avatar Jun 21 '17 19:06 jfinzel

@PJMODOS thanks much for the feedback. As you see I posted an issue on their repo. It's a nice extension so hopefully we will see progress on this.

jfinzel avatar Jun 21 '17 19:06 jfinzel

@PJMODOS Have you got any suggestion about fixing pg_repack to avoid breaking logical replication? Thank you.

dvarrazzo avatar Sep 25 '17 02:09 dvarrazzo

FYI, I had a similar problem in production a few days ago. I didn't really look into what was erroring on the server, and only managed to restart replication by dropping the replication slots and resyncing the tables. I am eager to solve this problem in pg_repack, so any help would be welcome. Thank you.

dvarrazzo avatar Sep 25 '17 10:09 dvarrazzo

@PJMODOS @dvarrazzo it has been 2 years since this happened, and I'm actually still not convinced it has anything to do with pg_repack. I am convinced the only reason the above issue happened was wal_sender_timeout: unless you raise this timeout, the provider just keeps erroring and retrying without end.

Since this happened, we have raised wal_sender_timeout to 1 hour across all clusters and have never seen this issue again.
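
On the provider that's just a reloadable setting, e.g.:

-- wal_sender_timeout is a sighup setting, so a reload is enough
ALTER SYSTEM SET wal_sender_timeout = '1h';
SELECT pg_reload_conf();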

I don't really agree with that assessment, @PJMODOS, because we have never seen pg_repack break logical decoding, even on extremely high-traffic and very large tables. So IF there is a problem, it seems extremely unlikely to bite, given our experience. As I mentioned, I have since come to understand that the issue we faced years ago was due merely to wal_sender_timeout. I actually can't see how pg_repack could break logical decoding: every row-level change for the given relation still goes through WAL as usual, and nothing is lost. Replication works as normal while the main table is being repacked; then the table is locked and the relfilenode swap happens, and from that point on new traffic is decoded as WAL for the same relation, even though the relfilenode changed.

I don't know how the C code works here, but our experience is that pg_repack does NOT break logical decoding. It would be really important for the community to be able to close this issue if it's truly a non-issue, as it appears to be from our assessment...

jfinzel avatar Oct 07 '19 15:10 jfinzel