
wait_socket_readable locks up forever on DNS failover

Open · ged opened this issue 6 years ago • 5 comments

Original report by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).


We've been testing failover with PostgreSQL on Aiven, and noticed that when DNS failover happens, active queries get stuck. It looks like the hang comes from wait_socket_readable and then rb_wait_for_single_fd. Here is some debugging output: https://gist.github.com/tjwallace/ddd12ae8ffb03c248f8aa22af9ab8789.

I’m not sure if this is a ruby-pg issue or a ruby issue, but thought I’d drop it here as a start.

ged avatar Jun 04 '19 22:06 ged

Original comment by Chris Bandy (Bitbucket: cbandy, GitHub: cbandy).


Jeff, when you say “DNS failover” do you mean that PostgreSQL has failed over to a replica and also a DNS record has changed?

  • How long did you wait for this function to return?
  • Are you setting any TCP keepalive parameters?
  • It looks like you're on macOS. What output do you get from sysctl -A | grep net.inet.tcp.*keep?

ged avatar Jun 05 '19 02:06 ged

Original comment by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).


We are simulating a master failover by upgrading our database instance. This causes the DNS record to change. Here are the docs.

How long did you wait for this function to return?

Around 50 minutes. The lockup only seems to happen in our Sidekiq workers (not the Rails/Puma instances), and only when the workers are under constant load during the failover.

Are you setting any TCP keepalive parameters?

We've tried with and without keepalive parameters.

default: &default
  adapter: postgresql
  encoding: unicode
  # For details on connection pooling, see Rails configuration guide
  # http://guides.rubyonrails.org/configuring.html#database-pooling
  pool: <%= ENV.fetch('RAILS_MAX_THREADS') %>
  url: <%= ENV.fetch('DATABASE_URL') %>
  connect_timeout: 1
  checkout_timeout: 1
  keepalives: 1
  keepalives_idle: 5
  keepalives_interval: 5
  keepalives_count: 2
  variables:
    statement_timeout: <%= ENV.fetch('DATABASE_STATEMENT_TIMEOUT', '5s') %>
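
The same keepalive settings can also be handed straight to libpq as connection parameters, independent of Rails. A minimal sketch assuming the pg gem, with a hypothetical host and database name; PG.connect accepts the hash directly, and the string form is built only to make the equivalent conninfo explicit:

# Keepalive-related libpq connection parameters, mirroring the YAML above.
# host and dbname are hypothetical placeholders.
conn_params = {
  host: "db.example.com",
  dbname: "app_production",
  connect_timeout: 1,
  keepalives: 1,          # enable TCP keepalive probes on the socket
  keepalives_idle: 5,     # seconds of idle time before the first probe
  keepalives_interval: 5, # seconds between unanswered probes
  keepalives_count: 2,    # unanswered probes before the connection is dropped
}

# libpq also accepts the same settings as a space-separated conninfo string:
conninfo = conn_params.map { |key, value| "#{key}=#{value}" }.join(" ")
# With the pg gem installed this would be: PG.connect(conn_params)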

It looks like you're on macOS. What output do you get from sysctl -A | grep net.inet.tcp.*keep?

$ sysctl -A | grep "net.inet.tcp.*keep"
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
net.inet.tcp.always_keepalive: 0
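
For reference, those macOS defaults imply a very long dead-peer detection window, and always_keepalive: 0 means probes are only sent when the application opts in (which the keepalives: 1 setting above does). A quick back-of-the-envelope calculation from the sysctl values (milliseconds):

# Worst case for an idle connection whose peer silently vanished:
# wait keepidle, then send keepcnt probes spaced keepintvl apart.
keepidle_ms  = 7_200_000 # net.inet.tcp.keepidle  (2 hours)
keepintvl_ms = 75_000    # net.inet.tcp.keepintvl (75 seconds)
keepcnt      = 8         # net.inet.tcp.keepcnt

worst_case_ms = keepidle_ms + keepcnt * keepintvl_ms
puts worst_case_ms / 60_000.0 # => 130.0 minutes until the OS notices

So with the kernel defaults a dead peer would go unnoticed for over two hours, which is consistent with a 50-minute hang that never resolved.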

We’ve been testing by using an SSH tunnel to gain access to the database (it’s in a VPC). We were also able to reproduce in our staging environment (docker + docker swarm on Ubuntu).

ged avatar Jun 05 '19 17:06 ged

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


I currently have no idea about the cause of this starvation. However, you could try disabling Ruby's internal socket waiting and switching to libpq's own waiting by using its synchronous functions. There is an undocumented switch for this (set in a Rails initializer or similar):

PG::Connection.async_api = false

Asynchronous methods are enabled by default, since they have some advantages, but switching to the synchronous methods could show whether this is a Ruby-specific issue or something more general.
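
In a Rails app the switch could live in an initializer. A minimal sketch, with a hypothetical file name and a respond_to? guard in case the undocumented accessor changes in a later pg release:

# config/initializers/pg_sync_api.rb -- file name is arbitrary
begin
  require "pg"
  # Undocumented switch: route queries through libpq's blocking
  # synchronous functions instead of waiting on the socket from
  # Ruby via rb_wait_for_single_fd.
  PG::Connection.async_api = false if PG::Connection.respond_to?(:async_api=)
rescue LoadError
  # pg gem not installed; nothing to configure
end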

ged avatar Jun 05 '19 19:06 ged

Original comment by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).


Thanks @larskanis, we tried with PG::Connection.async_api = false and the Sidekiq worker still locked up.

ged avatar Jun 05 '19 21:06 ged

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


And did you check the gdb backtrace? It should no longer contain wait_socket_readable and rb_wait_for_single_fd when using synchronous methods.

ged avatar Jun 06 '19 05:06 ged

A lot of IO code has changed in both ruby-pg and Ruby itself, so I'll close this issue now.

larskanis avatar Oct 11 '22 11:10 larskanis