ruby-pg
wait_socket_readable locks up forever on DNS failover
Original report by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).
We've been testing failover with PostgreSQL on Aiven and noticed that when DNS failover happens, active queries get stuck. The hang appears to be in wait_socket_readable and then rb_wait_for_single_fd. Here is some debugging output: https://gist.github.com/tjwallace/ddd12ae8ffb03c248f8aa22af9ab8789.
I’m not sure if this is a ruby-pg issue or a ruby issue, but thought I’d drop it here as a start.
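A minimal sketch, outside Rails/Sidekiq, of the kind of loop where such a hang would show up (the connection URL is a placeholder; it assumes the server's address changes or stops responding while a query is in flight):
require 'pg'

conn = PG.connect(ENV.fetch('DATABASE_URL'))
loop do
  # With ruby-pg's asynchronous API the reply is awaited in
  # wait_socket_readable / rb_wait_for_single_fd, which is where the process gets stuck.
  conn.exec('SELECT 1')
  sleep 0.1
end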
Original comment by Chris Bandy (Bitbucket: cbandy, GitHub: cbandy).
Jeff, when you say “DNS failover” do you mean that PostgreSQL has failed over to a replica and also a DNS record has changed?
- How long did you wait for this function to return?
- Are you setting any TCP keepalive parameters?
- It looks like you're on macOS. What output do you get from the following command?
sysctl -A | grep "net.inet.tcp.*keep"
Original comment by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).
We are simulating a master failover by upgrading our database instance. This causes the DNS record to change. Here are the docs.
How long did you wait for this function to return?
Around 50 minutes. The lockup only seems to happen in our Sidekiq workers (not the Rails/Puma instances), and only when the workers are under constant load during the failover.
Are you setting any TCP keepalive parameters?
We've tried with and without keepalive parameters.
default: &default
  adapter: postgresql
  encoding: unicode
  # For details on connection pooling, see Rails configuration guide
  # http://guides.rubyonrails.org/configuring.html#database-pooling
  pool: <%= ENV.fetch('RAILS_MAX_THREADS') %>
  url: <%= ENV.fetch('DATABASE_URL') %>
  connect_timeout: 1
  checkout_timeout: 1
  keepalives: 1
  keepalives_idle: 5
  keepalives_interval: 5
  keepalives_count: 2
  variables:
    statement_timeout: <%= ENV.fetch('DATABASE_STATEMENT_TIMEOUT', '5s') %>
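Those keepalive settings are standard libpq connection parameters, so the same thing can also be tried outside of Rails with a bare connection; a minimal sketch (host, database and user are placeholders):
require 'pg'

conn = PG.connect(
  host: 'db.example.com',   # placeholder
  dbname: 'app',            # placeholder
  user: 'app',              # placeholder
  keepalives: 1,            # enable TCP keepalives on the socket
  keepalives_idle: 5,       # seconds of inactivity before the first probe
  keepalives_interval: 5,   # seconds between probes
  keepalives_count: 2       # unanswered probes before the connection is considered dead
)
conn.exec('SELECT 1')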
It looks like you're on macOS. What output do you get from the following command?
sysctl -A | grep "net.inet.tcp.*keep"
$ sysctl -A | grep "net.inet.tcp.*keep"
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
net.inet.tcp.always_keepalive: 0
We’ve been testing by using an SSH tunnel to gain access to the database (it’s in a VPC). We were also able to reproduce it in our staging environment (Docker + Docker Swarm on Ubuntu).
Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).
I currently have no idea about the cause of this starvation. However, you could try disabling Ruby’s internal socket waiting and switching to libpq’s own waiting by using its synchronous functions. There is an undocumented switch for this (set in a Rails initializer, for example):
PG::Connection.async_api = false
Asynchronous methods are enabled by default since they have some advantages, but switching to the synchronous methods could show whether this is a Ruby-specific issue or something more general.
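For example, as a one-line Rails initializer (the file name is just illustrative):
# config/initializers/pg_async_api.rb
# Make ruby-pg use libpq's blocking (synchronous) functions instead of
# Ruby's own socket waiting.
PG::Connection.async_api = false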
Original comment by Jeff Wallace (Bitbucket: tjwallace, GitHub: tjwallace).
Thanks @larskanis, we tried PG::Connection.async_api = false and the Sidekiq worker still locked up.
Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).
And did you check the gdb backtrace? It should no longer contain wait_socket_readable and rb_wait_for_single_fd when using synchronous methods.
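If it is still reproducible, one common way to capture that backtrace from the stuck worker is to attach gdb to the process (the PID is a placeholder):
gdb -p <worker_pid>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit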
A lot of IO code has changed in both ruby-pg and Ruby since then, so I'll close this issue now.