Error handling for broken connections
Hi!
We're trying out your gem with Sidekiq to send VoIP notifications, but we're having some issues with broken connections.
At first we were raising an error in the connection.on(:error) {} callback, like this:
Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
  connection.on(:error) do |exception|
    raise(PushNotification::Error, "Production APNs connection error: #{exception}")
  end
end
That was a really bad idea since it crashed all of Sidekiq, forcing it to restart. We fixed this, and now we're just reporting to our error service instead:
Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
  connection.on(:error) do |exception|
    Sentry.capture_exception(exception)
  end
end
Now, occasionally we get this error reported:
Errno::ECONNRESET: Connection reset by peer
from openssl (3.2.0) lib/openssl/buffering.rb:211:in `sysread_nonblock'
from openssl (3.2.0) lib/openssl/buffering.rb:211:in `read_nonblock'
from net-http2 (0.18.5) lib/net-http2/client.rb:145:in `block in socket_loop'
from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `loop'
from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `socket_loop'
from net-http2 (0.18.5) lib/net-http2/client.rb:114:in `block (2 levels) in ensure_open'
It's reported in the callback and then 60s later we get a timeout here:
connection_pool(ios_voip_push_token).with do |connection|
  response = connection.push(apnotic_notification(notification, ios_voip_push_token))
  raise(TimeoutError) if response.nil?
  [...]
end
I guess we can pass a shorter timeout to the push method, since 60 seconds seems fairly high.
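Something like this is what I have in mind, assuming push accepts a timeout: option and still returns nil when the timeout expires (sketch only; 10 seconds is just an example value):

connection_pool(ios_voip_push_token).with do |connection|
  # Assumption: timeout: is forwarded to the underlying client, overriding the 60s default
  response = connection.push(apnotic_notification(notification, ios_voip_push_token), timeout: 10)
  raise(TimeoutError) if response.nil?
  [...]
end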
Anyway, once this started happening it happened a lot: almost all our pushes got this connection reset error. Our push jobs are not retried, but I don't think retrying would help either, since the connections don't seem to be "healed".
Could there be an issue where connections are stuck in a broken state? Or are we supposed to handle these errors differently?
This happens for us a lot too. We use push notifications to send print messages to iPads, and the connection is periodically reset; we then have to wait out the timeout before we can try again. This is causing a poor experience for our clients.
Is there a way to guarantee the connection is still open before trying to send a new push notification?
I'm experiencing this error after migrating our AWS VPC from an EC2 instance used as a NAT to the AWS NAT Gateway. With this gem, there's an implicit expectation that every hop between your application and the destination of the request would keep a connection open indefinitely. However, this causes issues, for example, when you have a NAT that only maintains connections for up to 350 seconds.
I also noticed that setting an explicit timeout on the Connection seems to have no effect. It always timed out at 60 seconds no matter what I tried.
Instead of fighting with the underlying http2 client's code, I decided to implement an ExpiryTimedStack that tracks each connection's expiration time and culls expired connections, ensuring only valid, working connections are returned.
Below is the patch code I'm using to get around this issue. If someone wanted to submit a PR based on this, please be my guest. Otherwise, use at your own risk!
require "timeout"
require "connection_pool"
# A TimedStack that expires stored connections after a given duration.
class ConnectionPool::ExpiryTimedStack < ConnectionPool::TimedStack
# +expiry_duration+ is the number of seconds after which a connection is considered expired.
def initialize(size = 0, expiry_duration:, &block)
super(size, &block)
@expiry_duration = expiry_duration
end
# Override pop to handle the case where fetch_connection returns nil due to expiry.
def pop(timeout = 0.5, options = {})
options, timeout = timeout, 0.5 if Hash === timeout
timeout = options.fetch :timeout, timeout
deadline = current_time + timeout
@mutex.synchronize do
loop do
raise ConnectionPool::PoolShuttingDownError if @shutdown_block
if connection_stored?(options)
connection = fetch_connection(options)
# Only return if a valid connection is found.
return connection if connection
# If fetch_connection returned nil because of expiry, the pool was flushed.
# Continue the loop to attempt creating a new connection.
end
connection = try_create(options)
return connection if connection
to_wait = deadline - current_time
raise ConnectionPool::TimeoutError, "Waited #{timeout} sec" if to_wait <= 0
@resource.wait(@mutex, to_wait)
end
end
end
private
# Override to wrap the connection object with its expiry timestamp.
def store_connection(obj, options = nil)
expiry_time = current_time + @expiry_duration
@que.push({ object: obj, expiry: expiry_time })
end
# Override to check the expiry of the connection.
# If the popped connection is expired, clear the remaining pool and return nil.
def fetch_connection(options = nil)
while [email protected]?
item = @que.pop
if current_time > item[:expiry]
# The connection has expired. Flush the rest of the pool.
shutdown_connections(options)
return nil
else
return item[:object]
end
end
nil
end
# Override shutdown to clean up our wrapped connection items.
def shutdown_connections(options = nil)
while [email protected]?
item = @que.pop
@shutdown_block.call(item[:object]) if @shutdown_block
end
@created = 0
end
end
# ExpiryConnectionPool uses ExpiryTimedStack to manage connections with an expiry.
class ExpiryConnectionPool < ConnectionPool
  DEFAULTS = { size: 5, timeout: 5, expiry_duration: 300 }

  def initialize(options = {}, &block)
    raise ArgumentError, "Connection pool requires a block" unless block

    options = DEFAULTS.merge(options)
    @size = Integer(options.fetch(:size))
    @timeout = options.fetch(:timeout)
    expiry_duration = options.fetch(:expiry_duration)

    @available = ConnectionPool::ExpiryTimedStack.new(@size, expiry_duration: expiry_duration, &block)
    @key = :"pool-#{@available.object_id}"
    @key_count = :"pool-#{@available.object_id}-count"
  end
end
module Apnotic
  class ExpiryConnectionPool
    class << self
      def new(options = {}, pool_options = {})
        raise(LocalJumpError, "a block is needed when initializing an Apnotic::ConnectionPool") unless block_given?

        ::ExpiryConnectionPool.new(pool_options) do
          connection = Apnotic::Connection.new(options)
          yield(connection)
          connection
        end
      end

      def development(options = {}, pool_options = {})
        raise(LocalJumpError, "a block is needed when initializing an Apnotic::ConnectionPool") unless block_given?

        ::ExpiryConnectionPool.new(pool_options) do
          connection = Apnotic::Connection.development(options)
          yield(connection)
          connection
        end
      end
    end
  end
end
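For reference, this is roughly how I drop it in where I would otherwise use Apnotic::ConnectionPool.new (the cert path and values below are only illustrative; 300 seconds keeps connections comfortably under the NAT Gateway's 350-second idle limit):

APNOTIC_POOL = Apnotic::ExpiryConnectionPool.new(
  { cert_path: "config/certs/apns.pem" },          # illustrative credentials
  { size: 5, timeout: 5, expiry_duration: 300 }
) do |connection|
  connection.on(:error) { |exception| warn("APNs connection error: #{exception}") }
end

# Checkout works like a normal connection_pool; expired connections are culled
# by ExpiryTimedStack#fetch_connection and replaced on demand.
APNOTIC_POOL.with do |connection|
  response = connection.push(notification)
end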
We ran into a similar issue with connections breaking and causing failures when we tried to send a push. This is mainly caused by middleboxes like the AWS NAT Gateway or proxy servers marking the connection as inactive and dropping it. The result is a poor user experience: when you try to send a push, the connection is already broken, the push job fails, and it has to be retried.
Apple recommends reusing the connection for as long as possible:
Reuse a connection as long as possible. In most cases, you can reuse a connection for many hours to days. If your connection is mostly idle, you may send a HTTP2 PING frame after an hour of inactivity. Reusing a connection often results in less bandwidth and CPU consumption.
Based on this, I decided to send ping requests to all idle connections, in our case every 4 minutes, since the connections go through a proxy server that drops inactive connections after 5 minutes. This completely solved the broken connection issues in both our AWS environment and the environment that uses a proxy.
# in config/initializers/sidekiq.rb
ping_timer_task = Concurrent::TimerTask.new(execution_interval: Rails.configuration.x.apn_ping_interval) do
  # Ping all active APN connections to keep them alive
  PushNotifications::ApnConnectionManager.ping_all_connections
rescue => e
  Rails.logger.error("APN connections ping failed: #{e.message}")
  Rails.logger.error(e.backtrace.join("\n"))
end
ping_timer_task.execute
# In PushNotifications::ApnConnectionManager

# Ping all active APN connections to keep them alive and avoid proxy timeouts
# Uses HTTP/2 PING frames as recommended by Apple's APN documentation
def ping_all_connections
  if @pools.empty?
    Rails.logger.info("APN: No APN connections to ping.")
    return
  end

  @pools.each do |pool_key, pool|
    Rails.logger.info("APN: Pinging APN #{pool_key} ...")
    ping_connection_pool(pool, pool_key)
  end
end
# then loop through APN connections and ping idle connections
# remove the connection from the pool when ping fails
def ping_connection_pool(pool, pool_key)
  successes = 0
  errors = 0

  # @type [ConnectionPool::TimedStack]
  ts = pool.instance_variable_get(:@available)
  if ts.idle.zero?
    Rails.logger.info("APN: No idle APN connections to ping in #{pool_key} pool.")
    return
  end

  # reap method pops the connections that have been idle for apn_ping_interval
  # if the server is busy and all the connections in the pool are active,
  # we won't ping anything as we don't want to block important push notification
  # connections for ping.
  pool.reap(Rails.configuration.x.apn_ping_interval) do |conn|
    pong = ping_connection(conn)
    if pong
      successes += 1
      # Good connection - put it back in the pool
      ts.push(conn)
    else
      errors += 1
      Rails.logger.error("APN: Ping failed on #{conn}. Discarding bad connection ...")
      # Bad connection - don't put it back, just close it
      conn.close
    end
  end

  if successes.positive?
    Rails.logger.info("APN: Pinged #{successes} good idle connection(s), discarded #{errors} bad connection(s) in #{pool_key} pool")
  end
  # zero errors and successes mean reap didn't return any connection that has been idle for apn_ping_interval
  Rails.logger.info("APN: No idle connections in #{pool_key} pool") if errors.zero? && successes.zero?

  errors.zero?
end
# Send an HTTP/2 PING frame to keep the APN connection alive
# This is the recommended approach per Apple's APN documentation
# @param connection [Apnotic::Connection] Apnotic connection to ping
# @param timeout [Integer] Timeout in seconds.
# @see Concurrent::Concern::Obligation#value for more information about Timeout
# @return [Boolean] Returns whether ping was successful or not
def ping_connection(connection, timeout: 5)
  # Access the underlying HTTP/2 client through the net-http2 client
  # The connection.client is a NetHttp2::Client, which has access to the HTTP2::Client
  # @type [NetHttp2::Client]
  client = connection.instance_variable_get(:@client)
  raise "APN: Could not access HTTP/2 client for PING frame" unless client&.private_methods&.include?(:h2)

  # HTTP 2 ping method is async, we have to wait for the PONG before returning
  # We use concurrent-ruby Promise and IVar to achieve that
  promise = Concurrent::Promise.new do
    # @type [HTTP2::Client]
    http2_client = client.send(:h2)
    # We can wait for an IVar to be assigned
    result = Concurrent::IVar.new
    # Generate a random 8-byte payload for the PING frame
    payload = SecureRandom.random_bytes(8)
    # Send the PING frame with a callback to handle the PONG response
    http2_client.ping(payload) { |_pong_payload| result.set(true) }
    # This waits for the value of IVar to be assigned or times out and returns false
    result.value(timeout) || false
  end

  promise.execute.value(timeout) || false
end
If the gem maintainers @ostinelli and @benubois agree, I can open a PR to add a ping method to Apnotic::Connection that would simply delegate to the underlying NetHttp2::Client#ping. It might also be useful to document this in the README; most people running this on AWS will run into similar issues.
We would also need the connection_pool gem to provide access to the underlying ConnectionPool::TimedStack object, so we can return healthy, responsive connections back to it.
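To make the intent concrete, the delegation could look something like this. This is only a sketch: the method name is up for discussion, and until net-http2 exposes ping publicly it reaches into the private h2 client the same way ping_connection above does:

require "securerandom"

module Apnotic
  class Connection
    # Proposed: send an HTTP/2 PING frame on the underlying connection.
    # The optional block is invoked when the PONG arrives.
    def ping(payload = SecureRandom.random_bytes(8), &on_pong)
      http2 = @client.send(:h2) # private HTTP2::Client inside NetHttp2::Client
      http2.ping(payload, &on_pong)
    end
  end
end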
Interesting findings!
Exposing ping sounds fine to me, but the whole setup looks like it would require some configuration and potentially a separate process?
A solution that could repair the connection or require little to no configuration up front would be ideal, but I understand if that's not feasible. Did you try @Jason733i's connection-expiring patch above? I'd be curious whether that also fixes it for you and, if so, whether it could be a way to get a full solution integrated into apnotic.
Thank you.
Did you try @Jason733i's connection expiring patch above?
Using @Jason733i's ExpiryTimedStack is simpler, but it has these disadvantages:
- Depending on how often middle-boxes drop idle connections, you can end up with short-lived connections. In my case, there is a proxy server that drops idle connections after 5 minutes.
- Creating a new connection to the APNs service when there is no healthy connection available incurs a slight performance penalty and can impact the end-user experience while they are waiting for a push. Keeping the connection alive doesn't have this problem.
- Apple recommends keeping the connections alive as long as possible (for days) using PING.
the whole setup looks like it would require some configuration and potentially a separate process?
I'm using the main Sidekiq process and a new thread via Concurrent::TimerTask. It would be ideal if the gem could schedule this PING task and timer whenever we create a new connection pool.
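As a rough sketch of what I mean (every name here is hypothetical, and whether a background timer belongs in the gem at all is for the maintainers to decide):

require "concurrent"

module Apnotic
  # Hypothetical helper: build a pool and start a keepalive timer alongside it.
  def self.pool_with_keepalive(connection_options, pool_options = {}, ping_interval: 240, &setup)
    pool = Apnotic::ConnectionPool.new(connection_options, pool_options, &setup)
    keepalive = Concurrent::TimerTask.new(execution_interval: ping_interval) do
      # Reap idle connections and send HTTP/2 PINGs to them
      ping_idle_connections(pool) # hypothetical, see ping_connection_pool above
    rescue => e
      warn("APNs keepalive failed: #{e.message}")
    end
    keepalive.execute
    [pool, keepalive]
  end
end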
A solution that could repair the connection or require little to no configuration up front would be ideal.
That works, but if a middle-box drops the connection completely, what would repair mean? Creating a new connection? If that's the case, I'd prefer keeping the connection healthy using PING in the first place.
Ok! That all makes sense.
I think you should do what you think is best, and I'd be happy to accept anything that is better than what is there now. Priority-wise, I'm interested in:
- No required configuration (but it would make sense to expose configuration where necessary)
- No hard requirement on sidekiq or rails
Appreciate your work on this.