apnotic icon indicating copy to clipboard operation
apnotic copied to clipboard

Error handling for broken connections

Open Linuus opened this issue 1 year ago • 6 comments

Hi!

We're trying out your gem to send VoIP notifications, using Sidekiq. We are having some issues though with broken connections.

At first we were raising an error in the connection.on(:error) {} callback, like this:

      Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
        connection.on(:error) do |exception|
          raise(PushNotification::Error, "Production APNs connection error: #{exception}")
        end
      end

That was a really bad idea since it crashed all of Sidekiq making it restart. We fixed this and now we're just reporting to our error service instead.

      Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
        connection.on(:error) do |exception|
          Sentry.capture_exception(exception)
        end
      end

Now, occasionally we get this error reported:

Errno::ECONNRESET: Connection reset by peer
  from openssl (3.2.0) lib/openssl/buffering.rb:211:in `sysread_nonblock'
  from openssl (3.2.0) lib/openssl/buffering.rb:211:in `read_nonblock'
  from net-http2 (0.18.5) lib/net-http2/client.rb:145:in `block in socket_loop'
  from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `loop'
  from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `socket_loop'
  from net-http2 (0.18.5) lib/net-http2/client.rb:114:in `block (2 levels) in ensure_open'

It's reported in the callback and then 60s later we get a timeout here:

    connection_pool(ios_voip_push_token).with do |connection|
      response = connection.push(apnotic_notification(notification, ios_voip_push_token))
      raise(TimeoutError) if response.nil?
      [...]
    end

I guess we can pass a shorter timeout to the push method to lower this timeout, since it seems fairly high.

Anyway, when this happened it started happening a lot. Almost all our pushes got this connection reset error. Our push jobs are not retried, but I don't think this would help either since the connections seems to not be "healed".

Could there be an issue where connections are stuck in a broken state? Or are we supposed to handle these errors differently?

Linuus avatar Apr 11 '24 12:04 Linuus

This happens for us a lot too. We are using push notifications to send print messages to iPads and the connection periodically is reset and we end up having to wait the timeout for it to fail before we can try it again. This is causing a poor experience for our clients.

Is there a way to guarantee the connection is still open before trying to send a new push notification?

avanrielly avatar Nov 05 '24 22:11 avanrielly

I'm experiencing this error after migrating our AWS VPC from an EC2 instance used as a NAT to the AWS NAT Gateway. With this gem, there's an implicit expectation that every hop between your application and the destination of the request would keep a connection open indefinitely. However, this causes issues, for example, when you have a NAT that only maintains connections for up to 350 seconds.

I also noticed that setting an explicit timeout on the Connection seems to have no affect. It alway timed out at 60 seconds no matter what I tried.

Instead of fighting with the underlying http2 client's code, I decided to implement an ExpiryTimedStack that tracks the expiration time of a connection and will cull the expired connections ensuring only valid working connections are returned.

Below is the patch code I'm using to get around this issue. If someone wanted to submit a PR based on this, please be my guest. Otherwise, use at your own risk!

require "timeout"
require "connection_pool"

# A TimedStack that expires stored connections after a given duration.
class ConnectionPool::ExpiryTimedStack < ConnectionPool::TimedStack
  # +expiry_duration+ is the number of seconds after which a connection is considered expired.
  def initialize(size = 0, expiry_duration:, &block)
    super(size, &block)
    @expiry_duration = expiry_duration
  end

  # Override pop to handle the case where fetch_connection returns nil due to expiry.
  def pop(timeout = 0.5, options = {})
    options, timeout = timeout, 0.5 if Hash === timeout
    timeout = options.fetch :timeout, timeout

    deadline = current_time + timeout
    @mutex.synchronize do
      loop do
        raise ConnectionPool::PoolShuttingDownError if @shutdown_block

        if connection_stored?(options)
          connection = fetch_connection(options)
          # Only return if a valid connection is found.
          return connection if connection
          # If fetch_connection returned nil because of expiry, the pool was flushed.
          # Continue the loop to attempt creating a new connection.
        end

        connection = try_create(options)
        return connection if connection

        to_wait = deadline - current_time
        raise ConnectionPool::TimeoutError, "Waited #{timeout} sec" if to_wait <= 0
        @resource.wait(@mutex, to_wait)
      end
    end
  end

  private

  # Override to wrap the connection object with its expiry timestamp.
  def store_connection(obj, options = nil)
    expiry_time = current_time + @expiry_duration
    @que.push({ object: obj, expiry: expiry_time })
  end

  # Override to check the expiry of the connection.
  # If the popped connection is expired, clear the remaining pool and return nil.
  def fetch_connection(options = nil)
    while [email protected]?
      item = @que.pop
      if current_time > item[:expiry]
        # The connection has expired. Flush the rest of the pool.
        shutdown_connections(options)
        return nil
      else
        return item[:object]
      end
    end
    nil
  end

  # Override shutdown to clean up our wrapped connection items.
  def shutdown_connections(options = nil)
    while [email protected]?
      item = @que.pop
      @shutdown_block.call(item[:object]) if @shutdown_block
    end
    @created = 0
  end
end

# ExpiryConnectionPool uses ExpiryTimedStack to manage connections with an expiry.
class ExpiryConnectionPool < ConnectionPool
  DEFAULTS = { size: 5, timeout: 5, expiry_duration: 300 }

  def initialize(options = {}, &block)
    raise ArgumentError, "Connection pool requires a block" unless block
    options = DEFAULTS.merge(options)

    @size = Integer(options.fetch(:size))
    @timeout = options.fetch(:timeout)
    expiry_duration = options.fetch(:expiry_duration)

    @available = ConnectionPool::ExpiryTimedStack.new(@size, expiry_duration: expiry_duration, &block)
    @key = :"pool-#{@available.object_id}"
    @key_count = :"pool-#{@available.object_id}-count"
  end
end

module Apnotic
  class ExpiryConnectionPool
    class << self
      def new(options = {}, pool_options = {})
        raise(LocalJumpError, "a block is needed when initializing an Apnotic::ConnectionPool") unless block_given?

        ::ExpiryConnectionPool.new(pool_options) do
          connection = Apnotic::Connection.new(options)
          yield(connection)
          connection
        end
      end

      def development(options = {}, pool_options = {})
        raise(LocalJumpError, "a block is needed when initializing an Apnotic::ConnectionPool") unless block_given?

        ::ExpiryConnectionPool.new(pool_options) do
          connection = Apnotic::Connection.development(options)
          yield(connection)
          connection
        end
      end
    end
  end
end

Jason733i avatar Mar 28 '25 00:03 Jason733i

We ran into a similar issue with connections breaking and causing failures when we tried to send a push. This is mainly caused by middleboxes like AWS Gateway or proxy servers marking the connection as inactive and dropping it. This results in a poor user experience because when you try to send a push, the connection is broken, causing the push job to fail and require a retry.

Apple recommends reusing the connection for as long as possible:

Reuse a connection as long as possible. In most cases, you can reuse a connection for many hours to days. If your connection is mostly idle, you may send a HTTP2 PING frame after an hour of inactivity. Reusing a connection often results in less bandwidth and CPU consumption.

Based on this I decided to send ping requests to all of idle connections, in our case every 4-minute since the connections go through a proxy server that drops inactive connections every 5 minutes. This completely solved the broken connection issues in both our AWS environment and the environment that was using a proxy.

# in config/initializers/sidekiq.rb
ping_timer_task = Concurrent::TimerTask.new(execution_interval: Rails.configuration.x.apn_ping_interval) do
  # Ping all active APN connections to keep them alive
  PushNotifications::ApnConnectionManager.ping_all_connections
rescue => e
  Rails.logger.error("APN connections ping failed: #{e.message}")
  Rails.logger.error(e.backtrace.join("\n"))
end
ping_timer_task.execute


# In PushNotifications::ApnConnectionManager

# Ping all active APN connections to keep them alive and avoid proxy timeouts
# Uses HTTP/2 PING frames as recommended by Apple's APN documentation
def ping_all_connections
  if @pools.empty?
    Rails.logger.info("APN: No APN connections to ping.")
    return
  end

  @pools.each do |pool_key, pool|
    Rails.logger.info("APN: Pinging APN #{pool_key} ...")
    ping_connection_pool(pool, pool_key)
  end
end


# then loop through APN connections and ping idle connections
# remove the connection from the pool when ping fails
def ping_connection_pool(pool, pool_key)
  successes = 0
  errors = 0
  # @type [ConnectionPool::TimedStack]
  ts = pool.instance_variable_get(:@available)
  if ts.idle.zero?
    Rails.logger.info("APN: No idle APN connections to ping in #{pool_key} pool.")
    return
  end

  # reap method pops the connections that have been idle for apn_ping_interval
  # if the server is busy and all the connections in the pool are active,
  # we won't ping anything as we don't want to block important push notification
  # connections for ping.
  pool.reap(Rails.configuration.x.apn_ping_interval) do |conn|
    pong = ping_connection(conn)
    if pong
      successes += 1
      # Good connection - put it back in the pool
      ts.push(conn)
    else
      errors += 1
      Rails.logger.error("APN: Ping failed on #{conn}. Discarding bad connection ...")
      # Bad connection - don't put it back, just close it
      conn.close
    end
  end

  if successes.positive?
    Rails.logger.info("APN: Pinged #{successes} good idle connection(s), discarded #{errors} bad connection(s) in #{pool_key} pool")
  end

  # zero errors and successes mean reap didn't return any connection that has been idle for apn_ping_interval
  Rails.logger.info("APN: No idle connections in #{pool_key} pool") if errors.zero? && successes.zero?

  errors.zero?
end

# Send an HTTP/2 PING frame to keep the APN connection alive
# This is the recommended approach per Apple's APN documentation
# @param connection [Apnotic::Connection] Apnotic connection to ping
# @param timeout [Integer] Timeout in seconds.
# @see Concurrent::Concern::Obligation#value for more information about Timeout
# @return [Boolean] Returns whether ping was successful or not
def ping_connection(connection, timeout: 5)
  # Access the underlying HTTP/2 client through the net-http2 client
  # The connection.client is a NetHttp2::Client, which has access to the HTTP2::Client
  # @type [NetHttp2::Client]
  client = connection.instance_variable_get(:@client)
  raise "APN: Could not access HTTP/2 client for PING frame" unless client&.private_methods&.include?(:h2)

  # HTTP 2 ping method is async, we have to wait for the PONG before returning
  # We use concurrent-ruby Promise and IVar to achieve that
  promise = Concurrent::Promise.new do
    # @type [HTTP2::Client]
    http2_client = client.send(:h2)
    # We can wait for an IVar to be assigned
    result = Concurrent::IVar.new

    # Generate a random 8-byte payload for the PING frame
    payload = SecureRandom.random_bytes(8)

    # Send the PING frame with a callback to handle the PONG response
    http2_client.ping(payload) { |_pong_payload| result.set(true) }

    # This waits for the value of IVar to be assigned or times out and returns false
    result.value(timeout) || false
  end

  promise.execute.value(timeout) || false
end

If the gem maintainers @ostinelli, @benubois agree, I can open a PR to add ping method to Apnotic::Connection which would simply delegate to the underlying NetHttp2::Client#ping. It might also be useful to document this in README, most people running this on AWS will run into similar issues.

We would also need the connection_pool gem to provide access to the underlying ConnectionPool::TimedStack object, so we can return healthy, responsive connections back to it.

mmalek-sa avatar Aug 23 '25 17:08 mmalek-sa

Interesting findings!

Exposing ping sounds fine to me, but the whole setup looks like it would require some configuration and potentially a separate process?

A solution that could repair the connection or require little to no configuration up front would be ideal, but I understand if that's not feasible. Did you try @Jason733i's connection expiring patch above? I would be curious if that also fixes it for you and if so could that be a way to have a full solution integrated into apnotic.

Thank you.

benubois avatar Aug 24 '25 15:08 benubois

Did you try @Jason733i's connection expiring patch above?

Using @Jason733i ExpiryTimedStack is simpler, but it has these disadvantages:

  • Depending on how often middle-boxes drop idle connections, you can end up with short-lived connections. In my case, there is a proxy server that drops idle connections after 5 minutes.
  • Creating a new connection to APN service when there is no healthy available connection has a slight performance penalty and can impact the end-user experience when they are waiting for a push. Keeping the connection alive doesn't have this problem.
  • Apple recommends keeping the connections alive as long as possible (for days) using PING.

the whole setup looks like it would require some configuration and potentially a separate process?

I'm using the main Sidekiq process and a new thread via Concurrent::TimerTask. It would be ideal if the gem could schedule this PING task and timer whenever we create a new connection pool.

A solution that could repair the connection or require little to no configuration up front would be ideal.

That works, but if a middle-box drops the connection completely, what would repair mean? Creating a new connection? If that's the case, I'd prefer keeping the connection healthy using PING in the first place.

mmalek-sa avatar Aug 25 '25 03:08 mmalek-sa

Ok! That all makes sense.

I think you should do what you think is best, and I'd be happy to accept anything that is better than what is there now. Priority wise I'm interested in:

  • No required configuration (but it would make sense to expose configuration where necessary)
  • No hard requirement on sidekiq or rails

Appreciate your work on this.

benubois avatar Aug 25 '25 04:08 benubois