
Action Cable: Client-initiated heartbeats

jeremy opened this issue 2 years ago • 2 comments

Tracking issue for a new feature we're extracting from Basecamp: Relying on clients to send heartbeats.

Motivation: Preventing defunct session cruft from building up on WebSockets load balancers / proxies (e.g. Nginx or F5 BIG-IP) during network failovers and anycast IP changes.

  • Basecamp runs in multiple datacenters for high availability. During failover events and network changes, existing Action Cable sessions are severed. Hence they stop receiving server-sent heartbeats and attempt a reconnect, now landing on the new network destination. All good, as expected.
  • On the server side, however, a proxy or load balancer sitting in front of the app can accumulate all these indeterminate maybe-dead NATted connections, eating a ton of memory and sockets. The proxy needs to ceremoniously tear down the suddenly-absent client TCP connection, but in the meantime it's getting heartbeat pings from the server side, suggesting the connection is alive and well, so it hangs onto all the connections until the client side is surely closed.

Shifting the heartbeat responsibility from the server → client to client → server neatly resolves this. When network routing changes, the client reconnects to the new destination and ceases heartbeats to the old one. The proxy at the old destination no longer sees client or server traffic on the connection, so it gracefully & expeditiously closes it out.

This feels like a strong default behavior as well, considering the client is already responsible for all other aspects of connection management.

Needs some care with backward and forward compatibility, anticipating that one, both, or neither heartbeating mechanisms could be in play during initial deployments or rollback therefrom.

Implementation:

  • Introduce native heartbeating to the connection monitor
  • Introduce a native ping/heartbeat message
  • Skip ActionCable::Server::Base#setup_heartbeat_timer when client-initiated heartbeats are enabled
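
The bullets above could fit together roughly like this (a minimal Ruby sketch; the class name, the config flag, and the `:started` placeholder are all illustrative, not real Action Cable APIs):

```ruby
# Hypothetical sketch: the server arms its own heartbeat timer only when
# client-initiated heartbeats are disabled. The real method in
# ActionCable::Server::Base starts an event-loop timer instead.
class HeartbeatServer
  BEAT_INTERVAL = 3 # seconds

  attr_reader :heartbeat_timer

  def initialize(client_initiated_heartbeats: false)
    @client_initiated_heartbeats = client_initiated_heartbeats
  end

  def setup_heartbeat_timer
    return if @client_initiated_heartbeats # clients will ping us instead
    @heartbeat_timer ||= :started # stands in for event_loop.timer(BEAT_INTERVAL) { ... }
  end
end
```

With the flag on, `setup_heartbeat_timer` becomes a no-op and the server never emits its own pings, which is the rollback-friendly switch the compatibility note above calls for.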

Example from Basecamp, implemented using a Cable channel to "drive" the connection monitor:

BC.cableReady(function() {
  const pingInterval = (ActionCable.ConnectionMonitor.staleThreshold * 1000) / 2 // 3 seconds

  BC.cableMonitor = BC.cable.subscriptions.create("MonitoringChannel", {
    initialized() {
      ({monitor: this.monitor} = this.consumer.connection)
      this.ping = this.ping.bind(this)
    },

    connected() {
      this.monitor.recordConnect()
      return this.ping()
    },

    received(data) {
      switch (data.action) {
        case "pong":
          return this.pong()
        case "pubsub_pong":
          return this.pong()
      }
    },

    ping() {
      this.perform("ping")
      return this.schedulePing()
    },

    pong() {
      this.monitor.recordPing()
      return this.schedulePing()
    },

    schedulePing() {
      clearTimeout(this.scheduledPing)
      this.scheduledPing = setTimeout(this.ping, pingInterval)
    }
  })
})

And the server-side channel that answers those pings:
require "securerandom"

# Send our own pings and answer monitoring questions.
#
# This gives us forward compat with Action Cable protocol changes,
# like pings changing from subscriptions to message types.
class MonitoringChannel < ApplicationCable::Channel
  def subscribed
    @subscription_uuid = SecureRandom.uuid
    @last_received_at = Time.now.to_f

    stream_for @subscription_uuid, ->(json) {
      instrument_pubsub_latency ActiveSupport::JSON.decode(json) do |message|
        transmit message
      end
    }
  end

  def ping
    transmit({ action: "pong" })
  end

  def pubsub_ping
    self.class.broadcast_to @subscription_uuid, action: "pubsub_pong", sent_at: Time.now.to_f
  end

  private
    def instrument_pubsub_latency(message)
      received_at = Time.now.to_f

      if sent_at = message.delete("sent_at")
        latency = received_at - sent_at.to_f
        message["latency_ms"] = ms(latency)

        ActiveSupport::Notifications.instrument :performance, measurement: "Chat.pubsub_delay", value: latency, action: :timing
      end

      message["period_ms"] = ms(received_at - @last_received_at)
      @last_received_at = received_at

      yield message if block_given?
      message
    end

    def ms(seconds)
      (1000 * seconds).round(2)
    end
end

jeremy avatar May 16 '22 20:05 jeremy

Hey @jeremy!

Thanks for the detailed explanation. That's an interesting problem to solve!

I've been trying to model what's going on at the networking (TCP) layer. Here are my thoughts.

a proxy or load balancer sitting in front of the app can accumulate all these indeterminate maybe-dead NATted connections ... so it hangs onto all the connections until the client side is surely closed.

Clients disconnect without a proper closing handshake (no FIN is sent). We can only detect the failure by trying to write some data (for example, by sending pings). That's where the TCP retransmission mechanism takes the stage: depending on the tcp_retries2 kernel setting, it could take minutes to detect a broken connection (here is a good article showing some numbers).

Shifting the heartbeat responsibility from the server → client to client → server neatly resolves this... The proxy at the old destination no longer sees client or server traffic on the connection, so it gracefully & expeditiously closes it out

How does "no traffic" lead to closing a connection? Is it a proxy-specific feature? (TCP doesn't care about the presence or absence of traffic.) In general, without any traffic, only the TCP keep-alive mechanism can detect a failure (and that could also take minutes, depending on the OS settings).
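
For reference, these are the per-socket knobs for the OS keep-alive mechanism mentioned above (Linux constant names; macOS/BSD name them differently, hence the guards):

```ruby
# Enabling and tuning TCP keep-alive on a single socket. The values shown
# (30s idle, 10s probe interval, 3 probes) are illustrative; the Linux
# system-wide defaults start probing only after two hours of idle time.
require "socket"

sock = Socket.new(:INET, :STREAM)
sock.setsockopt(:SOCKET, :KEEPALIVE, true)
if defined?(Socket::TCP_KEEPIDLE)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, 30)  # idle secs before first probe
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, 10) # secs between probes
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, 3)    # failed probes before reset
end
```

Since a proxy rarely lets you tune these per upstream connection, and the defaults are far too slow, this path is a dead end for the failover scenario.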


We cannot really rely on the OS-level broken-connection detection mechanisms (retransmissions, keep-alive), since the default settings don't fit our use case (and tuning them is not always possible).

Thus, we should think of an application-level heartbeat implementation that would let us track failures more quickly. I'd suggest considering enhancing the current PING functionality with a client-to-server PONG. How could this help? Whenever the server sends a PING message, it configures a read timeout* for the socket; if no message has been received before the deadline, we consider the connection broken and close it.

* We can implement the read timeout by adding #read_deadline information to the Stream object during PING writes (or any writes?) and checking it in the select loop.
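
That state machine is small enough to sketch (class and method names here are illustrative, not real Action Cable internals): writing a PING arms a deadline, any inbound client message clears it, and the select loop reaps sockets whose deadline has passed.

```ruby
# Hypothetical sketch of a per-socket read deadline for the PING/PONG scheme.
class MonitoredStream
  PONG_TIMEOUT = 6 # seconds; e.g. 2x a 3s ping interval

  attr_reader :read_deadline

  # Called right after a PING frame is written to the socket.
  def wrote_ping(now = monotonic_now)
    @read_deadline = now + PONG_TIMEOUT
  end

  # Called whenever any message arrives from the client.
  def received_message(_data = nil)
    @read_deadline = nil
  end

  # Checked on each pass of the select loop; true means close the socket.
  def broken?(now = monotonic_now)
    !read_deadline.nil? && now > read_deadline
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

The worst-case detection time becomes ping interval + PONG_TIMEOUT (seconds), instead of the minutes the kernel mechanisms need.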

P.S. The server-client PING-PONG + read deadline approach is used by, for example, Socket.IO and Centrifugo.

palkan avatar May 23 '22 23:05 palkan

This issue has been automatically marked as stale because it has not been commented on for at least three months. The resources of the Rails team are limited, and so we are asking for your help. If you can still reproduce this error on the 7-0-stable branch or on main, please reply with all of the information you have about it in order to keep the issue open. Thank you for all your contributions.

rails-bot[bot] avatar Aug 22 '22 00:08 rails-bot[bot]