amplify-swift icon indicating copy to clipboard operation
amplify-swift copied to clipboard

AppSync Subscription Disconnects After 2-3 Minutes (NSPOSIXErrorDomain Code 57)

Open priva-byte opened this issue 7 months ago • 19 comments

Describe the bug

AppSync GraphQL subscriptions established using Amplify.API.subscribe with a custom GraphQLRequest consistently disconnect after a specific period of inactivity (no messages received or sent over the subscription). We observe this reliably when the app is in the foreground and the subscription is active but idle. Our backend setup, involving a custom mutation handled by a Lambda function triggering a custom subscription, follows the patterns described in the Amplify Gen 2 documentation for this type of real-time flow.

The disconnection manifests as a failure in the Combine sink's receiveCompletion block, reporting APIError: Subscription item event failed with error caused by the underlying error Error Domain=NSPOSIXErrorDomain Code=57 "Socket is not connected". The timing of this disconnection appears highly consistent: it occurs almost exactly at 120 seconds (2 minutes) of inactivity, or, less frequently but still precisely, at exactly 180 seconds (3 minutes) of inactivity.

This issue occurs when using the latest stable Amplify Swift libraries, even with simplified subscription lifecycle code. Because the basic setup failed, we subsequently implemented custom client-side retry logic (attempting to re-establish the subscription upon detecting the Code 57 error), however, this retry mechanism itself failed to function correctly. While the retry logic successfully initiates a new connection attempt (indicated by logging the .connection(.connecting) state), this reconnection attempt subsequently fails silently. It never reaches the .connection(.connected) state, nor does the Combine sink's receiveCompletion handler get called with a success or failure for this second attempt. We frequently observed [AWSGraphQLSubscriptionTaskRunner] Failed to unsubscribe logs during these disconnection events, suggesting the SDK's internal state prevented a clean teardown and re-initialization of the subscription, rendering the retry logic ineffective. The SDK's built-in WebSocket keep-alive mechanism does not seem to prevent the initial, specific, timed disconnection. We believe network connectivity is generally good when this occurs. Restarting the app allows reconnection, but the idle disconnect happens again after the same exact interval.

Steps To Reproduce

  1. Configure an AWS AppSync backend with: A custom GraphQL mutation (e.g., publishMessage) handled by a Lambda function. A custom GraphQL subscription (e.g., subscribeToMessages) linked to the mutation via .for(a.ref('publishMessage')) in the schema definition (resource.ts).
  2. In the iOS client app (using Amplify Swift SDK): Configure Amplify with AWSAPIPlugin. Implement code to subscribe to the custom subscribeToMessages using Amplify.API.subscribe and a GraphQLRequest.
  3. Run the app and successfully establish the subscription (observe .connection(.connected) state is reached).
  4. Keep the app in the foreground and ensure the subscription remains active.
  5. Do not trigger the publishMessage mutation (i.e., ensure no events are sent over the subscription, allowing it to remain idle).
  6. Wait for approximately 2 to 3 minutes.
  7. Observe that the Combine sink's receiveCompletion handler associated with the subscription receives a .failure event containing the NSPOSIXErrorDomain Code=57 "Socket is not connected" error.

Expected behavior

The WebSocket connection for the AppSync GraphQL subscription should remain active indefinitely (as long as the client maintains the subscription) via the SDK's automatic keep-alive mechanisms, without disconnecting due to short periods (2-3 minutes) of inactivity.

Amplify Framework Version

2.46.1

Amplify Categories

API

Dependency manager

Swift PM

Swift version

5

CLI version

1.5.0

Xcode version

16.2.0

Is this a regression?

No

Platforms

iOS

OS Version

iOS 18.3

Device

iPhone16Pro

priva-byte avatar Apr 08 '25 09:04 priva-byte

Thank you for posting this - our team will take a look and provide updates here.

thisisabhash avatar Apr 08 '25 15:04 thisisabhash

@priva-byte Would you be able to share this implementation that you talked about in your description?

we subsequently implemented custom client-side retry logic

If you are trying to reuse subscription components, would you be able to fully tear down and recreate all subscription-related objects?

Meanwhile I will try to recreate this issue locally and get back to you.

harsh62 avatar Apr 09 '25 02:04 harsh62

@priva-byte Also created a small fix to handle the error gracefully. Hopefully you'll be able to resubscribe without any errors.

Would you be able to test with and validate if it fixes your issue? You can use the WIP branch in the attached PR.

harsh62 avatar Apr 09 '25 02:04 harsh62

@harsh62 Thanks for providing the test branch! We've integrated it and have feedback based on our testing.

First, we're consistently observing the subscription disconnect after approximately 2-3 minutes of inactivity (no messages sent/received), even when the app is in the foreground. Could you clarify if this is expected behavior from AppSync, perhaps due to an idle timeout in the service or underlying infrastructure?

Regarding the fix branch, the behavior after the initial disconnect is inconsistent:

Sometimes Success

On some occasions, after the connection drops (now reported as NSURLErrorDomain Code=-1005), the reconnection attempt initiated by our client-side logic succeeds, reaches .connected, and the app recovers.

// ... Disconnect with NSURLErrorDomain Code=-1005 ...
[2025-04-09 04:30:25 +0000] Amplify.API.subscribe call succeeded
// ... Sink established ...
[2025-04-09 04:30:27 +0000] Combine subscription connection state changed: connected
[2025-04-09 04:30:27 +0000] Detected successful reconnection

Sometimes Failure

However, on other attempts, even with the fix branch, we still encounter the original failure mode. The reconnection attempt starts, Amplify.API.subscribe succeeds synchronously, the sink reports .connecting, but then shortly after, the internal [RetryWithJitter] operation failed with error timeout occurs, and the connection never reaches .connected.

// ... Disconnect with NSURLErrorDomain Code=-1005 ...
[2025-04-09 04:33:27 +0000] Amplify.API.subscribe call succeeded
// ... Sink established ...
[2025-04-09 04:33:27 +0000] Combine subscription connection state changed: connecting
[RetryWithJitter] operation failed with error timeout, retrying(0)
// --- Never reaches .connected state ---

Further Inconsistency

We've observed sequences like: Successful Reconnect -> Disconnect -> Failed Reconnect (RetryWithJitter) -> Disconnect -> Successful Reconnect. The pattern isn't predictable.

Retry Logic

When the Combine sink for our subscription fails, our client logic schedules an asynchronous retry (if the subscription is still desired). This retry task first applies a brief delay, then ensures the previous Combine sink and Amplify sequence are cancelled/cleaned up. Finally, it calls Amplify.API.subscribe again to create a new sequence and sink. The eventual connection status (.connected or further failure) is then handled asynchronously by this new sink. Our implementation ensures that for each retry attempt, the previous Combine sink and Amplify sequence are explicitly cancelled before Amplify.API.subscribe is called again to generate entirely new subscription resources, rather than attempting to reuse any failed components.

The behavior remains inconsistent (sometimes reconnects, sometimes fails with RetryWithJitter). We are also curious if the 2-3 minute idle disconnect is expected AppSync behavior.

Thanks for looking into this!

priva-byte avatar Apr 09 '25 05:04 priva-byte

@priva-byte Would you be able to share code snippets so that I can try to recreate this locally?

harsh62 avatar Apr 09 '25 05:04 harsh62

@harsh62 Thanks for looking into this. below is a simplified version of our MessageRepository focusing only on the subscription setup and retry logic:

public struct PubSubMessage: Embeddable {
  var channelId: String
  var messageId: String
}

struct SubscribeToMessages: Codable {
  let subscribeToMessages: PubSubMessage?
}

extension SubscribeToMessages {
  static var document: String {
    """
    subscription SubscribeToMessages($channelIds: [String!]!) {
      subscribeToMessages(channelIds: $channelIds) {
        channelId
        messageId
      }
    }
    """
  }
}

enum RetryUtility {
  private static let logger = Logger(subsystem: "com.our.app", category: "Retry")

  static func executeWithRetry<T>(
    maxRetries: Int = 3,
    maxBackoffMs: UInt64 = 5000,
    operationName: String,
    shouldRetry: ((Error) -> Bool)? = nil,
    operation: () async throws -> T
  ) async throws -> T {
    var retryCount = 0
    while true {
      do {
        return try await operation()
      } catch {
        let canRetry = shouldRetry?(error) ?? true
        retryCount += 1
        logger.error(
          "Failed \(operationName) (attempt \(retryCount)/\(maxRetries)), Error: \(error)")
        if !canRetry || retryCount >= maxRetries {
          logger.error("Max retries or non-retryable error for \(operationName)")
          throw error
        }

        let baseBackoff = min(200 * pow(2.0, Double(retryCount)), Double(maxBackoffMs))
        let jitter = Double.random(in: 0..<300)
        let backoffMs = UInt64(baseBackoff + jitter)
        logger.debug("Retrying \(operationName) in \(backoffMs)ms...")
        try await Task.sleep(nanoseconds: backoffMs * 1_000_000)
      }
    }
  }
}

import AWSAPIPlugin
import AWSPluginsCore
import Amplify
import Combine
import Foundation
import os.log

struct ReconnectionAbortedError: Error {}

class MessageRepository {
  private var subscriptionSequence:
    AmplifyAsyncThrowingSequence<
      GraphQLSubscriptionEvent<SubscribeToMessages>
    >?
  private var subscriptionCancellable: AnyCancellable?
  private var lastAttemptedChannelIds: [String] = []
  private var reconnectionAttempt: Int = 0

  private enum ConnectionState {
    case disconnected
    case connecting
    case connected
  }
  private var connectionState: ConnectionState = .disconnected

  init() {}

  func subscribeToMessages(channelIds: [String]) throws {
    print(
      "[\(Date())] Requesting new subscription for channels: \(channelIds)."
    )
    self.lastAttemptedChannelIds = channelIds
    cancelSubscriptionInternal(scheduleRetry: false)

    guard !channelIds.isEmpty else {
      print("[\(Date())] No channel IDs provided.")
      return
    }

    print(
      "[\(Date())] Starting subscription attempt #\(reconnectionAttempt + 1) for \(channelIds)"
    )

    print("[\(Date())] Attempting Amplify.API.subscribe...")
    let sequence:
      AmplifyAsyncThrowingSequence<
        GraphQLSubscriptionEvent<SubscribeToMessages>
      >

    sequence = Amplify.API.subscribe(
      request: GraphQLRequest<SubscribeToMessages>(
        document: SubscribeToMessages.document,
        variables: ["channelIds": channelIds],
        responseType: SubscribeToMessages.self
      )
    )

    print(
      "[\(Date())] Amplify.API.subscribe call succeeded (sequence created)."
    )

    self.subscriptionSequence = sequence
    self.updateConnectionState(.connecting)

    self.subscriptionCancellable = Amplify.Publisher.create(sequence)
      .sink(
        receiveCompletion: { [weak self] completion in
          guard let self = self else { return }
          self.updateConnectionState(.disconnected)

          switch completion {
          case .finished:
            print(
              "[\(Date())] Combine subscription finished normally"
            )
          case .failure(let error):
            print(
              "[\(Date())] Combine subscription failed: \(error)"
            )

            if !(error is ReconnectionAbortedError) {
              self.scheduleReconnectionAttempt()
            }
          }

          var shouldScheduleRetry = false
          if case .failure(let error) = completion {
            shouldScheduleRetry = !(error is ReconnectionAbortedError)
          }
          self.cancelSubscriptionInternal(scheduleRetry: shouldScheduleRetry)
        },
        receiveValue: { [weak self] event in
          guard let self = self else { return }
          switch event {
          case .connection(let state):
            print(
              "[\(Date())] Connection state changed: \(state)")
            let newState: ConnectionState =
              (state == .connected) ? .connected : .connecting
            self.updateConnectionState(newState)

            if state == .connected {
              print("[\(Date())] Subscription connected")
              self.reconnectionAttempt = 0
              print(
                "[\(Date())] Resetting reconnection attempts")
            }
          case .data(let result):
            print(
              "[\(Date())] Received data event (content omitted)"
            )
          }
        }
      )
    print("[\(Date())] Combine sink established.")
  }

  func cancelSubscription() {
    print("[\(Date())] Cancelling subscription MANUALLY...")
    self.lastAttemptedChannelIds = []
    cancelSubscriptionInternal(scheduleRetry: false)
  }

  private func cancelSubscriptionInternal(scheduleRetry: Bool) {
    print(
      "[\(Date())] Performing internal cleanup. Schedule Retry: \(scheduleRetry)"
    )
    self.updateConnectionState(.disconnected)

    subscriptionCancellable?.cancel()
    subscriptionCancellable = nil

    let sequenceToCancel = self.subscriptionSequence
    Task {
      if sequenceToCancel != nil {
        self.subscriptionSequence = nil
        sequenceToCancel?.cancel()
        print(
          "[\(Date())] Amplify sequence cancellation requested.")
      }
    }

    if !scheduleRetry {
      print(
        "[\(Date())] Not scheduling retry. Clearing channel state."
      )
      self.lastAttemptedChannelIds = []
      self.reconnectionAttempt = 0
    }
    print("[\(Date())] Internal cleanup finished.")
  }

  private func scheduleReconnectionAttempt() {
    guard !lastAttemptedChannelIds.isEmpty else {
      print("[\(Date())] No channels to reconnect to. Aborting.")
      reconnectionAttempt = 0
      return
    }
    reconnectionAttempt += 1
    print(
      "[\(Date())] Scheduling reconnection attempt #\(reconnectionAttempt)"
    )
    Task { await _attemptReconnectionWithRetry() }
  }

  private func _attemptReconnectionWithRetry() async {
    let channelsToReconnect = self.lastAttemptedChannelIds
    let initialAttemptNumber = self.reconnectionAttempt
    guard !channelsToReconnect.isEmpty else { return }

    print(
      "[\(Date())] [Attempt #\(initialAttemptNumber)] Starting reconnection process"
    )

    do { try await Task.sleep(for: .seconds(1)) } catch { return }

    guard !self.lastAttemptedChannelIds.isEmpty else {
      print(
        "[\(Date())] [Attempt #\(initialAttemptNumber)] Aborted after delay."
      )
      self.reconnectionAttempt = 0
      return
    }

    do {
      try await RetryUtility.executeWithRetry(
        maxRetries: 4,
        maxBackoffMs: 5000,
        operationName: "SubscribeToMessages",
        shouldRetry: { error in !self.lastAttemptedChannelIds.isEmpty },
        operation: { [weak self] in
          guard let self = self else { throw CancellationError() }
          guard !self.lastAttemptedChannelIds.isEmpty else {
            throw ReconnectionAbortedError()
          }
          print(
            "[\(Date())] Attempting subscribe via RetryUtility..."
          )
          // The operation is to call the main subscribe function
          try self.subscribeToMessages(channelIds: channelsToReconnect)
          print(
            "[\(Date())] subscribeToMessages call completed within RetryUtility."
          )
        }
      )
      print(
        "[\(Date())] [Attempt #\(initialAttemptNumber)] RetryUtility completed successfully."
      )
    } catch is ReconnectionAbortedError {
      print(
        "[\(Date())] [Attempt #\(initialAttemptNumber)] Reconnection explicitly aborted."
      )
    } catch {
      print(
        "[\(Date())] [Attempt #\(initialAttemptNumber)] Failed after retries: \(error)"
      )
      self.updateConnectionState(.disconnected)
      self.reconnectionAttempt = 0
      self.lastAttemptedChannelIds = []  // Give up
      print(
        "[\(Date())] Cleared channels after final retry failure."
      )
    }
  }

  private func updateConnectionState(_ newState: ConnectionState) {
    if self.connectionState != newState {
      self.connectionState = newState
      print(
        "[\(Date())] Internal connection state updated to: \(newState)"
      )
    }
  }
}

priva-byte avatar Apr 09 '25 05:04 priva-byte

@priva-byte Thanks for the information. I will use your code snippet to try and recreate the issue locally.

harsh62 avatar Apr 10 '25 23:04 harsh62

This issue happens to me consistently when I have VPN enabled (NordVPN in particular, I don't know if other VPNs cause this problem or not). What I noticed is that AmplifyNetworkMonitor closure never triggers, so the initial value stays as .none, and when WebsocketClient calls await self?.networkMonitor.updateState(.offline) nothing happens, because it doesn't handle this case ((.none, .offline)):

WebsocketClient.swift:

switch stateChange {
        case (.online, .offline):
            log.debug("[WebSocketClient] NetworkMonitor - Device went offline")
            self.connection?.cancel(with: .invalid, reason: nil)
            self.subject.send(.disconnected(.invalid, nil))
        case (.offline, .online):
            log.debug("[WebSocketClient] NetworkMonitor - Device back online")
            await self.createConnectionAndRead()
        default:
            break
        }

almasios avatar Apr 29 '25 16:04 almasios

@almasios

Thanks for providing that information, I will try to see if I can repro the issue with VPN and the switch you provided.

harsh62 avatar Apr 29 '25 17:04 harsh62

@almasios I have a potential fix made in the attached PR. Would you be able to validate if this fixes your issue?

Branch to test: https://github.com/aws-amplify/amplify-swift/tree/subscription-failures (subscription-failures)

harsh62 avatar May 21 '25 16:05 harsh62

@harsh62 unfortunately this branch does not fix the issue, I still see it get stuck in a weird state where it doesn't reconnect. I attached some WebSocketClient logs, hopefully it helps.

WebSocketClient.log

almasios avatar May 21 '25 19:05 almasios

Thanks for that info. Investigating further.

harsh62 avatar May 22 '25 16:05 harsh62

@almasios @priva-byte Are you guys testing on a real device or the simulator? I have been testing this on a macOS app (and a real iOS device) and hasn't faced any problems.

I think the problem is with the simulator unable to detect the network state changes correctly and hence the AmplifyNetworkMonitor not delivering state changes correctly resulting in resubscription failures.

So just want to make sure, its not the simulator tripping us.

harsh62 avatar May 23 '25 03:05 harsh62

@almasios @priva-byte Are you guys testing on a real device or the simulator? I have been testing this on a macOS app (and a real iOS device) and hasn't faced any problems.

I think the problem is with the simulator unable to detect the network state changes correctly and hence the AmplifyNetworkMonitor not delivering state changes correctly resulting in resubscription failures.

So just want to make sure, its not the simulator tripping us.

I am testing on a real device, iPhone 15 Pro Max, iOS 18.4.1, with VPN connected.

almasios avatar May 23 '25 13:05 almasios

@almasios And you are only able to reproduce the issue when connecting and disconnecting to VPN? From the logs you shared earlier, it did reconnect successfully several times, did you do something different at the very end?

harsh62 avatar May 23 '25 15:05 harsh62

@almasios Would you be able to answer my last question?

harsh62 avatar Jun 02 '25 15:06 harsh62

@harsh62 I see it happening with or without VPN, its just that with VPN it happens consistently, it is much more difficult to reproduce without VPN.

In the logs that I attached earlier you can see that the WebSocket disconnected with error 57 at Timestamp: 2025-05-21 13:47:46.527328-05:00, and it didn't attempt to reconnect until Timestamp: 2025-05-21 13:48:48.679714-05:00, so it was sitting idle for a whole minute.

Only when the NetworkMonitor reported that it was back online did the WebSocket attempt to reconnect. The issue is that the device was online this whole time, it's just that the NetworkMonitor is not always reliably reporting connectivity state, which causes WebSocket to sit idle.

almasios avatar Jun 03 '25 14:06 almasios

The issue is that the device was online this whole time, it's just that the NetworkMonitor is not always reliably reporting connectivity state, which causes WebSocket to sit idle.

@almasios This is exactly what I have seen but only on the simulator. And I am unable to reproduce this on the device I have. Are you able to try this out on another iOS version or device to rule out an OS/Device specific issue.

harsh62 avatar Jun 03 '25 15:06 harsh62

@almasios I had one more customer having a similar issue as yours and they were able to verify the latest version of Amplify fixes their issue. Please give that a try and let me know if it solves your problem.

harsh62 avatar Jun 17 '25 02:06 harsh62

hi, I am suffered from the same issue with v2.50.0. In my case, when I run my app in debug mode from xcode, it was not reproduced. when I run it without sim card, it is not reproduced. I could reproduce only when I run the app with sim card (has cellular service) and wifi connected on my iPhone 16e and iPhone 12.

akiramur avatar Sep 17 '25 04:09 akiramur

Hi @akiramur thanks for the additional information. Someone one the team will need to have a look.

mattcreaser avatar Sep 22 '25 14:09 mattcreaser