apollo-ios icon indicating copy to clipboard operation
apollo-ios copied to clipboard

Potential Deadlock

Open tunds opened this issue 8 months ago • 11 comments

Summary

I've noticed that we have some scenarios when using the Apollo GraphQL subscriptions can deadlock when attempting to reconnect due to network disruptions. We have noticed this can happen quite often when we are using our internal VPN (GlobalProtect) if there is a disconnect and the subscriptions attempt to reconnect the application will freeze.

When inspecting what is going on the main thread this function seems to be potentially causing the app to hang.

  public func connect() {
    serialQueue.sync {
      guard !self.isConnecting else { return }
      self.didDisconnect = false
      self.isConnecting = true
      self.createHTTPRequest()
    }
  }

This all starts from the attemptReconnectionIfDesired function.

Version

1.17.0

Steps to reproduce the behavior

  1. Run app on a simulator to real device
  2. Cause a disconnect i.e. from VPN moving between networks
  3. Cause SDK to attempt a reconnect
  4. Observe that the reconnect causes a hang

Logs


Anything else?

Image

tunds avatar Apr 22 '25 19:04 tunds

Hi @tunds - I think you're experiencing deadlocks in the websocket. We've released several versions recently to try mitigate the bug; 1.15.3, 1.16.1, and 1.18.0.

Since you're on 1.17.0, I recommend updating to 1.18.0 and monitoring it. We've seen improved stability of the websocket since that release. For now I'm going to close this issue since I believe we have the issue resolved. If however, after updating to 1.18.0, you still notice the issue please comment back here and we can re-open and investigate.

calvincestari avatar Apr 22 '25 20:04 calvincestari

Do you have any feedback for the maintainers? Please tell us by taking a one-minute survey. Your responses will help us understand Apollo iOS usage and allow us to serve you better.

github-actions[bot] avatar Apr 22 '25 20:04 github-actions[bot]

Hi @calvincestari,

Unfortunately we're going to have to reopen this issue. We've updated to 1.18.0 and we're still seeing the issue that I have posted above.

tunds avatar Apr 25 '25 13:04 tunds

@tunds - that's fine. You'll need to try put together something that can reproduce the issue though. This has been an incredibly difficult issue to reliably reproduce and the sample app we had no longer fails our test cases. If you have suggested changes for the websocket code we'll gladly review it.

calvincestari avatar Apr 25 '25 16:04 calvincestari

Hi @calvincestari,

Yh I can imagine this isn't an easy one to fix at all since it's not really obvious. One thing I will point out is it seems to occur quite often when we're using our company VPN and a proxying tool i.e. Proxyman. Sometimes this VPN can disconnect and timeout causing the websocket to auto retry connecting in the background, whilst using the proxying tool.

So I wonder if the best way to try to reproduce this is by causing the system to change IP very frequently in a short amount of time which would cause this to reconnect code to fire rapidly possibly?

It's more the steps for this scenario that need to be taken here.

tunds avatar Apr 25 '25 16:04 tunds

@tunds, we have another release coming out later today that includes a change for websocket deadlocks. Not sure if it'll fix your issue but it is in the same code and would be worth a try.

calvincestari avatar Apr 29 '25 20:04 calvincestari

@calvincestari Awesome news!!!, Will update you. Happy for you to close this issue and then I can leave a comment for you to reopen if we find the issue again. Unless you want to keep it open to monitor this?

tunds avatar Apr 30 '25 05:04 tunds

We can leave this one open for now.

calvincestari avatar Apr 30 '25 12:04 calvincestari

I have seen the same issue on 1.21.0 again unfortunately:(

tunds avatar May 02 '25 08:05 tunds

We're going to be re-writing the entire web socket layer for 2.0 in the coming months. We've spent a lot of time chasing down race conditions in the existing web socket layer in the past few months, but in order to completely remove these race conditions and dead locks, a rewrite is really necessary.

If you'd like to take a crack at a PR to address this, we're happy to take a look, but we likely won't be investing much more time into tracking these down and instead will focus our efforts on the rewrite.

AnthonyMDev avatar May 09 '25 17:05 AnthonyMDev

No worries, appreciate the effort by the team to look into this. Looking forward to 2.0.0 and the great things you bring 🫡

tunds avatar May 09 '25 17:05 tunds