
go-ssb stops responding to Swift

Open mplorentz opened this issue 2 years ago • 5 comments

On release/1.2.0, if you leave the app running long enough, go-ssb will eventually stop returning from any function calls. I'm unsure whether this is a new issue. It usually takes several minutes to get into this state, which is longer than the average session time of about 2 minutes. But during the go-ssb migration, where we are trying to resync the whole feed, it is a big issue.

Here are the steps I take to reproduce this:

  1. Use a real iOS device, not the simulator
  2. Use the main network, not the test network
  3. Get a profile with around 100k messages
  4. Delete the Planetary app
  5. Install & launch 1.2.0
  6. Filter log messages to RefreshOperation
  7. When a RefreshOperation hasn't completed for more than 3 minutes, I consider go-ssb to be deadlocked.

I most often see ssbRepoStats() lock up first. I'm not sure whether this is just because we call it the most, or whether calling it triggers the deadlock. It's also interesting that go-ssb continues replicating messages even after this lock-up happens.

mplorentz avatar May 23 '22 17:05 mplorentz

@boreq in case it's helpful to you, I pushed up a branch, serialize-go-ssb-calls, which uses a lock in the Swift layer to serialize all calls from Swift to Go. It basically mimics the high-level locking that go-ssb does itself, but you can attach the debugger and see which threads are waiting on the lock, which might be helpful for debugging. I was using it to try to figure out whether ssbRepoStats() caused the deadlock every time, but I didn't get far enough to prove or disprove that hypothesis.

mplorentz avatar May 23 '22 21:05 mplorentz

It is ssbBotStatus that hangs. I have now confirmed this, and it is also what I saw in the past. It is probably unrelated to our code (it isn't caused by our global lock); it happens in go-ssb.

boreq avatar May 24 '22 14:05 boreq

The function locks up in the blob-related code: https://github.com/planetary-social/ssb/blob/9f5526e77c0112430562381f1ab5e6bc29f8dbc4/sbot/status.go#L24 The list of all wants can't be retrieved because a lock is already held. I am working on fixing this. This is consistent with my past findings.

boreq avatar May 24 '22 15:05 boreq

Currently I believe the problem is related to one of these functions blocking indefinitely:

https://github.com/planetary-social/ssb/blob/9f5526e77c0112430562381f1ab5e6bc29f8dbc4/blobstore/wants.go#L306-L307

boreq avatar May 24 '22 19:05 boreq

I have not seen this problem in a while. I am keeping the issue open so it stays on our radar, but I think the underlying problems were at least partially addressed.

boreq avatar May 31 '22 19:05 boreq

We fixed some of the causes, but there are other times when it still hangs. We won't fix those; we're waiting on scuttlego.

rabble avatar Aug 19 '22 00:08 rabble