Replication stops for one of the feeds

Open boreq opened this issue 2 years ago • 12 comments

The replication for identity @6ilZq3kN0F+dXFHAPjAwMm87JEb/VdB+LC9eIMW3sa0=.ed25519 gets stuck. The last message that can be replicated for this feed is %QRrL2wc9o4LoPb2We8FbZC7WlbsfBRM7CnLeguNmILw=.sha256.

boreq avatar May 10 '22 13:05 boreq

This could somehow be related to #520; the same feed is involved.

boreq avatar May 10 '22 13:05 boreq

Unfortunately the issue was reported to be intermittent, which may make it hard to reproduce.

boreq avatar May 10 '22 20:05 boreq

Could be related to #354.

boreq avatar May 10 '22 21:05 boreq

I think this is a situation very similar to #524. I believe that when the replication stops, the list of connected peers also stops updating, which suggests that maybe some lock isn't being released properly.

edit: ok, apparently replication sometimes stops randomly even while the connected peer list is still being updated, so there are most likely multiple bugs at play here.

boreq avatar May 12 '22 18:05 boreq

I am seeing this really consistently now. Steps to reproduce are:

  • Create a new Planetary account on the main net (version 1.2.0)
  • Redeem an invite to our test pub: test.planetary.pub:8008:@wV/CVmFzOOV9MhT2onkPFNaZ25UtJrW2oH0h26ddlQY=.ed25519~o8sO3ugLLWpcrCuUGvmeTSTLdE1nwX3fxmR3ur/VIas=
  • Follow any user
  • Delete Planetary
  • Relaunch
  • Exactly 11590 messages will replicate, then replication stops. No posts are ever displayed in the home or discover tabs. I see this in the logs:
ts="2022-05-12 21:43:31.3296130 (UTC)" module=graph event=error msg="recurse failed" err="recurseHops(1): couldnt estblish subfeeds for @QbPdbk9vbQ0TtyD3wBtzNLs/JpK7Nt4IGpeh72hNUIk=.ed25519: DB Closed"

mplorentz avatar May 12 '22 21:05 mplorentz

I am aware of this, @mplorentz, but so far I've been unable to explain why Badger seemingly just gets closed at some point.

boreq avatar May 13 '22 12:05 boreq

My steps to reproduce in the comment above are no longer valid. I think that flow was fixed in #554.

However I am still seeing the behavior described in the ticket description. I just checked the feed mentioned in this ticket, @6ilZq3kN0F+dXFHAPjAwMm87JEb/VdB+LC9eIMW3sa0=.ed25519, on my main device running release/1.2.0 and the latest post I see from them is from 5 years ago. It isn't %QRrL2wc9o4LoPb2We8FbZC7WlbsfBRM7CnLeguNmILw=.sha256 which is mentioned in the ticket description, but it is another message that is also 5 years old, %43k0U0uGU87AFvvjfRUpq0RFalA7BxWEd1TOe8m5v1g=.sha256.

It's still hard to say if we are really never going to replicate more messages for this profile or if they are just not being requested enough. And I'm not sure if this bug is bad enough that it needs to be fixed for the migration release. But it appears to still be a problem sometimes.

mplorentz avatar May 23 '22 20:05 mplorentz

I checked this feed again today, and it had more messages, but they were still like 4 or 5 years old. So I think maybe the problem is not that we never get their messages, but that it's just extremely slow. I wonder if we are grabbing 1000 messages for this feed, and then shuffling the want list and not getting any more for a long time?

mplorentz avatar May 25 '22 23:05 mplorentz

I did a lot of migration testing yesterday and today, and I'm fairly certain that as the number of feeds we want to replicate grows, replication slows down faster than linearly.

This seems consistent with my theory that this is caused by having a limited want list that we shuffle randomly. If you have 1000 feeds you want to fetch and you randomly select 10 to replicate every minute, then the closer you get to fetching all 1000 feeds, the lower the chance that the feeds you randomly select are the ones whose posts you still need.
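
To make that concrete, here is a minimal sketch of the effect. This is a toy model only, not go-ssb's actual want-list code; the 1000-feed want list and the 10-feeds-per-round draw are taken from the example above, and drawing with replacement is an assumption of the model.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const totalFeeds = 1000 // feeds on the want list
	const perRound = 10     // feeds drawn at random each round

	fetched := make([]bool, totalFeeds)
	done := 0
	milestones := []int{totalFeeds / 2, totalFeeds * 9 / 10, totalFeeds}

	for round := 1; done < totalFeeds; round++ {
		for i := 0; i < perRound; i++ {
			// Drawing with replacement: picking an already-fetched feed
			// models wasting a slot on a feed with nothing new for us.
			f := rand.Intn(totalFeeds)
			if !fetched[f] {
				fetched[f] = true
				done++
			}
		}
		for len(milestones) > 0 && done >= milestones[0] {
			fmt.Printf("%d/%d feeds fetched after %d rounds\n", done, totalFeeds, round)
			milestones = milestones[1:]
		}
	}
}
```

Running this, the first half of the feeds is typically covered in roughly 70 rounds, while the last handful of feeds takes several hundred more, which mirrors the slowdown described below.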

So on my test database of 15k messages I can do the migration in 2-3 minutes consistently. But on my 1.3 million message database it took 1.5 hours to do the first 50% of the migration, then another 2 hours to do the next 19%.
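
Back of the envelope: the first 50% of 1.3 million messages is roughly 650k messages in 1.5 hours, about 430k messages per hour, while the next 19% is roughly 250k messages in 2 hours, about 125k messages per hour, so the effective rate dropped by more than 3x as the migration progressed.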

I think we changed go-ssb's want list to behave this way in order to reduce the memory usage. Maybe it's better to accept the OOM crashes during the migration than to have it be so slow for large databases, I'm not sure 🤷‍♂️ .

mplorentz avatar Jun 01 '22 22:06 mplorentz

I did some testing removing the replication concurrency limit in #608 and it seems to be performing much better. It seems like some of our other changes must have reduced memory usage somewhat, because I'm not getting OOM crashes as quickly as before. This change is being tested on TestFlight now, and if it goes well I think we can close this ticket.
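
For context, a replication concurrency limit is typically just a semaphore around the per-feed work. The sketch below is a generic illustration with made-up names, not Planetary's or go-ssb's actual code; passing 0 for the limit corresponds to removing it.

```go
package main

import (
	"fmt"
	"sync"
)

// replicateFeed stands in for the real per-feed replication work.
func replicateFeed(feed string) {
	fmt.Println("replicating", feed)
}

// replicateAll fans replication out over the want list. maxConcurrent bounds
// how many feeds are replicated at once; 0 means no limit, which is roughly
// what removing the concurrency limit amounts to.
func replicateAll(wantList []string, maxConcurrent int) {
	var wg sync.WaitGroup
	var sem chan struct{}
	if maxConcurrent > 0 {
		sem = make(chan struct{}, maxConcurrent) // buffered channel as a semaphore
	}
	for _, feed := range wantList {
		wg.Add(1)
		if sem != nil {
			sem <- struct{}{} // block while maxConcurrent replications are in flight
		}
		go func(f string) {
			defer wg.Done()
			if sem != nil {
				defer func() { <-sem }()
			}
			replicateFeed(f)
		}(feed)
	}
	wg.Wait()
}

func main() {
	feeds := []string{"@a.ed25519", "@b.ed25519", "@c.ed25519"}
	replicateAll(feeds, 2) // bounded; replicateAll(feeds, 0) removes the limit
}
```

The trade-off is the one described above: the bound caps memory by keeping fewer feeds in flight, and removing it trades higher memory use for more parallel fetching.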

mplorentz avatar Jun 03 '22 13:06 mplorentz

We're going to wait and see if this is an issue after #640 is complete.

mplorentz avatar Jun 20 '22 20:06 mplorentz

Waiting on #721 (EBTs).

rabble avatar Aug 19 '22 00:08 rabble

We have EBTs now.

rabble avatar Oct 24 '22 20:10 rabble