planetary-ios
Replication stops for one of the feeds
Replication for identity @6ilZq3kN0F+dXFHAPjAwMm87JEb/VdB+LC9eIMW3sa0=.ed25519 gets stuck. The last message that can be replicated for this feed is %QRrL2wc9o4LoPb2We8FbZC7WlbsfBRM7CnLeguNmILw=.sha256.
This could be somehow related to #520, the same feed is involved.
Unfortunately the issue was reported to be intermittent, which may make it hard to reproduce.
Could be related to #354.
I think this is a situation very similar to #524. I believe that when replication stops, the list of connected peers also stops updating, which suggests that maybe some lock isn't being released properly.
edit: ok, apparently the replication sometimes stops randomly even though the connected peer list is still being updated, so there are most likely multiple bugs at play here.
I am seeing this really consistently now. Steps to reproduce are:
- Create a new Planetary account on the main net (version 1.2.0)
- Redeem an invite to our test pub:
test.planetary.pub:8008:@wV/CVmFzOOV9MhT2onkPFNaZ25UtJrW2oH0h26ddlQY=.ed25519~o8sO3ugLLWpcrCuUGvmeTSTLdE1nwX3fxmR3ur/VIas=
- Follow any user
- Delete Planetary
- Relaunch the app
- Exactly 11590 messages will replicate, then replication stops. No posts are ever displayed in the home or discover tabs. I see this in the logs:
ts="2022-05-12 21:43:31.3296130 (UTC)" module=graph event=error msg="recurse failed" err="recurseHops(1): couldnt estblish subfeeds for @QbPdbk9vbQ0TtyD3wBtzNLs/JpK7Nt4IGpeh72hNUIk=.ed25519: DB Closed"
I am aware of this, @mplorentz, but so far I've been unable to explain why Badger seemingly gets closed at some point.
My steps to reproduce in the comment above are no longer valid. I think that flow was fixed in #554.
However I am still seeing the behavior described in the ticket description. I just checked the feed mentioned in this ticket, @6ilZq3kN0F+dXFHAPjAwMm87JEb/VdB+LC9eIMW3sa0=.ed25519, on my main device running release/1.2.0, and the latest post I see from them is from 5 years ago. It isn't %QRrL2wc9o4LoPb2We8FbZC7WlbsfBRM7CnLeguNmILw=.sha256, which is mentioned in the ticket description, but another message that is also 5 years old, %43k0U0uGU87AFvvjfRUpq0RFalA7BxWEd1TOe8m5v1g=.sha256.
It's still hard to say whether we are really never going to replicate more messages for this profile or whether they are just not being requested often enough. And I'm not sure if this bug is bad enough that it needs to be fixed for the migration release. But it appears to still be a problem sometimes.
I checked this feed again today, and it had more messages, but they were still like 4 or 5 years old. So I think maybe the problem is not that we never get their messages, but that it's just extremely slow. I wonder if we are grabbing 1000 messages for this feed, and then shuffling the want list and not getting any more for a long time?
I did a lot of migration testing yesterday and today, and I'm fairly certain that as the number of feeds we want to replicate grows, our replication speed drops faster than it would if the procedure scaled linearly.
This seems consistent with my theory that this is caused by having a limited want list that we shuffle randomly. If you have 1000 feeds you want to fetch and you randomly select 10 to replicate every minute, then the closer you get to fetching all 1000 feeds the lower chance that the random feeds you select will be the ones whose posts you need.
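To make that intuition concrete, here is a rough simulation of the theory (this is not go-ssb's actual want-list code, and the numbers 1000 and 10 are the hypothetical ones from the comment above): each round we pick 10 feeds uniformly at random out of 1000 and mark whatever we hit as fetched, then compare how many rounds the first half of the feeds takes versus the second half.

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulate models a want list that picks `pick` feeds uniformly at random
// from `total` feeds each round. It returns how many rounds were needed to
// fetch the first half of the feeds and how many more for the second half.
func simulate(total, pick int) (firstHalf, secondHalf int) {
	fetched := make([]bool, total)
	remaining := total
	rounds := 0
	for remaining > 0 {
		rounds++
		for i := 0; i < pick; i++ {
			f := rand.Intn(total) // may land on an already-fetched feed
			if !fetched[f] {
				fetched[f] = true
				remaining--
			}
		}
		if firstHalf == 0 && remaining <= total/2 {
			firstHalf = rounds
		}
	}
	return firstHalf, rounds - firstHalf
}

func main() {
	a, b := simulate(1000, 10)
	fmt.Printf("rounds for first half: %d, rounds for second half: %d\n", a, b)
}
```

Running this, the second half consistently takes many times more rounds than the first half (it's essentially the coupon-collector problem), which matches the observed long tail.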
So on my test database of 15k messages I can do the migration in 2-3 minutes consistently. But on my 1.3 million message database it took 1.5 hours to do the first 50% of the migration, then another 2 hours to do the next 19%.
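Back-of-the-envelope arithmetic on those numbers (assuming the 1.3 million message total above) shows the throughput drop directly: the first 50% moved about 3.5× faster than the next 19%.

```go
package main

import "fmt"

func main() {
	const totalMsgs = 1_300_000.0
	firstRate := 0.50 * totalMsgs / 1.5  // first 50% in 1.5 hours
	secondRate := 0.19 * totalMsgs / 2.0 // next 19% in 2 hours
	fmt.Printf("first: %.0f msgs/h, next: %.0f msgs/h, slowdown: %.1fx\n",
		firstRate, secondRate, firstRate/secondRate)
}
```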
I think we changed go-ssb's want list to behave this way in order to reduce memory usage. Maybe it's better to accept the OOM crashes during the migration than to have it be so slow for large databases, I'm not sure 🤷‍♂️.
I did some testing removing the replication concurrency limit in #608 and it seems to be performing much better. It seems like some of our other changes must have reduced memory usage somewhat, because I'm not getting OOM crashes as quickly as before. This change is being tested on TestFlight now, and if it goes well I think we can close this ticket.
We're going to wait and see if this is an issue after #640 is complete.
Waiting on #721 (EBTs).
We have EBTs now.