[Bug]: Established instances (not new installs), Community join/subscribe "Pending" for some busy remote servers
Requirements
- [X] Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
- [X] Did you check to see if this issue already exists?
- [X] Is this only a single bug? Do not put multiple bugs in one issue.
- [ ] Is this a UI / front end issue? Use the lemmy-ui repo.
Summary
In Issue #2685 a few of us decided to open a new issue, as this is happening with established (not newly installed) 0.17.4 instances. Seemingly overloaded remote servers like lemmy.ml have this problem far more consistently than less-busy servers when you try to join/subscribe.
This PostgreSQL query will identify how many of these pending subscribes you have to remote instances:
SELECT person_id, p.name AS username, community_id, c.name AS community, i.domain, community_follower.published
FROM community_follower
INNER JOIN person p ON p.id = community_follower.person_id
INNER JOIN community c ON c.id = community_follower.community_id
INNER JOIN instance i ON c.instance_id = i.id
WHERE community_follower.pending = 't'
ORDER BY community_follower.published;
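To summarize per remote instance (which servers have the most stuck subscribes), a rough variant of the same query can be used. This is only a sketch against the same tables as above, not an official Lemmy query, so adjust as needed:

```sql
-- Count stuck "pending" follows per remote instance, with the oldest attempt shown
SELECT i.domain,
       count(*) AS pending_follows,
       min(community_follower.published) AS oldest_attempt
FROM community_follower
INNER JOIN community c ON c.id = community_follower.community_id
INNER JOIN instance i ON c.instance_id = i.id
WHERE community_follower.pending = 't'
GROUP BY i.domain
ORDER BY pending_follows DESC;
```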
You can cancel the "Pending" on the client and try again, which sometimes works if you try at a different time of day. But some servers, notably lemmy.ml, seem to consistently not accept the join. I speculate that this is a symptom of an overloaded server with backed-up SQL queries.
FEATURE REQUEST TOO: I suggest that this query be put into an API call that admins can use to get a status of the problem from their server (and a screen in the lemmy-ui admin interface to view the JSON output). Can we try to get this in before the 0.18 release? End-users on multiple instances have mentioned this problem.
Lemmy discussion posting: https://lemmy.ml/post/1370450
Steps to Reproduce
- Go to an instance other than lemmy.ml
- Try to join several remote communities on lemmy.ml
- See a "Pending" state that does not clear within the 15 seconds or less it normally takes
Technical Details
Variety of hosts
Observation: if your local instance joins a remote community that has never had a subscriber from your instance, it's interesting to note that many times the first handful of posts will appear, but the join/subscribe will still say "Pending" and no new posts or comments will come in. So at least something talks to the remote server well enough to bring in that handful of posts. Example of this problem on Beehaw with a Lemmy.ml community - new comments are not flowing from Lemmy.ml to Beehaw: https://beehaw.org/c/[email protected] and another stuck one: https://lemmyrs.org/c/[email protected]
Version
BE 0.17.4
Lemmy Instance URL
lemmy.ml
Today users in the wild are still reporting this problem: https://lemmy.ml/post/1411060
lemmy.ml CONSISTENTLY has issues for me.
beehaw/lemmy.world are hit or miss, but will eventually work.

> lemmy.ml CONSISTENTLY has issues for me.
> beehaw/lemmy.world are hit or miss, but will eventually work.
This has been constant for me as well; I can subscribe to almost everything with little issue except lemmy.ml. I have noticed that the lemmy.ml communities that are still pending do show up on the main feed.
I haven't been able to complete federation with any other instances. I get an initial dump of posts without comments, as expected, but the communities stay as "Subscribe Pending" and nothing else changes. I've tried the unsub/resub countless times, but nothing.
> I haven't been able to complete federation with any other instances. I get an initial dump of posts without comments, as expected, but the communities stay as "Subscribe Pending" and nothing else changes.
Is your install new? We are treating that as a 'new install' problem; there have been many people following one of the 3 ways to install (Docker, Ansible, "From Scratch") who have run into one issue or another. Typically it is server owners installing with Docker and having issues with lemmyexternalproxy, or their hosting provider interfering (firewall, nginx proxy). The programmers are trying not to treat "new install" problems as bugs and are asking that you use the support forums. There are closed bug reports you can search where it was a 'new install' problem, such as #2990 and #3167.
Thanks for the reply @RocketDerp. I did see issues #2990 and #3167, but those were pretty much to do with people unwittingly blocking outbound access from their instances.
As I've been having this problem since 0.17.3, I wasn't sure if "new install" applied to me or if it was exclusively a 0.17.4 problem.
I'm running an Ansible deployment behind Nginx Proxy Manager, and outbound access is definitely working. I can browse my own instance's content from outside the network, and websockets are all working just fine.
I simply can't subscribe to a remote community and, when I search for one of my local communities from another instance, I can see the request (and subsequent reply, via tcpdump) for /.well-known/webfinger but nothing else. The remote instance simply says 'No results'.
By "new install" problem, I mean you can't get subscribe to work with any other Lemmy server. For example a community from a newer low-use (growing fast) lemmy server like: https://lemmyrs.org/communities or https://startrek.website/communities
The problem of Issue #3203 (you are here) is overloaded servers dropping subscribe requests, comments and postings (see issue #3101) on a steady basis with current load levels. Typically Beehaw, Lemmy.world, Lemmy.ml and other busy servers. A platform scalability issue. Normally users aren't even noticing the missing content described in issue #3101, but they do notice in the user interface these 'pending' subscribes. Frankly, as of today, a lot of server operators don't even seem to notice that issue #3101 is happening, and I'm seeing end-users mentioning it more often.
If you can't get any Community from any remote Lemmy server to subscribe, you likely have an install problem or network problem that isn't considered a 'bug in the code' - except for possibly an issue in the install documents needing revision: https://github.com/LemmyNet/lemmy-docs/issues
Gotcha. Thanks. I'll keep hunting. In fairness, my gut feel is I've missed something important in my reverse proxy config.
Likely related to this issue, my server log is showing incoming activity failing with 'Header is Expired' on the HTTP. The Lemmy federation logic in 0.17.4 is very aggressive in using a short time window, and it's entirely possible that clock differences between servers and/or retry logic are causing failures. See issue https://github.com/LemmyNet/activitypub-federation-rust/issues/46
@DeltaTangoLima maybe https://github.com/LemmyNet/lemmy/issues/2685#issuecomment-1600675906 is helpful to you? It solved this problem on my instance.
Thanks @arjan-s, I did manage to figure it out. There was some extra config required in the reverse proxy config, namely:
location / {
    # Default: send requests to the lemmy-ui frontend
    set $proxpass http://[ui_host_ip]:1234;
    # ActivityPub / API requests (Accept: application/*) go to the Lemmy backend
    if ($http_accept ~ "^application/.*$") {
        set $proxpass http://[backend_host_ip]:8536;
    }
    # POST requests also go to the backend
    if ($request_method = POST) {
        set $proxpass http://[backend_host_ip]:8536;
    }
    proxy_pass $proxpass;
    rewrite ^(.+)/+$ $1 permanent;

    # Send actual client IP upstream
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
Happily federating everywhere (except lemmy.world - that instance is just getting smashed right now).
This problem is still ongoing. From lemmy.ml running 0.18.0 I tried to subscribe to SJW, also running 0.18.0, about 1 minute before this comment. We really need to get the server logs out of lemmy.ml and see what is crashing!
This seems to be happening on all instances, not just the busy ones. For example https://lemdit.com is not busy but subscribing to and from it does the same thing. I've tried several instances at random and they all exhibit this problem.
The logs aren't very conclusive, I think the relevant bit is:
Jun 25 11:01:30 lemdit.com lemmy_server[27525]: 2023-06-24T23:01:30.355877Z INFO HTTP request{http.method=POST http.scheme="https" http.host=lemdit.com http.target=/api/v3/community/follow otel.kind="server" request_id=b2f9c2da-743e-4058-a19b-1e8b339d74c3}:send:send_lemmy_activity: activitypub_federation::activity_queue: stats_fmt="Activity queue stats: pending: 1, running: 0, retries: 0, dead: 0, complete: 1"
So it looks like the request completes, but it still shows as pending also? Not sure how to interpret this.
> activitypub_federation::activity_queue: stats_fmt="Activity queue stats: pending: 1, running: 0, retries: 0, dead: 0, complete: 1"
I don't think that 'pending' means the same as the database 'pending' field. It's a different context of meaning.
I speculate that the Join/Subscribe community logic is a two-way transaction between servers. I have seen it take 10 seconds elapsed before it would self-update from 'Pending' to 'Joined' in the lemmy-ui webapp (this was in 0.17.4, which had websockets doing dynamic updates to the UI). I think there are multiple points of failure that I've seen: the local instance (virgin first subscriber to a community) grabs a handful of postings but no comments, then the 'Joined' may or may not happen, and I've even seen cases where new comments and posts flow but 'Joined' never shows (though this doesn't happen very often).
I expect there are both https connection failures to peer servers from either origin and likely SQL transactions on the busy remote server causing problems.
The problem was obvious and going on long before I opened this issue. I figured the admins of the big servers were seeing the problem. Then I realized that performance-related problems on the big servers are not being shared here on GitHub or in Lemmy forums... the constant nginx 500 crashes on servers all over weren't being opened as bugs.
In this issue (which is an escalation of #2685, asserting the importance of the problem), I recommended that 0.18.0 take data failures more seriously and put in the SQL query: "FEATURE REQUEST TOO: I suggest that this query be put into an API call that admins can call and get a status of the problem from their server (and a screen on lemmy-ui admin to view the JSON output). Can we try to get this in before 0.18 release?"
I see servers being upgraded all month, throwing hardware at the problem. I also see recommendations to "go to less busy servers", when federation isn't reliably replicating data (and no open issue on GitHub)... and each new server that goes online adds to the load of the big servers in regard to replication of each and every comment and posting. What I don't see is anyone running the big servers sharing their internal logs like your comment just did.
How much LOUDER can I be contacting the ADMINS of the BIG SERVERS?
> The logs aren't very conclusive
I think your send went through and is just an INFO line; it's the busy servers (the established ones with lots of data in their SQL tables) that are faulting. And I don't see issues being opened or logs being shared.
If this SQL statement had been added to the 0.18.0 release, or even run manually, I think the big servers would show tons of rows: large numbers of users with stuck 'Pending' join/subscribes.
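For example, even a one-line manual check against the same `community_follower` table used at the top of this issue would show the scale (a rough sketch, not an official Lemmy query):

```sql
-- Total number of follows still stuck in the "pending" state
SELECT count(*) FROM community_follower WHERE pending = 't';
```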
I've spent over 60 hours in the past 10 days going from server to server and seeing how much federation is falling over every single day, and nobody running these 'big' servers is opening issues or sharing logs.
Just now, at the time of this comment, lemmy.ml is having the problem to lemmy.world server. We really need to get log dumps out of these big servers and see what frequent errors the Rust code is logging...
> Just now, at the time of this comment, lemmy.ml is having the problem to lemmy.world server. We really need to get log dumps out of these big servers and see what frequent errors the Rust code is logging...
Keep in mind lemmy.world is still on "version": "0.17.4", while lemmy.ml has upgraded to "version": "0.18.0". This might just be fixed in 0.18.0.
For perspective on the scaling aspect of this problem and on how federation does real-time connections with a 10-second HTTP timeout, see issue #2180
Things are far better with the performance fixes installed on Lemmy.world and Lemmy.ml - I'm inclined to close this issue since PostgreSQL is no longer constantly overloaded on any of the servers that have updated and things are working much more smoothly.
I'm going on over a week with subscribe pending for every large community (0.18.2)
Is it a new server install? Has any remote instance subscribe ever worked for you? There are a number of closed issues here with troubleshooting information about new installs.
@RocketDerp I did the troubleshooting curl requests and the user and post ones worked. I don't have a local community so I couldn't test that. I found a random instance with only a few users and the join worked within 30 seconds. I have unjoined/rejoined the bigger ones a couple of times.
> I'm going on over a week with subscribe pending for every large community (0.18.2)
Subscribes are not retried automatically. You need to unsubscribe and resubscribe, then it may go through.
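If you want to see which follows are actually still stuck before unsubscribing and resubscribing, a query like this against the same tables as the one at the top of the issue will list them (a sketch; the one-hour cutoff is arbitrary):

```sql
-- List follows that have sat in "pending" for more than an hour
SELECT p.name AS username, c.name AS community, i.domain, community_follower.published
FROM community_follower
INNER JOIN person p ON p.id = community_follower.person_id
INNER JOIN community c ON c.id = community_follower.community_id
INNER JOIN instance i ON c.instance_id = i.id
WHERE community_follower.pending = 't'
  AND community_follower.published < now() - interval '1 hour'
ORDER BY community_follower.published;
```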
I'm going to close this because the performance problems specific to the June 2023 time period have largely passed; a new issue can be opened.