[Bug]: Established instances (not new installs), Community join/subscribe "Pending" for some busy remote servers

Open RocketDerp opened this issue 1 year ago • 20 comments

Requirements

  • [X] Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • [X] Did you check to see if this issue already exists?
  • [X] Is this only a single bug? Do not put multiple bugs in one issue.
  • [ ] Is this a UI / front end issue? Use the lemmy-ui repo.

Summary

On Issue #2685 a few of us decided to open a new issue, as this is happening with established (not new install) 0.17.4 instances. Seemingly overloaded remote servers like lemmy.ml have this problem more consistently than less-busy servers.

This PostgreSQL query will identify how many of these pending subscribes you have to remote instances:

SELECT cf.person_id, p.name AS username, cf.community_id, c.name AS community, i.domain, cf.published
	FROM community_follower cf
	INNER JOIN person p ON p.id = cf.person_id
	INNER JOIN community c ON c.id = cf.community_id
	INNER JOIN instance i ON c.instance_id = i.id
	WHERE cf.pending = 't'
	ORDER BY cf.published;

You can cancel the "Pending" on the client and try again, which sometimes works if it is a different time of the day. But some servers, notably lemmy.ml, seem to consistently not accept the join. I speculate that this is a symptom of an overloaded server having some backend SQL problems.

FEATURE REQUEST TOO: I suggest that this query be put into an API call that admins can call and get a status of the problem from their server (and a screen on lemmy-ui admin to view the JSON output). Can we try to get this in before 0.18 release? End-users on multiple instances have mentioned this problem.
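
As a rough illustration of what such an admin status output could look like, here is a minimal sketch that takes the rows returned by the query above and produces a JSON summary per remote instance. The function name `summarize_pending` and the JSON shape (`total_pending`, `by_instance`) are hypothetical; nothing like this exists in the Lemmy API as far as this thread shows.

```python
import json
from collections import Counter


def summarize_pending(rows):
    """Summarize pending follows per remote instance as a JSON string.

    Each row is assumed to follow the column order of the query above:
    (person_id, username, community_id, community, domain, published).
    The output shape is hypothetical, not an existing Lemmy API response.
    """
    per_domain = Counter(row[4] for row in rows)
    summary = {
        "total_pending": len(rows),
        # most_common() orders instances from most to fewest stuck follows
        "by_instance": dict(per_domain.most_common()),
    }
    return json.dumps(summary)


# Made-up example rows, only to show the shape of the input
example_rows = [
    (1, "alice", 10, "asklemmy", "lemmy.ml", "2023-06-19 20:00:00"),
    (2, "bob", 11, "memes", "lemmy.ml", "2023-06-19 20:05:00"),
    (1, "alice", 12, "technology", "beehaw.org", "2023-06-19 20:10:00"),
]
print(summarize_pending(example_rows))
```

A large count for one domain would point at that remote server rather than at the local install.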

Lemmy discussion posting: https://lemmy.ml/post/1370450

Steps to Reproduce

  1. Go to an instance other than lemmy.ml
  2. Try to join several remote communities on lemmy.ml
  3. See pending that does not clear up in the normal 15 seconds or less that it takes

Technical Details

Variety of hosts

Observation: if your local instance is joining an instance that has never had a subscriber, it's interesting to note that many times the first handful of posts will appear, but the join/subscribe will still say "pending" and no new posts or comments will come in. But at least something talks to the remote server well enough to bring in those handful of posts. Example of this problem on Beehaw with a Lemmy.ml community - new comments are not flowing from Lemmy.ml to Beehaw: https://beehaw.org/c/[email protected] and also another stuck: https://lemmyrs.org/c/[email protected]

Version

BE 0.17.4

Lemmy Instance URL

lemmy.ml

RocketDerp avatar Jun 19 '23 21:06 RocketDerp

Today users in the wild are still reporting this problem: https://lemmy.ml/post/1411060

RocketDerp avatar Jun 21 '23 14:06 RocketDerp

lemmy.ml CONSISTENTLY has issues for me.

beehaw/lemmy.world, are hit or miss, but, will eventually work.

XtremeOwnageDotCom avatar Jun 21 '23 16:06 XtremeOwnageDotCom

lemmy.ml CONSISTENTLY has issues for me.

beehaw/lemmy.world, are hit or miss, but, will eventually work.

This has been the constant for me as well, can subscribe to most all with little issue except lemmy.ml . I have noticed that I am getting the lemmy.ml ones that are pending still showing up on the main.

SiskoUrso avatar Jun 21 '23 16:06 SiskoUrso

I haven't been able to complete federation with any other instances. I get an initial dump of posts without comments, as expected, but the communities stay as "Subscribe Pending" and nothing else changes. I've tried the unsub/resub countless times, but nothing.

DeltaTangoLima avatar Jun 21 '23 20:06 DeltaTangoLima

I haven't been able to complete federation with any other instances. I get an initial dump of posts without comments, as expected, but the communities stay as "Subscribe Pending" and nothing else changes.

Is your install new? We are treating that as a 'new install' problem. There have been many people following one of the 3 ways to install (Docker, Ansible, "From Scratch") who have run into one issue or another. Typically it is server owners installing with Docker and having issues with lemmyexternalproxy, or their hosting provider, firewall, or nginx proxy interfering. The programmers are trying not to treat "new install" problems as bugs and are asking that you use the support forums. There are closed bug reports you can search where it turned out to be a 'new install' problem, such as #2990 and #3167.

RocketDerp avatar Jun 21 '23 21:06 RocketDerp

Thanks for the reply @RocketDerp. I did see issues #2990 and #3167, but those were pretty much to do with people unwittingly blocking outbound access from their instances.

As I've been having this problem since 0.17.3, I wasn't sure if "new install" applied to me or if it was exclusively a 0.17.4 problem.

I'm running an ansible deployment behind Nginx Proxy Manager, and outbound access is definitely working. I can browse my own instance's content from outside the network, and websockets are all working just fine.

I simply can't subscribe to a remote community and, when I search for one of my local communities from another instance, I can see the request (and subsequent reply, via tcpdump) for /.well-known/webfinger but nothing else. The remote instance simply says 'No results'.

DeltaTangoLima avatar Jun 21 '23 21:06 DeltaTangoLima

By "new install" problem, I mean you can't get subscribe to work with any other Lemmy server. For example a community from a newer low-use (growing fast) lemmy server like: https://lemmyrs.org/communities or https://startrek.website/communities

The problem of Issue #3203 (you are here) is overloaded servers dropping subscribe requests, comments and postings (see issue #3101) on a steady basis with current load levels. Typically Beehaw, Lemmy.world, Lemmy.ml and other busy servers. A platform scalability issue. Normally users aren't even noticing the missing content described in issue #3101, but they do notice in the user interface these 'pending' subscribes. Frankly, as of today, a lot of server operators don't even seem to notice that issue #3101 is happening, and I'm seeing end-users mentioning it more often.

If you can't get any Community from any remote Lemmy server to subscribe, you likely have an install problem or network problem that isn't considered a 'bug in the code' - except for possibly an issue in the install documents needing revision: https://github.com/LemmyNet/lemmy-docs/issues

RocketDerp avatar Jun 21 '23 21:06 RocketDerp

Gotcha. Thanks. I'll keep hunting. In fairness, my gut feel is I've missed something important in my reverse proxy config.

DeltaTangoLima avatar Jun 21 '23 21:06 DeltaTangoLima

Likely related to this issue, my server log is showing incoming activity that has 'Header is Expired' on the HTTP. Lemmy federation logic is very aggressive in using a short time window in 0.17.4 - and it's entirely possible clock differences between servers and/or retry logic is causing failures. See issue https://github.com/LemmyNet/activitypub-federation-rust/issues/46
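
To make the clock-difference theory concrete, here is a minimal sketch of how a receiver with a fast clock can see a fresh request as expired. It is not the actual Lemmy or activitypub-federation-rust code; the function name `is_expired` and the `window_seconds` parameter are assumptions for illustration, and the real window used in 0.17.4 is not stated in this thread.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_expired(date_header: str, received_at: datetime, window_seconds: float) -> bool:
    """Return True if the request looks older than the allowed window.

    date_header: the HTTP Date header set by the sending server's clock.
    received_at: the time on the receiving server's clock (timezone-aware).
    window_seconds: hypothetical acceptance window, not Lemmy's real value.
    """
    sent_at = parsedate_to_datetime(date_header)
    age = (received_at - sent_at).total_seconds()
    return age > window_seconds


sent = "Wed, 21 Jun 2023 22:06:00 GMT"
arrival = datetime(2023, 6, 21, 22, 6, 5, tzinfo=timezone.utc)  # 5 s in transit

# Receiver clock is accurate: the request is 5 s old and inside a 10 s window.
print(is_expired(sent, arrival, window_seconds=10))

# Receiver clock runs 30 s fast: the same request now looks 35 s old.
print(is_expired(sent, arrival + timedelta(seconds=30), window_seconds=10))
```

The same effect would apply to a retry that reuses an old signature: the longer the delay, the more likely it falls outside a short window.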

RocketDerp avatar Jun 21 '23 22:06 RocketDerp

@DeltaTangoLima maybe https://github.com/LemmyNet/lemmy/issues/2685#issuecomment-1600675906 is helpful to you? It solved this problem on my instance.

arjan-s avatar Jun 22 '23 07:06 arjan-s

Thanks @arjan-s, I did manage to figure it out. There was some extra config required in the reverse proxy config, namely:

location / {
      set $proxpass http://[ui_host_ip]:1234;
      if ($http_accept ~ "^application/.*$") {
        set $proxpass http://[backend_host_ip]:8536;
      }
      if ($request_method = POST) {
        set $proxpass http://[backend_host_ip]:8536;
      }
      proxy_pass $proxpass;

      rewrite ^(.+)/+$ $1 permanent;

      # Send actual client IP upstream
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
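
For anyone trying to follow what that location block decides, here is a small Python sketch of the same routing rule: requests whose Accept header starts with `application/`, and all POST requests, go to the Lemmy backend, and everything else goes to lemmy-ui. The function name `choose_upstream` and the labels "backend" and "ui" are just for illustration.

```python
import re


def choose_upstream(method: str, accept: str = "") -> str:
    """Mirror the two `if` checks in the nginx location block above."""
    # Same pattern as the nginx check on $http_accept (case-sensitive)
    wants_api = re.match(r"^application/.*$", accept) is not None
    if wants_api or method == "POST":
        return "backend"  # port 8536 in the config above
    return "ui"  # port 1234 in the config above


# Federation fetch from another instance
print(choose_upstream("GET", "application/activity+json"))
# Browser page load (Accept starts with text/html)
print(choose_upstream("GET", "text/html,application/xhtml+xml"))
# Incoming activity delivered to an inbox
print(choose_upstream("POST"))
```

Without the two checks, federation requests land on the UI container instead of the backend, which would fit the symptom of seeing only the webfinger request and nothing after it.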

Happily federating everywhere (except lemmy.world - that instance is just getting smashed right now).

DeltaTangoLima avatar Jun 22 '23 07:06 DeltaTangoLima

This problem is still going. From lemmy.ml running 0.18.0 I tried to subscribe to SJW also running 0.18.0

About 1 minute before this comment, we really need to get the server logs out of lemmy.ml and see what is crashing!

image

RocketDerp avatar Jun 24 '23 20:06 RocketDerp

This seems to be happening on all instances, not just the busy ones. For example https://lemdit.com is not busy but subscribing to and from it does the same thing. I've tried several instances at random and they all exhibit this problem.

The logs aren't very conclusive, I think the relevant bit is:

Jun 25 11:01:30 lemdit.com lemmy_server[27525]: 2023-06-24T23:01:30.355877Z INFO HTTP request{http.method=POST http.scheme="https" http.host=lemdit.com http.target=/api/v3/community/follow otel.kind="server" request_id=b2f9c2da-743e-4058-a19b-1e8b339d74c3}:send:send_lemmy_activity: activitypub_federation::activity_queue: stats_fmt="Activity queue stats: pending: 1, running: 0, retries: 0, dead: 0, complete: 1"

So it looks like the request completes, but it still shows as pending also? Not sure how to interpret this.

delendum avatar Jun 24 '23 23:06 delendum

activitypub_federation::activity_queue: stats_fmt="Activity queue stats: pending: 1, running: 0, retries: 0, dead: 0, complete: 1"

I don't think that 'pending' means the same as the database 'pending' field. It's a different context of meaning.
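
If you want to watch these counters over time, here is a minimal sketch that pulls the numbers out of a log line like the one quoted above. It only assumes the "Activity queue stats:" text format shown in that log line; the function name `parse_queue_stats` is made up for this example.

```python
import re


def parse_queue_stats(line: str) -> dict[str, int]:
    """Extract the activity queue counters from one lemmy_server log line."""
    match = re.search(r'Activity queue stats: ([^"]*)', line)
    if match is None:
        return {}  # not a queue stats line
    pairs = re.findall(r"(\w+): (\d+)", match.group(1))
    return {name: int(value) for name, value in pairs}


log_line = (
    'activitypub_federation::activity_queue: stats_fmt="Activity queue stats: '
    'pending: 1, running: 0, retries: 0, dead: 0, complete: 1"'
)
stats = parse_queue_stats(log_line)
print(stats)
```

These counters describe the outgoing send queue on the local server, so they can look healthy even while the `pending` column in `community_follower` stays set.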

I speculate that the Join/Subscribe community logic is a two-way transaction between servers. I have seen it take 10 seconds elapsed until it would self-update from 'Pending' on the lemmy-ui webapp to 'Joined' (this was in 0.17.4, which had websockets doing dynamic updates to the UI). I think there are multiple points of failure that I've seen, where the local (virgin first user on community) instance grabs a handful of postings but no comments, then the 'Joined' may or may not happen, and I've even seen where the new comments and posts flow but 'Joined' never shows (though this doesn't happen very often).

I expect there are both https connection failures to peer servers from either origin and likely SQL transactions on the busy remote server causing problems.

The problem was obvious and going on long before I opened this issue. I figured the big admins of the big servers were seeing the problem. Then I realized that performance-related problems on the big servers are not being shared here on GitHub or in Lemmy forums... the constant nginx 500 crashes on servers all over weren't being opened as a bug.

In this issue (which is an assert/escalation from #2685 as to the importance of the problem), I recommended that 0.18.0 take data failures more seriously and put in the SQL query.... "FEATURE REQUEST TOO: I suggest that this query be put into an API call that admins can call and get a status of the problem from their server (and a screen on lemmy-ui admin to view the JSON output). Can we try to get this in before 0.18 release?"

I see servers being upgraded all month, throwing hardware at the problem. I also see recommendations to "go to less busy servers", when federation isn't reliably replicating data (and no open issue on GitHub)... and each new server that goes online adds to the load of the big servers in regard to replication of each and every comment and posting. What I don't see is anyone running the big servers sharing their internal logs like your comment just did.

How much LOUDER can I be contacting the ADMINS of the BIG SERVERS?

image

The logs aren't very conclusive

I think your send is just an INFO and it went through; it's the busy servers (the established ones with lots of data in their SQL tables) that are faulting. And I don't see issues being opened or logs being shared.

If this SQL statement had been added to the 0.18.0 release, or even run manually, I think the big servers would show tons of rows: large numbers of users who have stuck 'Pending' join/subscribe.

I've spent over 60 hours in the past 10 days going from server to server and seeing how much federation is falling over every single day, with nobody running these 'big' servers opening issues or sharing logs.

RocketDerp avatar Jun 24 '23 23:06 RocketDerp

Just now, at the time of this comment, lemmy.ml is having the problem to lemmy.world server. We really need to get log dumps out of these big servers and see what frequent errors the Rust code is logging...

image

RocketDerp avatar Jun 26 '23 15:06 RocketDerp

Just now, at the time of this comment, lemmy.ml is having the problem to lemmy.world server. We really need to get log dumps out of these big servers and see what frequent errors the Rust code is logging...

Keep in mind lemmy.world is still on "version": "0.17.4" while lemmy.ml has upgraded: "version": "0.18.0". This might just be fixed in 0.18.0.

tgxn avatar Jun 27 '23 04:06 tgxn

For perspective on the scaling side of this problem, and on how federation does real-time connects with a 10-second HTTP timeout, see issue #2180

RocketDerp avatar Jul 02 '23 11:07 RocketDerp

Things are far better with the performance fixes installed on Lemmy.world and Lemmy.ml - I'm inclined to close this issue since PostgreSQL is no longer constantly overloaded on any of the servers that have updated and things are working much more smoothly.

RocketDerp avatar Jul 06 '23 23:07 RocketDerp

I'm going on over a week with subscribe pending for every large community (0.18.2)

MagsMagnoli avatar Jul 14 '23 18:07 MagsMagnoli

Is it a new server install? Has any remote instance subscribe ever worked for you? There are a number of closed issues here with troubleshooting information about new installs.

RocketDerp avatar Jul 14 '23 18:07 RocketDerp

@RocketDerp I did the troubleshooting curl requests and the user and post ones worked. I don't have a local community so couldn't test that. I found a random instance with only a few users and the join worked within 30 seconds. I have unjoined / rejoined the bigger ones a couple of times

MagsMagnoli avatar Jul 16 '23 08:07 MagsMagnoli

I'm going on over a week with subscribe pending for every large community (0.18.2)

Subscribes are not retried automatically. You need to unsubscribe and resubscribe, then it may go through.
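
A minimal sketch of that unsubscribe/resubscribe sequence as two API calls. The path `/api/v3/community/follow` is the one visible in the log line earlier in this thread; the request body fields (`community_id`, `follow`, `auth`) are an assumption based on the 0.18-era API and should be checked against your version. The `post` callable is a hypothetical stand-in for whatever HTTP client you use.

```python
from typing import Callable

# Path as seen in the lemmy_server log line quoted earlier in this thread
FOLLOW_PATH = "/api/v3/community/follow"


def resubscribe(post: Callable[[str, dict], dict], community_id: int, auth: str) -> dict:
    """Cancel a stuck follow, then send a fresh follow request.

    post(path, body) is a hypothetical helper that sends a JSON POST to
    your own instance and returns the decoded response.
    """
    # Step 1: unsubscribe, which clears the stuck "Pending" state locally
    post(FOLLOW_PATH, {"community_id": community_id, "follow": False, "auth": auth})
    # Step 2: subscribe again, which sends a new follow to the remote server
    return post(FOLLOW_PATH, {"community_id": community_id, "follow": True, "auth": auth})


# Dry run with a fake transport that only records what would be sent
sent_requests = []


def fake_post(path: str, body: dict) -> dict:
    sent_requests.append((path, body))
    return {"recorded": True}


resubscribe(fake_post, community_id=42, auth="example-token")
for path, body in sent_requests:
    print(path, body["follow"])
```

As noted above, the second request may still not go through if the remote server is overloaded at that moment.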

Nutomic avatar Jul 20 '23 14:07 Nutomic

I'm going to close it because the performance problems specific to the June 2023 time period have largely passed, a new issue can be opened.

RocketDerp avatar Jul 20 '23 14:07 RocketDerp