
tasks: run instance check every 10min but only for outdated instances

Open phiresky opened this issue 2 years ago • 20 comments

Instead of running the instances update only every 24h, this runs it every 10 minutes - but it only runs it for instances that have not been updated in > 24h.

This should fix the issue of instances being marked dead when the server does not run continuously for 24h+.
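As a rough illustration (not the PR's actual code), the selection described above could be sketched as follows. The `Instance` struct, the `updated_secs` field (standing in for the `updated` column, as Unix seconds), and the `instances_due` function are hypothetical names chosen for this sketch; it assumes the job is simply invoked every 10 minutes by some scheduler.

```rust
use std::time::Duration;

/// Hypothetical row type; `updated_secs` stands in for the `updated`
/// column as Unix seconds (None = never successfully checked).
#[derive(Debug, Clone, PartialEq)]
struct Instance {
    domain: String,
    updated_secs: Option<u64>,
}

/// An instance is considered outdated once its last update is older than 24h.
const STALE_AFTER: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns the instances that a run of the 10-minute job would check:
/// only those not updated within the last 24 hours.
fn instances_due(instances: &[Instance], now_secs: u64) -> Vec<&Instance> {
    instances
        .iter()
        .filter(|i| match i.updated_secs {
            // Never updated: always due.
            None => true,
            // Due only if the last update is more than 24h in the past.
            Some(t) => now_secs.saturating_sub(t) > STALE_AFTER.as_secs(),
        })
        .collect()
}
```

With this filter alone, an instance that never answers stays outdated and is selected again on every run.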

Should fix #4039

phiresky avatar Dec 26 '23 17:12 phiresky

Lemmy.ml has over 5000 instances that match these criteria, so there would be 5000 outgoing requests every 10 minutes. That definitely won't work. It would be better if the check interval increased the longer an instance is unreachable. E.g. during the first week check every hour, after a week only once a day, later once a week, and so on.
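A minimal sketch of such a growing check interval, assuming the tiers named in the comment. The function name `check_interval` and the 30-day cutoff for switching to weekly checks are assumptions made for illustration; the comment only says "later".

```rust
use std::time::Duration;

const HOUR: u64 = 60 * 60;
const DAY: u64 = 24 * HOUR;
const WEEK: u64 = 7 * DAY;
/// Assumption: the comment only says "later"; 30 days is an arbitrary cutoff.
const WEEKLY_AFTER: u64 = 30 * DAY;

/// Maps how long an instance has been unreachable to how often it is checked.
fn check_interval(unreachable_for: Duration) -> Duration {
    let secs = unreachable_for.as_secs();
    if secs < WEEK {
        // First week: check every hour.
        Duration::from_secs(HOUR)
    } else if secs < WEEKLY_AFTER {
        // After a week: check once a day.
        Duration::from_secs(DAY)
    } else {
        // Later (assumed: after 30 days): check once a week.
        Duration::from_secs(WEEK)
    }
}
```

An instance would then be due for a check once the time since its last check attempt exceeds `check_interval(...)`, which requires tracking when the last attempt happened.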

Edit: It might also be worth using a longer timeout for these checks. Right now it's 10s, which might be too short if servers are overloaded.

Nutomic avatar Dec 27 '23 10:12 Nutomic

I think you misunderstood what I wrote; this PR doesn't increase the number of requests at all. In fact, it should distribute the load more evenly throughout the day.

phiresky avatar Dec 27 '23 10:12 phiresky

Ah, you're actually right: for instances that are dead this would increase the load, because they are never updated. It's "only" 2000 instances on lemmy.ml, but yes. This would need another column in that table that keeps track of "last_checked", which I think would be a good idea.

phiresky avatar Dec 27 '23 10:12 phiresky

Or instead of increasing the request timeout, it would be better to add caching for nodeinfo so it doesn't have to read from the database each time (similar to what I did recently for /api/v3/site).

> This would need another column in that table that keeps track of "last_checked", which I think would be a good idea.

That's exactly what the updated column is, so no need for a new column.

Nutomic avatar Dec 27 '23 14:12 Nutomic

> That's exactly what the updated column is, so no need for a new column.

Well no, because it's only set when the instance is alive, not when it's dead. So if you want to keep track of when it was last checked for being online, that's a separate variable.

Anyway, the question here isn't really how to reduce server load from the current state, but how to make the dead-marking process more robust.

phiresky avatar Dec 27 '23 19:12 phiresky

Ah, so basically a column for last_successful_check and another for last_failed_check. I guess that makes sense. The current server load is fine, but this PR would mean that each dead instance is checked 24 * 6 = 144 times per day instead of once, and that's too much.

Nutomic avatar Jan 02 '24 10:01 Nutomic

Might also be good to change the name of the updated column -> last_successful_check.

So to clarify, this is to make sure that living instances are checked more frequently than dead ones? On first reading, this makes it seem like dead instances should be checked more frequently than living ones, which doesn't make much sense to me.

dessalines avatar Jan 03 '24 15:01 dessalines

The goal of this change is to fix #4039 , not by increasing the check frequency at all actually, just by increasing the opportunities for the check to happen.

The current code is broken though, needs changes

phiresky avatar Jan 04 '24 00:01 phiresky

I'm thinking about also adding an explicit dead Boolean column instead of having it implicit. Thoughts?

phiresky avatar Jan 04 '24 00:01 phiresky

For what reason? Seems easy enough to check if a date is more than three days ago, and having the same info in multiple columns can easily lead to inconsistencies.

Nutomic avatar Jan 04 '24 09:01 Nutomic

My reason would purely be to defensively prevent federation from breaking when the update doesn't happen, since I think we still don't know exactly what causes #4039. The reports seem to have been conflicting: if it's just people having their instance down at midnight we can explain that, but people have also reported it happening when their instance was up consistently.

But yeah I guess you're right. Especially if we have two columns (last_seen and last_checked) then the dead filter can be last_checked > 2 days ago & last_seen < 3 days ago.
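One possible reading of that two-column filter, sketched under explicit assumptions: timestamps are Unix seconds, "last_checked > 2 days ago" means the last check attempt is more recent than two days ago, and "last_seen < 3 days ago" means the last successful contact is older than three days. The function name `is_dead` is hypothetical.

```rust
const DAY_SECS: u64 = 24 * 60 * 60;

/// Treats an instance as dead only if the check data is current
/// (a check ran within the last 2 days) AND the instance has not been
/// seen alive for more than 3 days. All arguments are Unix seconds.
fn is_dead(last_seen: u64, last_checked: u64, now: u64) -> bool {
    // The checker itself has run recently, so `last_seen` can be trusted.
    let checked_recently = last_checked > now.saturating_sub(2 * DAY_SECS);
    // The instance has not responded successfully for over 3 days.
    let unseen_for_long = last_seen < now.saturating_sub(3 * DAY_SECS);
    checked_recently && unseen_for_long
}
```

Under this reading, if the local server has not run its checks for a few days, no instance is marked dead merely because its `last_seen` value is old, which matches the defensive goal stated above.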

phiresky avatar Jan 04 '24 10:01 phiresky

It would also fail if the remote instance is down at midnight.

Nutomic avatar Jan 04 '24 11:01 Nutomic

Actually, just a thought: with the new federation queue we might not need the dead check at all. The queue already has an exponentially increasing delay and only sends one activity at a time anyway, so it doesn't really need the dead-instance filter.

What else do we use the instance check / stored software version for?

phiresky avatar Jan 05 '24 13:01 phiresky

I think otherwise it is mainly used for /api/v3/federated_instances. I think it would be possible to get rid of the existing dead instance check, if the federation queue gets its own logic to avoid spawning workers for unreachable instances.

Nutomic avatar Jan 05 '24 14:01 Nutomic

Right now it does spawn a task for every instance, but the workers don't do anything except sleep for hours or days at a time (which I think shouldn't affect resources much, if at all?)

phiresky avatar Jan 05 '24 15:01 phiresky

I don't have any hard data, but I would assume that 5000 idle tasks for dead instances would have some kind of impact, especially when Lemmy is started and each task reads federation queue state from the db.

Nutomic avatar Jan 05 '24 15:01 Nutomic

> Right now it does spawn a task for every instance but the workers don't do anything but sleep for hours / days at a time (which I think shouldn't affect resources much / any?)

It would use a lot of RAM

dullbananas avatar Jan 05 '24 21:01 dullbananas

> It would use a lot of RAM

I don't see why one of them would take more than 10 kB of RAM, which even with 5k instances would only be 50 MB of RAM.

phiresky avatar Jan 05 '24 22:01 phiresky

Also, many federation workers could be active at the same time if the delays line up, causing spikes in resource usage.

dullbananas avatar Jan 05 '24 22:01 dullbananas

Why not run it once on startup and then on a configurable schedule, defaulting to once every 24h?

MV-GH avatar Jan 05 '24 23:01 MV-GH

Gonna close this, I believe it's sufficiently handled by https://github.com/LemmyNet/lemmy/pull/4377

Nutomic avatar Mar 04 '24 10:03 Nutomic