
[investigation/performance] Unexpectedly high RAM pressure when a post goes "viral"

Open trysdyn opened this issue 2 years ago • 19 comments

I'm filing this as much as a "story" or a "what do y'all think?" as a bug. Sorry for the weird report.

Version 0.6.0-rc1. I don't believe anything in rc2 or rc3 would impact this, but I plan to update soon just to be absolutely sure.

I have my GtS instance on system-level monitoring and have been seeing distressingly large spikes in GtS's RAM usage when a post gets carried beyond its usual reach. A couple of boosts that carry a post beyond normal reach are sufficient to cause a 200+ MB blip in GtS's usage for about a minute. In the case of a small instance hosted on a VPS, a post going slightly "viral" might be enough to bring the instance down.

Fun fact: I mentioned the gotosocial fedi account about this, and when they replied, my GtS instance did effectively crash under the ensuing load after GtS gnawed through about 500 MB and went well into my swap.

Here's my monitoring of available RAM on the box. This is on a bog-standard $5 VPS (1 GB RAM with another 1 GB of swap) using Postgres as a data store, which I think is the base SKU for anything targeting "small and self-hosted".

[gtsmem: monitoring graph of available RAM]

To be clear: it's GtS itself growing during these spikes, not postgres or anything else.

RAM usage is stable (with a small persistent increase) for days, except for those two large dips. The first occurred when a post I made got just a few boosts; the second occurred when the gotosocial fedi account replied to me.

Logs just show a lot of requests for the statuses in question and for my webfinger/profile, with some very unhappy warnings about DB query time. The slow-query log hits look like this:

timestamp="03/12/2022 22:02:27.489" func=bundb.queryHook.AfterQuery level=WARN duration="3.031443043s" query="SELECT \"account\".\"id\", \"account\".\"created_at\", \"account\".\"updated_at\", \"account\".\"username\", \"account\".\"domain\", \"account\".\"avatar_media_attachment_id\", \"account\".\"avatar_remote_url\", \"account\".\"header_media_attachment_id\", \"account\".\"header_remote_url\", \"account\".\"display_name\", \"account\".\"emojis\", \"account\".\"fields\", \"account\".\"note\", \"account\".\"note_raw\", \"account\".\"memorial\", \"account\".\"also_known_as\", \"account\".\"moved_to_account_id\", \"account\".\"bot\", \"account\".\"reason\", \"account\".\"locked\", \"account\".\"discoverable\", \"account\".\"privacy\", \"account\".\"sensitive\", \"account\".\"language\", \"account\".\"status_format\", \"account\".\"custom_css\", \"account\".\"uri\", \"account\".\"url\", \"account\".\"last_webfingered_at\", \"account\".\"inbox_uri\", \"account\".\"shared_inbox_uri\", \"account\".\"outbox_uri\", \"account\".\"following_uri\", \"account\".\"followers_uri\", \"account\".\"featured_collection_uri\", \"account\".\"actor_type\", \"account\".\"private_key\", \"account\".\"public_key\", \"account\".\"public_key_uri\", \"account\".\"sensitized_at\", \"account\".\"silenced_at\", \"account\".\"suspended_at\", \"account\".\"hide_collections\", \"account\".\"suspension_origin\", \"account\".\"enable_rss\", \"avatar_media_attachment\".\"id\" AS \"avatar_media_attachment__id\", \"avatar_media_attachment\".\"created_at\" AS \"avatar_media_attachment__created_at\", \"avatar_media_attachment\".\"updated_at\" AS \"avatar_media_attachment__updated_at\", \"avatar_media_attachment\".\"status_id\" AS \"avatar_media_attachment__status_id\", \"avatar_media_attachment\".\"url\" AS \"avatar_media_attachment__url\", \"avatar_media_attachment\".\"remote_url\" AS \"avatar_media_attachment__remote_url\", \"avatar_media_attachment\".\"type\" AS \"avatar_media_attachment__type\", \"avatar_media_attachment\".\"original_width\" AS \"avatar_media_attachment__original_width\", \"avatar_media_attachment\".\"original_height\" AS \"avatar_media_attachment__original_height\", \"avatar_media_attachment\".\"original_size\" AS \"avatar_media_attachment__original_size\", \"avatar_media_attachment\".\"original_aspect\" AS \"avatar_media_attachment__original_aspect\", \"avatar_media_attachment\".\"small_width\" AS \"avatar_media_attachment__small_width\", \"avatar_media_attachment\".\"small_height\" AS \"avatar_media_attachment__small_height\", \"avatar_media_attachment\".\"small_size\" AS \"avatar_media_attachment__small_size\", \"avatar_media_attachment\".\"small_aspect\" AS \"avatar_media_attachment__small_aspect\", \"avatar_media_attachment\".\"focus_x\" AS \"avatar_media_attachment__focus_x\", \"avatar_media_attachment\".\"focus_y\" AS \"avatar_media_attachment__focus_y\", \"avatar_media_attachment\".\"account_id\" AS \"avatar_media_attachment__account_id\", \"avatar_media_attachment\".\"description\" AS \"avatar_media_attachment__description\", \"avatar_media_attachment\".\"scheduled_status_id\" AS \"avatar_media_attachment__scheduled_status_id\", \"avatar_media_attachment\".\"blurhash\" AS \"avatar_media_attachment__blurhash\", \"avatar_media_attachment\".\"processing\" AS \"avatar_media_attachment__processing\", \"avatar_media_attachment\".\"file_path\" AS \"avatar_media_attachment__file_path\", \"avatar_media_attachment\".\"file_content_type\" AS \"avatar_media_attachment__file_content_type\", 
\"avatar_media_attachment\".\"file_file_size\" AS \"avatar_media_attachment__file_file_size\", \"avatar_media_attachment\".\"file_updated_at\" AS \"avatar_media_attachment__file_updated_at\", \"avatar_media_attachment\".\"thumbnail_path\" AS \"avatar_media_attachment__thumbnail_path\", \"avatar_media_attachment\".\"thumbnail_content_type\" AS \"avatar_media_attachment__thumbnail_content_type\", \"avatar_media_attachment\".\"thumbnail_file_size\" AS \"avatar_media_attachment__thumbnail_file_size\", \"avatar_media_attachment\".\"thumbnail_updated_at\" AS \"avatar_media_attachment__thumbnail_updated_at\", \"avatar_media_attachment\".\"thumbnail_url\" AS \"avatar_media_attachment__thumbnail_url\", \"avatar_media_attachment\".\"thumbnail_remote_url\" AS \"avatar_media_attachment__thumbnail_remote_url\", \"avatar_media_attachment\".\"avatar\" AS \"avatar_media_attachment__avatar\", \"avatar_media_attachment\".\"header\" AS \"avatar_media_attachment__header\", \"avatar_media_attachment\".\"cached\" AS \"avatar_media_attachment__cached\", \"header_media_attachment\".\"id\" AS \"header_media_attachment__id\", \"header_media_attachment\".\"created_at\" AS \"header_media_attachment__created_at\", \"header_media_attachment\".\"updated_at\" AS \"header_media_attachment__updated_at\", \"header_media_attachment\".\"status_id\" AS \"header_media_attachment__status_id\", \"header_media_attachment\".\"url\" AS \"header_media_attachment__url\", \"header_media_attachment\".\"remote_url\" AS \"header_media_attachment__remote_url\", \"header_media_attachment\".\"type\" AS \"header_media_attachment__type\", \"header_media_attachment\".\"original_width\" AS \"header_media_attachment__original_width\", \"header_media_attachment\".\"original_height\" AS \"header_media_attachment__original_height\", \"header_media_attachment\".\"original_size\" AS \"header_media_attachment__original_size\", \"header_media_attachment\".\"original_aspect\" AS \"header_media_attachment__original_aspect\", \"header_media_attachment\".\"small_width\" AS \"header_media_attachment__small_width\", \"header_media_attachment\".\"small_height\" AS \"header_media_attachment__small_height\", \"header_media_attachment\".\"small_size\" AS \"header_media_attachment__small_size\", \"header_media_attachment\".\"small_aspect\" AS \"header_media_attachment__small_aspect\", \"header_media_attachment\".\"focus_x\" AS \"header_media_attachment__focus_x\", \"header_media_attachment\".\"focus_y\" AS \"header_media_attachment__focus_y\", \"header_media_attachment\".\"account_id\" AS \"header_media_attachment__account_id\", \"header_media_attachment\".\"description\" AS \"header_media_attachment__description\", \"header_media_attachment\".\"scheduled_status_id\" AS \"header_media_attachment__scheduled_status_id\", \"header_media_attachment\".\"blurhash\" AS \"header_media_attachment__blurhash\", \"header_media_attachment\".\"processing\" AS \"header_media_attachment__processing\", \"header_media_attachment\".\"file_path\" AS \"header_media_attachment__file_path\", \"header_media_attachment\".\"file_content_type\" AS \"header_media_attachment__file_content_type\", \"header_media_attachment\".\"file_file_size\" AS \"header_media_attachment__file_file_size\", \"header_media_attachment\".\"file_updated_at\" AS \"header_media_attachment__file_updated_at\", \"header_media_attachment\".\"thumbnail_path\" AS \"header_media_attachment__thumbnail_path\", \"header_media_attachment\".\"thumbnail_content_type\" AS \"header_media_attachment__thumbnail_content_type\", 
\"header_media_attachment\".\"thumbnail_file_size\" AS \"header_media_attachment__thumbnail_file_size\", \"header_media_attachment\".\"thumbnail_updated_at\" AS \"header_media_attachment__thumbnail_updated_at\", \"header_media_attachment\".\"thumbnail_url\" AS \"header_media_attachment__thumbnail_url\", \"header_media_attachment\".\"thumbnail_remote_url\" AS \"header_media_attachment__thumbnail_remote_url\", \"header_media_attachment\".\"avatar\" AS \"header_media_attachment__avatar\", \"header_media_attachment\".\"header\" AS \"header_media_attachment__header\", \"header_media_attachment\".\"cached\" AS \"header_media_attachment__cached\" FROM \"accounts\" AS \"account\" LEFT JOIN \"media_attachments\" AS \"avatar_media_attachment\" ON (\"avatar_media_attachment\".\"id\" = \"account\".\"avatar_media_attachment_id\") LEFT JOIN \"media_attachments\" AS \"header_media_attachment\" ON (\"header_media_attachment\".\"id\" = \"account\".\"header_media_attachment_id\") WHERE (\"account\".\"uri\" = 'https://pleroma.what.re/internal/fetch')" msg="SLOW DATABASE QUERY"

timestamp="03/12/2022 22:02:27.494" func=bundb.queryHook.AfterQuery level=WARN duration="3.040551092s" query="SELECT \"account\".\"id\", \"account\".\"created_at\", \"account\".\"updated_at\", \"account\".\"username\", \"account\".\"domain\", \"account\".\"avatar_media_attachment_id\", \"account\".\"avatar_remote_url\", \"account\".\"header_media_attachment_id\", \"account\".\"header_remote_url\", \"account\".\"display_name\", \"account\".\"emojis\", \"account\".\"fields\", \"account\".\"note\", \"account\".\"note_raw\", \"account\".\"memorial\", \"account\".\"also_known_as\", \"account\".\"moved_to_account_id\", \"account\".\"bot\", \"account\".\"reason\", \"account\".\"locked\", \"account\".\"discoverable\", \"account\".\"privacy\", \"account\".\"sensitive\", \"account\".\"language\", \"account\".\"status_format\", \"account\".\"custom_css\", \"account\".\"uri\", \"account\".\"url\", \"account\".\"last_webfingered_at\", \"account\".\"inbox_uri\", \"account\".\"shared_inbox_uri\", \"account\".\"outbox_uri\", \"account\".\"following_uri\", \"account\".\"followers_uri\", \"account\".\"featured_collection_uri\", \"account\".\"actor_type\", \"account\".\"private_key\", \"account\".\"public_key\", \"account\".\"public_key_uri\", \"account\".\"sensitized_at\", \"account\".\"silenced_at\", \"account\".\"suspended_at\", \"account\".\"hide_collections\", \"account\".\"suspension_origin\", \"account\".\"enable_rss\" FROM \"accounts\" AS \"account\" WHERE (\"username\" = 'eric') AND (\"domain\" = 'gts.ericcredeur.com')" msg="SLOW DATABASE QUERY"

I'm mostly inclined to treat these as a red herring (or a symptom/exacerbator rather than a cause). These are the two query types that show up when the going gets tough, though.

I'd guess that once a post "goes somewhere", a ton of new accounts show up to do fetches, miss any cache, need to hit the DB, need to be dereferenced in either a limited or full capacity, etc. Bad times. To support this, here's the log hit from when the GtS official account replied to me today:

timestamp="03/12/2022 22:02:20.919" func=router.loggingMiddleware.func1 level=INFO latency="4.176624ms" clientIP="213.108.108.81" userAgent="gotosocial (+https://gts.superseriousbusiness.org) gotosocial/0.6.0-rc3 git-bc917a4" method="POST" statusCode=200 path="/users/trysdyn/inbox" msg="OK: wrote 23B"

Now, how many times in that minute did I need to fetch a main-key for an account?

[root@electric gotosocial]# grep '03/12/2022 22:02' access.log  | grep 'performing request' | grep main-key | wc -l
739

That's a lot (and it does align with that nasty spike in my monitoring graph above; the log's in UTC, the chart's in PST). I think this is where the problem is. I wouldn't experience this a second time if gotosocial's account were to interact with me again, but it might be worthy of concern, especially since a post going boost-viral would take an extremely unpredictable path through the fediverse and miss the local cache and DB often, no matter its shape.

trysdyn avatar Dec 03 '22 23:12 trysdyn

Thank you for the detailed report! Much appreciated! It's sort-of-known that RAM use spikes when a post gets boosted or replied to by an account with 1k or more followers, but it's good to have this open so that we can track it and try to mitigate it.

tsmethurst avatar Dec 04 '22 11:12 tsmethurst

it feels like a few of those requests that make up a response to a hugely popular post shouldn't be going to the database to begin with…

igalic avatar Dec 04 '22 11:12 igalic

I know this is expected due to how federating interaction context works. I'm not sure anything can even be done about it while still having GtS handle this mode of interaction properly.

On the other hand... SwiftOnSecurity with 28,000 followers just boosted one of my posts and GtS has been bouncing up and down like a trampoline getting repeatedly OOMkilled. :)

I can expand the VPS, of course. I'm pretty sure this woulda killed my old Pleroma instance too, and GtS on an rPi unless it was the 8 GB model, maybe. This probably represents the worst case for the heckstorm one can expect if something gets boosted wildly.

trysdyn avatar Dec 05 '22 21:12 trysdyn

What size is your VPS? I have an instance running on a 2 GB VPS and, aside from some previous bugs, it has never managed to get OOM killed. And that is even after a post of mine today did the rounds and is at > 220 boosts.

NyaaaWhatsUpDoc avatar Dec 05 '22 22:12 NyaaaWhatsUpDoc

When the incident occurred it was a 1 GB VPS with 1.5 GB of swap I snapped on to get through the build process.

I've bumped it up to 2 GB now. If that's the targeted requirement for running GtS, that's cool; but I ran Pleroma on the 1 GB VPS for years without a problem, so it's a tiny bit of a surprise to me that GtS baselines much leaner but grows much heavier under pressure from interactions like this.

Different design outlooks, different languages; I'm not saying GtS is broken because it's acting differently here. I'm more trying to find an expected baseline for requirements, and to find out whether y'all consider it a problem that I can crater a GtS install in this state. Sorry if I seem pushy about it.

trysdyn avatar Dec 05 '22 22:12 trysdyn

No no this is all very useful, thank you! It's definitely something we want to improve :)

NyaaaWhatsUpDoc avatar Dec 05 '22 22:12 NyaaaWhatsUpDoc

SwiftOnSecurity with 28,000 followers just boosted one of my posts and GtS has been bouncing up and down like a trampoline getting repeatedly OOMkilled

Oh shit :D

I'm more trying to find an expected baseline for requirements

We're still trying to figure that out too ;)

This isn't pushy; as kim said, it's very useful!

tsmethurst avatar Dec 06 '22 11:12 tsmethurst

First time poking around in the GTS internals, so apologies in advance if any of this is utter nonsense. It should definitely be possible to solve this problem though - I don't think we're cursed to get OOM killed any time a post gets popular!

The root issue is that none of the libraries involved in HTTP routing and serving on GTS enforce any sort of request queuing or parallelism limits; the number of goroutines the HTTP server backend starts to attempt to serve all incoming requests is not bounded. Typically for a webserver this is fine - highly spiky load is quite unusual. But in the case of stuff like an ActivityPub post being boosted by a big account, we immediately trigger a herd (heh) of similar requests at almost the same time and with no warning.

As each one comes in, the processing stack yields every time IO happens (read from incoming socket, r/w to DB, write to outgoing socket), at which point the router picks up the next incoming request and spawns a goroutine for it... Every one of those requests will sit concurrently in memory while the database churns away. That many parallel DB requests tend to backlog as well, which increases the problem.

I don't know how much of this is cached because I haven't read the caching code yet. However, anything that should be cached probably isn't in the cache when the herd hits, and won't be until the first request is finished. We'll then only serve requests that start processing after the first has finished from the cache anyway.

Ideally, an efficient solution would involve caching everything and marking "in-flight" cache entries when a request starts processing, such that subsequent requests wait for the first one to finish and then return the cached value. But there's probably quite a lot of engineering involved there, and not all requests will be cacheable anyway.
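
Something like Go's golang.org/x/sync/singleflight is the usual building block for that kind of in-flight coalescing; here's a rough sketch of the idea (the Account type and fetchSlow helper are hypothetical stand-ins, not actual GTS code):

package dereference

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// Account stands in for GtS's account model.
type Account struct {
	URI string
}

// accountGetter coalesces concurrent lookups for the same account URI: the
// first caller does the expensive DB/remote work, every other concurrent
// caller for that URI waits for and shares the same result.
type accountGetter struct {
	group singleflight.Group
}

// fetchSlow is a hypothetical stand-in for the real cache-miss path
// (DB query and/or remote dereference).
func (g *accountGetter) fetchSlow(ctx context.Context, uri string) (*Account, error) {
	return &Account{URI: uri}, nil
}

func (g *accountGetter) Get(ctx context.Context, uri string) (*Account, error) {
	v, err, _ := g.group.Do(uri, func() (interface{}, error) {
		// Only one goroutine per URI ever runs this at a time; the result
		// would also be written into the regular cache at this point.
		return g.fetchSlow(ctx, uri)
	})
	if err != nil {
		return nil, err
	}
	return v.(*Account), nil
}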

A simpler approach might be adding HTTP middleware that queues, or simply rejects, (expensive?) requests over a (configurable) parallelisation limit. There'd be some memory overhead from waiting requests, but it ought to be lower. There'd need to be some testing to figure out what a sensible limit for a given amount of resource is. (If that's not sufficient, you'd have to get far enough down the stack to queue things before accept(), so you don't get the overhead of storing the parsed HTTP request, plus TLS context if you're not using a reverse proxy. I don't know how practical that would be.)
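
To illustrate the "reject over a limit" flavour of that in plain net/http terms (a sketch only, not gin-specific; the limit of 10 mirrors the nginx example below and isn't a tested recommendation):

package middleware

import "net/http"

// LimitConcurrency returns middleware that allows at most `limit` requests
// to be processed at once and answers the rest with 503, so a thundering
// herd can't pile an unbounded number of in-flight requests into memory.
func LimitConcurrency(limit int) func(http.Handler) http.Handler {
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			select {
			case sem <- struct{}{}:
				defer func() { <-sem }()
				next.ServeHTTP(w, r)
			default:
				// Over the limit: shed load and hope the remote retries.
				w.Header().Set("Retry-After", "30")
				http.Error(w, "server busy, try again later", http.StatusServiceUnavailable)
			}
		})
	}
}

A queueing variant would block on the semaphore send (ideally with a timeout) instead of falling straight through to the 503.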

If you're using nginx as a reverse proxy, you can achieve the "rejects" version of that with something like

# http context: a 1m shared zone keyed on a constant, so all requests share one counter
limit_conn_zone $const zone=gts:1m;

server {
    # "set" isn't valid at the http level, so define the constant key here
    set $const 1;

    location / {
        limit_conn gts 10;          # reject with 503 once 10 requests are already in flight
        proxy_pass http://whatever;
        ...
    }
}

which will 503 requests once there are already more than 10 requests to GTS in flight. You could further constrain the location block to only apply limit_conn to the paths identified as the ones that get thrashed when a post gets popular. (This does rely on remote servers retrying after a bit if they get 503ed, and not just giving up.)

(Edit to add: not to rule out other efficiency gains we might make in request processing that would also help!)

Sentynel avatar Dec 06 '22 13:12 Sentynel

We do actually have rate limiting middleware in place which should serve to alleviate much of this. It simply needs enabling via config.


NyaaaWhatsUpDoc avatar Dec 06 '22 14:12 NyaaaWhatsUpDoc

We do actually have rate limiting middleware in place which should serve to alleviate much of this. It simply needs enabling via config.

I did have a bit of a look at that. Right now it's bucketing by IP address, which isn't very useful here, but you're right, there's no reason we couldn't add another rate limiter that doesn't bucket, so that it enforces a global rate limit. I'd be a little worried that it would be more difficult to tune than a concurrency limit, because I suspect there's a lot of variability between requests in how long processing takes, and that a flood of near-identical requests is particularly bad for performance. It sounds fiddly, but maybe something like a global limit with a short window, applied only to the request categories that happen when a big account boosts, could work. Do any of the people this has happened to have more logs they could share, in terms of how fast the requests hit after a big account boosts vs. normal traffic flow?
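
For reference, a global (non-bucketed) limiter is only a few lines on top of golang.org/x/time/rate; a sketch with made-up numbers, just to show the shape of it:

package middleware

import (
	"net/http"

	"golang.org/x/time/rate"
)

// GlobalRateLimit applies one token-bucket limiter to every request,
// regardless of client IP. rps and burst are illustrative knobs that
// would need tuning against real traffic.
func GlobalRateLimit(rps float64, burst int) func(http.Handler) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(rps), burst)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !limiter.Allow() {
				http.Error(w, "rate limited", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}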

Sentynel avatar Dec 06 '22 15:12 Sentynel

Do any of the people this has happened to have more logs they could share in terms of how fast the requests hit after a big account boosts vs normal traffic flow?

I went digging in my Prometheus metrics. Caddy exposes how many requests it's proxying at any given time, so I don't think we need to process logs here (but I can go log diving tonight if we need). I was able to figure out a few important breakpoints:

  • In normal use, when my posts aren't being boosted, caddy_requests_in_flight doesn't exceed 10. Presumably that's because by now I've dereferenced every follower-of-follower account through the natural progression of federation, and we just serve those fetches instantly.
  • The biggest blip that didn't result in GtS disruption was 125 open requests at once. GtS handled this fine, even on my 1 GB VPS with no RAM tuning and a Postgres DB.
  • GtS's official account replied to a post of mine, sparking a 1,400-follower rush. Caddy was forwarding about 550 open requests at once before GtS went down. Once GtS was restored, the requests came in more smoothly as retries staggered.
  • The Taylor Swift Hug of Death ™️ topped out at 2,100 requests in flight. My instance bounced 4 or 5 times. It looks like the in-flight request counts that killed GtS here were 2,100, 750, 550, and 500, before retries and lagging requests staggered enough to stop the herd.

These request spikes were effectively instant: the 550 and 2,100 requests that came in and choked my box all arrived within one metric-gathering tick of 15 seconds.

I'd ballpark that on the most constrained of "reasonable" loadouts (a $5 VPS using Postgres with nothing else running on it and no RAM tuning), we'd start seeing problems around 250 requests in flight, and we could document "Hey, if you're doing this on a cramped 1 GB box, configure this limit in your terminator".

At 250 requests per 500 MB of free RAM, a 2 GB instance like Kim is running should be able to handle 700+, which likely shoulders a first-time interaction from the GtS account. The SwiftOnSecurity boost would be a problem for practically anything, though.

trysdyn avatar Dec 06 '22 20:12 trysdyn

That's great data. It sounds like a limit enforced by the proxy of "500-ish in-flight requests per GB of free RAM" should be a good starting point for surviving this sort of event.

If possible, can you also get the total number of requests handled and the latency figures for the 125-request spike it survived? If we can calculate a rough peak survivable rate from that, Kim's suggestion of enforcing a global rate limit via the existing middleware would be an option for deployments without a reverse proxy.

Sentynel avatar Dec 06 '22 21:12 Sentynel

Thanks for gathering this data :)

A simpler approach might be adding HTTP middleware that queues, or simply rejects, (expensive?) requests over a (configurable) parallelisation limit

So throttling basically? :) There are a couple of throttling middlewares kicking around for gin, if I remember rightly, and we could also write our own fairly easily I think.

an efficient solution would involve caching everything and marking "in-flight" cache entries when a request starts processing, such that subsequent requests wait for the first one to finish and then return the cached value. But there's probably quite a lot of engineering involved there, and not all requests will be cacheable anyway.

The way we do things currently with our caches should aaalmost do this, I believe, since they're read/write locked with mutexes. There are almost certainly still cases where concurrent requests cause duplicate db lookups though.

Perhaps it's worth making an issue to investigate throttling middleware and see if that can alleviate enough of the issues for now? What do you all think?

[edit] Also, and aside from this, I think a lot of slowdown is caused by the dereferencing + processing of remote headers and avatars. Perhaps we should also investigate limiting how many concurrent avis + headers can be dereferenced at once. For example, if you look at this part of the code, you can see that we actually store thumbnails in memory briefly instead of streaming their conversion process, which could also be responsible for RAM spikes: https://github.com/superseriousbusiness/gotosocial/blob/199b685f430910910e43476caa9ccec6a441d020/internal/media/image.go#L170-L177
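
For illustration, the usual Go pattern for avoiding that in-memory buffer is to encode through an io.Pipe so the storage side consumes a stream; a rough sketch, assuming a storage backend that accepts an io.Reader (the Storage interface here is made up, not our actual driver):

package media

import (
	"context"
	"image"
	"image/jpeg"
	"io"
)

// Storage is a hypothetical stand-in for a blob store that can consume a stream.
type Storage interface {
	PutStream(ctx context.Context, key string, r io.Reader) error
}

// storeThumbnail encodes img as JPEG straight into storage without first
// collecting the whole encoded thumbnail into a []byte.
func storeThumbnail(ctx context.Context, st Storage, key string, img image.Image) error {
	pr, pw := io.Pipe()

	// Encode on a separate goroutine; PutStream reads from the other end
	// of the pipe, so only a small buffer is ever held in memory.
	go func() {
		err := jpeg.Encode(pw, img, &jpeg.Options{Quality: 70})
		pw.CloseWithError(err) // propagate encode errors to the reader
	}()

	return st.PutStream(ctx, key, pr)
}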

tsmethurst avatar Dec 06 '22 21:12 tsmethurst

So throttling basically? :) There are a couple of throttling middlewares kicking around for gin, if I remember rightly, and we could also write our own fairly easily I think.

chi ships with a throttling middleware that looks like it would do what we want (you can set it to allow concurrent processing of n requests, backlog another m requests, and drop anything beyond that) and be reasonably easy to adapt: https://github.com/go-chi/chi/blob/master/middleware/throttle.go. I couldn't immediately find anything already built for gin that limits concurrency rather than using a time-based rate limit, though that doesn't mean it doesn't exist! Alternatively, we could test a global time-based rate limit with the existing middleware.
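
For reference, wiring chi's version up looks roughly like this (the numbers are illustrative, not tested recommendations):

package main

import (
	"net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
)

func main() {
	r := chi.NewRouter()

	// Process at most 50 requests concurrently, queue up to 200 more,
	// and reject anything that overflows the backlog or waits longer than 5s.
	r.Use(middleware.ThrottleBacklog(50, 200, 5*time.Second))

	r.Get("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", r)
}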

[edit] Also, and aside from this, I think a lot of slowdown is caused by dereferencing + processing of remote headers and avatars. Perhaps we should also investigate limiting how many concurrent avis + headers can be dereferenced at once. For example, if you look at this part of the code, you can see that we actually store thumbnails in memory briefly instead of streaming their conversion process, which could also be responsible for ram spikes:

Probably worth spending some time with the profiler to try and identify stuff like this (I see it does serve /debug/pprof on debug builds). Do we have traffic generation scripts for this sort of testing already?

Sentynel avatar Dec 06 '22 23:12 Sentynel

If possible, can you also get the total number of requests handled and the latency figures for the 125-request spike it survived? If we can calculate a rough peak survivable rate from that, Kim's suggestion of enforcing a global rate limit via the existing middleware would be an option for deployments without a reverse proxy.

During the 125 in-flight request spike, full round-trip times for 200 OK responses were 0.2s at the median and 8.2s at the 95th percentile (over a 60s window). The 95th meshes up well with some slow-query warnings I got in logs reporting 7+ second query times. It looks like in total, during the hottest part of the spike, GtS served about 4,500 requests over a 2 minute window (roughly 37 requests per second sustained).

That was a post that got boosted a lot, but by a bunch of smaller accounts. I honestly don't even remember what I said at this point. But yeah, GtS survived that burst.

Given the wild variance between the 50th and 95th percentiles, I suspect there's some query in here that's slower than the others, but it may be premature to try to optimize that deep. I think if we just ask instances to try again later (albeit rudely, with a 503) at the redline, either via documented terminator behavior or internal handling, we can call that robust enough for an alpha.

trysdyn avatar Dec 07 '22 00:12 trysdyn

Good data. If it's just a few outlier requests that are taking ages to process, I don't think rate limiting is going to help. So I agree, 503ing above a defined concurrency limit is probably the way to go. That's a start for the alpha, but I think it's probably also a good backstop even if we clear out some of the issues - if a big enough account boosts a small enough server, it's gonna hit resource limits eventually.

I am curious about those outliers though - hitting 40 times slower at the 95th percentile is a lot. The first DB query in your example has a few joins going on, but the second is just a straight SELECT, which really shouldn't be slow unless the database is getting absolutely thrashed. I don't suppose your metrics or logs tell you which specific API endpoints are getting the really slow responses?

Sentynel avatar Dec 07 '22 09:12 Sentynel

Do we have traffic generation scripts for this sort of testing already?

No not yet; we haven't done performance or load testing beyond informal stuff.

the second is just a straight SELECT, which really shouldn't be slow unless the database is getting absolutely thrashed

Or an index is missing somewhere :P Which could well be the case. We've been optimizing db queries over time one by one, but there are likely still some slow ones that are not indexed appropriately.

Also, we've got an issue open for removing those excessive joins so we can use caches instead. We already did a few but there's more to do: https://github.com/superseriousbusiness/gotosocial/issues/1094

tsmethurst avatar Dec 07 '22 13:12 tsmethurst

How much of this traffic could be alleviated with a short-term HTTP cache, e.g. https://github.com/caddyserver/cache-handler? Could the webfinger response, toot JSON, generated thumbnails, etc. be cached for a few minutes with a Cache-Control: public, max-age=300 header to better handle a thundering herd?
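
The GtS-side half of that would just be setting the header from middleware on the cacheable routes; a minimal sketch (whether a given response is actually safe to mark public is the real question):

package middleware

import (
	"net/http"
	"strconv"
)

// CacheControl marks responses as publicly cacheable for maxAge seconds, so a
// fronting proxy (or the remote side's HTTP cache) can absorb repeat fetches
// instead of the server regenerating the same response hundreds of times.
func CacheControl(maxAge int) func(http.Handler) http.Handler {
	value := "public, max-age=" + strconv.Itoa(maxAge)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Cache-Control", value)
			next.ServeHTTP(w, r)
		})
	}
}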

mperham avatar Dec 07 '22 19:12 mperham

Mm, might be worth a try. We briefly tried caching some JSON responses in memory in GtS a year or so back, but back then our cache implementation (read: something I just hacked together) was not good. It's a lot better now (because we're no longer using the thing I just hacked together), so it might pay off to try caching more ready-to-go responses in memory. Not sure. Ideally (imo) we'd do an initial version of this without involving a reverse proxy at all, to simplify things...

tsmethurst avatar Dec 07 '22 19:12 tsmethurst

I have unfortunately discovered that getting a single reply from a particular high-follower-count (~10,000) account will effectively make my GoToSocial instance and everything else on my small (1GiB RAM) VPS unusable. I can't even SSH in, so I'm forced to power-cycle the VPS! I think that's because many (dozens? hundreds?) of instances make a bunch of requests at the same moment and overwhelm the server.

It's a bit surprising that the impact is this bad, though. I think the big problem might be the disk I/O? I'm not sure why that spikes? Is it the logging possibly? I'm pretty sure I don't have a swapfile set up…

I'm running GTS behind nginx so I might experiment a bit with a concurrent connections limit there. Is there any other potential quick-fix with nginx? For example, can I get nginx to do caching for me without much configuration?

hikari-no-yume avatar Dec 16 '22 21:12 hikari-no-yume

mm, it shouldn't be the logging; more likely it's the number of accounts your instance has to go and dereference when it suddenly gets all those requests to view the status that was just replied to: that's a lot of public keys to fetch very quickly. we're working on a couple of things to alleviate this right now, which we'll likely include in 0.7.0 in the new year, so hopefully that'll help :)

it is odd that it makes your instance totally unusable though; i also run on a 1gb vps (with 1 cpu) and haven't seen that level of performance degradation myself... might be worth checking with your vps provider to see what kind of disk they use?

tsmethurst avatar Dec 18 '22 09:12 tsmethurst

Opened a throttling PR (see above) that relates to this issue

tsmethurst avatar Jan 03 '23 11:01 tsmethurst

To update: #1297 is a massive help. I haven't had my instance turn into a smoking crater since and I've had a couple of pretty big boosts since then.

I'm satisfied from the perspective of "I am deliberately running on the smallest system you would reasonably try to support". I know there was some talk about tuning up some of the retrieval pathways we identified as RAM-hungry or as tying up workers, so I won't close this.

trysdyn avatar Jan 24 '23 12:01 trysdyn

Great, I'm glad things work OK for now :)

tsmethurst avatar Jan 26 '23 14:01 tsmethurst

I think we can safely close this one for now and open up more specific issues for tuning as we move on.

tsmethurst avatar Jul 31 '23 16:07 tsmethurst