mastodon-on-aws

Enable auto-scaling for web and streaming API

Open andreaswittig opened this issue 3 years ago • 10 comments

Evaluate and implement auto-scaling for ECS services web and streaming API.

andreaswittig avatar Nov 16 '22 20:11 andreaswittig

+1 for this feature

(some documentation of best practices on manual scale up process would be nice too)

scrappydog avatar Nov 24 '22 16:11 scrappydog

After several days of working with the three services, I found that an HA, auto-scaling configuration works out of the box if you set AutoScaling to true and set DesiredCount, MaxCapacity, and MinCapacity. The only service that doesn't scale well is the sidekiq service. According to this page https://docs.joinmastodon.org/admin/scaling/#sidekiq, you can run multiple sidekiq services on different queues, except for the scheduler queue: there can only be one of those. My fork has a few of these changes already in the istoleyourpw-deploy branch: https://github.com/compuguy/mastodon-on-aws
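For anyone setting the same bounds outside the template, here is a minimal sketch of how MinCapacity, MaxCapacity, and DesiredCount relate. It only builds the arguments for Application Auto Scaling's `register_scalable_target` call; the cluster and service names are hypothetical, and this is not how the template itself wires things up.

```python
def scalable_target_kwargs(cluster, service, min_capacity, max_capacity, desired_count):
    """Build keyword arguments for register_scalable_target on an ECS service.

    The cluster and service names are placeholders; use the names from your stack.
    """
    # DesiredCount has to sit inside the scaling bounds, otherwise the first
    # scaling activity will immediately move it.
    if not 1 <= min_capacity <= desired_count <= max_capacity:
        raise ValueError("expected 1 <= MinCapacity <= DesiredCount <= MaxCapacity")
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


# Hypothetical values: two tasks minimum for HA, up to six under load.
web_target = scalable_target_kwargs("mastodon", "web", 2, 6, 2)
print(web_target)
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("application-autoscaling").register_scalable_target(**web_target)`.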

Edit: Came across this article (https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/), which explains how to split up the sidekiq tasks. You can run multiple instances with the default, push, and pull queues, and a single instance for mailer and scheduler.
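A rough sketch of that split, assuming one scalable service for default/push/pull and one single-task service for mailer/scheduler. The service names, task counts, and concurrency values are made up for illustration, and only the queues named above are covered; a real deployment has to assign every queue it uses to some process. `-c` and `-q` are the Sidekiq CLI flags for concurrency and queue.

```python
# Queues that must be served by exactly one process (per the scaling docs linked above).
SINGLETON_QUEUES = {"scheduler"}


def sidekiq_command(queues, concurrency):
    """Return the argv for one Sidekiq process serving the given queues."""
    argv = ["bundle", "exec", "sidekiq", "-c", str(concurrency)]
    for queue in queues:
        argv += ["-q", queue]
    return argv


def check_plan(plan):
    """Raise ValueError unless each singleton queue runs in exactly one task."""
    for queue in SINGLETON_QUEUES:
        tasks = sum(s["desired_count"] for s in plan if queue in s["queues"])
        if tasks != 1:
            raise ValueError(f"{queue!r} queue must run in exactly one task, got {tasks}")


# Hypothetical plan: names, counts, and concurrency are illustrative only.
PLAN = [
    {"name": "sidekiq-general", "queues": ["default", "push", "pull"],
     "desired_count": 3, "concurrency": 25},
    {"name": "sidekiq-single", "queues": ["mailer", "scheduler"],
     "desired_count": 1, "concurrency": 5},
]

check_plan(PLAN)
for service in PLAN:
    print(service["name"], " ".join(sidekiq_command(service["queues"], service["concurrency"])))
```

The check exists because scaling the second service beyond one task would run the scheduler queue more than once, which is the case the docs warn against.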

compuguy avatar Nov 25 '22 02:11 compuguy

My Sidekiq task is regularly pegging at 100% CPU utilization... definitely need some guidance on configuring scaling...

scrappydog avatar Nov 29 '22 18:11 scrappydog

@scrappydog Same for us. I'm not sure if that is an issue. It likely doesn't matter if the background tasks utilize all resources as long as they finish without much delay. For us, we see spikes to 100% but only for minutes. Do you see the same pattern?

[Screenshot 2022-11-28 at 09 42 10]

michaelwittig avatar Nov 29 '22 18:11 michaelwittig

That looks very similar to utilization on my instance.

My inner system admin really "wants" to add another task... but I agree as long as jobs are completing in a reasonable time it's not an immediate issue.

BUT we are running tiny instances for testing... we NEED a way to scale up... :-)

scrappydog avatar Nov 29 '22 18:11 scrappydog

I bumped the allocation on the Sidekiq task up to 0.5 vCPU and 3 GB memory...

This feels happier for now... but it doesn't address the real scalability question...

scrappydog avatar Nov 29 '22 22:11 scrappydog

image

Upgraded about halfway through this graph... definitely a lot better!

scrappydog avatar Nov 30 '22 13:11 scrappydog

I opened up #20 for sidekiq. This issue is about auto-scaling for web and streaming API.

Enabling auto-scaling is not the big deal here. What we need is a good metric to trigger scale out/in. And we need a test workload to test this with. I have no idea how we can simulate Mastodon load. If anyone reading this is running an instance with enough users to benefit from auto-scaling, please let us know.
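To make the metric question concrete, here is a sketch of what a target tracking policy on average CPU would look like as arguments to Application Auto Scaling's `put_scaling_policy`. CPU is only one candidate (the predefined `ALBRequestCountPerTarget` metric is another), and the 60% target, the cooldowns, and the cluster and service names are assumptions, not tested recommendations.

```python
def cpu_target_tracking_kwargs(cluster, service, target_cpu_percent=60.0,
                               scale_out_cooldown=60, scale_in_cooldown=300):
    """Build keyword arguments for put_scaling_policy (target tracking on CPU).

    All default values are illustrative assumptions and need load testing.
    """
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale out when average CPU across tasks is above the target,
            # scale in when it stays below.
            "TargetValue": target_cpu_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            "ScaleOutCooldown": scale_out_cooldown,  # seconds
            "ScaleInCooldown": scale_in_cooldown,    # seconds
        },
    }


# Hypothetical cluster and service names.
policy = cpu_target_tracking_kwargs("mastodon", "streaming-api")
print(policy["PolicyName"])
```

The service must already be registered as a scalable target before a policy like this can be attached. Whether CPU actually tracks user-facing load for the web and streaming API is exactly the open question here.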

michaelwittig avatar Dec 02 '22 12:12 michaelwittig

Just add a relay server and you will have CPU load in a minute.

https://github.com/brodi1/activitypub-relays

nodomain avatar Dec 04 '22 08:12 nodomain

> I opened up #20 for sidekiq. This issue is about auto-scaling for web and streaming API.
>
> Enabling auto-scaling is not the big deal here. What we need is a good metric to trigger scale out/in. And we need a test workload to test this with. I have no idea how we can simulate Mastodon load. If anyone reading this is running an instance with enough users to benefit from auto-scaling, please let us know.

Yeah, it's quite easy to autoscale the web and streaming APIs. But for most people it's #20 that's more important, since Sidekiq does most of the heavy lifting for Mastodon...

compuguy avatar Dec 04 '22 20:12 compuguy