Kubernetes documentation
I've started a wiki page to collect operational info on running Sidekiq in Kubernetes, but I don't have any direct experience with k8s! If you have knowledge of this particular topic, please help out by writing it down. I've created a basic structure, but please modify and add topics as necessary.
https://github.com/mperham/sidekiq/wiki/Kubernetes
Took a first-pass at the "Safe Shutdown" and "Health Checks" sections. Let me know if anything looks confusing there.
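The "Safe Shutdown" idea can be sketched as a Deployment fragment. This is a minimal sketch, not taken from the wiki page itself: the Deployment/image names are illustrative, and the key assumption is that `terminationGracePeriodSeconds` should be at least as long as Sidekiq's own shutdown timeout (`-t`, 25 seconds by default), so Sidekiq has time to requeue in-flight jobs before Kubernetes follows SIGTERM with SIGKILL.

```yaml
# Sketch only -- names and image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-worker
spec:
  template:
    spec:
      # Kubernetes sends SIGTERM on pod deletion; Sidekiq treats TERM
      # as a shutdown request and pushes any jobs still running after
      # its -t timeout back to Redis for re-execution. Keep this value
      # >= Sidekiq's shutdown timeout so that requeue can happen
      # before Kubernetes sends SIGKILL.
      terminationGracePeriodSeconds: 30
      containers:
        - name: sidekiq
          image: myapp:latest # illustrative
          command: ["bundle", "exec", "sidekiq", "-t", "25"]
```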
I'll try to put something together that is measured for the "Autoscaling" section. Based on my personal experience, I would usually say "Don't."
Took a pass at the Autoscaling section. Tried to keep it high-level since the part about external metrics (I believe) is going to be very dependent on what cloud/metrics providers folks use (e.g. AWS, Datadog, etc.)
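To make the external-metrics idea concrete, here is a hedged HPA sketch. It assumes some metrics adapter (e.g. prometheus-adapter, or a cloud provider's equivalent) already exposes a queue-latency external metric under the name used below; the metric name, target value, and deployment name are all assumptions for illustration, not anything Sidekiq or Kubernetes provides out of the box.

```yaml
# Sketch only -- assumes an external-metrics adapter is installed and
# exposes the metric named below. All names/values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq_queue_latency_seconds # assumed metric name
        target:
          type: Value
          value: "30" # scale up when queue latency exceeds ~30s
```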
What do you think about adding a startupProbe? The nice thing about a startup probe is that it disables both the liveness and readiness checks until it passes, and it only runs during startup. This seems like a prime spot to check for the touched file's existence. That way we don't have to battle with the initialDelaySeconds value, which will vary a lot depending on how big or small your Rails app is, whereas a startup probe generally solves the problem.
We could still have a liveness probe for continuous checks on something else, but I'm not sure about a readiness probe for a background worker. The Kubernetes docs mention that a readiness check moves failing pods out of a service's load balancer, but in a background worker's case you wouldn't have a load balancer in front of it, right?
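The startup-probe idea above might look roughly like this. It assumes the app touches a marker file once Sidekiq has booted (e.g. from a Sidekiq `:startup` lifecycle hook in an initializer); the file path and the probe timings are illustrative assumptions.

```yaml
# Sketch only -- /tmp/sidekiq_started is an assumed path that the app
# would touch from a Sidekiq :startup lifecycle hook.
startupProbe:
  exec:
    # Succeeds once the marker file exists.
    command: ["cat", "/tmp/sidekiq_started"]
  # One attempt every 5s, up to 60 attempts: allows ~5 minutes of
  # boot time without tuning initialDelaySeconds per app.
  periodSeconds: 5
  failureThreshold: 60
livenessProbe:
  exec:
    command: ["cat", "/tmp/sidekiq_started"]
  periodSeconds: 30
```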
I like the idea of scaling off queue latency or the number of items in the queue. I haven't implemented this personally, but I had similar ideas around using queue size.
One other thing to consider here is the connection pool size. If you scale up to a larger number of workers, you may run out of database connections.
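The pool-size concern can be sketched in a Rails `database.yml` fragment: size the pool to match Sidekiq's concurrency so each worker thread can hold a connection. Using `RAILS_MAX_THREADS` for both is a common Rails convention, not something Sidekiq requires; the total connections across all pods (pods × concurrency) still has to fit within the database's own connection limit.

```yaml
# Sketch only -- config/database.yml fragment. RAILS_MAX_THREADS is
# assumed to be set to the same value as Sidekiq's concurrency.
production:
  adapter: postgresql
  # One connection per Sidekiq worker thread; note that the
  # database must handle (number of pods) x (this pool size).
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 10) %>
```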
We use k8s HorizontalPodAutoscalers (HPAs) quite prominently in our Sidekiq deployments. The pods of those k8s deployments typically contain one Sidekiq worker with an appropriate level of concurrency. Each pod has its own DB connection pool, so if you configure the pool size correctly for one instance, all others will be fine, as long as your DB itself can handle the aggregate of connections.
We publish Prometheus metrics from the app and hook them up to k8s via external metrics. In some of our HPAs, we use a custom metric, defined within the k8s external-metric configuration, that combines several metrics, mostly to reduce thrashing on Sidekiq queues that contain longer-running jobs. The custom metric takes into account active jobs, queue size, and the actual number of pods still up for the deployment, since the HPA "forgets" about pods once k8s has signaled them to shut down but before the Sidekiq processes have actually finished their work and exited. This fixed a problem where, for our queues with long-running jobs, our reported number of Sidekiq processes was consistently above the HPA's configured max.

We have also categorized our queues by independent, scalable operations within the app (like onboarding or async-eventing operations), and each of these queues has its own HPA/deployment with a minimum pod count geared toward low-to-mid usage for that queue.
I'm not sure how to fit that into the already-great content in the Wiki page... but I agree that it introduces a lot of complexity. However, at least for our use cases, it has become quite powerful.
Isaac, I would suggest a blog post discussing your system along with a link to it in the wiki. You are right that it doesn’t feel appropriate to put that level of complexity in the wiki directly but links to further resources are always welcome.
Do you have a sample GitHub repo for an autoscaled Sidekiq Kubernetes deployment?