feature: scheduled scaling up
Sometimes we can predict an increase in resource consumption before it actually happens (e.g., a TV feature, a push notification in the app, or load testing of an upstream service in dev). This feature allows people to schedule scaling up before the increase actually happens. They will configure it with "when to scale up" and "how long to scale up" so that everything goes back to normal afterward.
Maybe this feature will eventually replace resourcePolicy in Tortoise (and perhaps Emergency mode as well?).
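To make it concrete, here's a rough sketch of what the API could look like. This is just to illustrate the idea, not the final design; all field names (schedule, startAt, finishAt, targetRefs, strategy, minimumMinReplicas) are placeholders:

```go
// Package v1alpha1 is a hypothetical API group/version for ScheduledScaling.
package v1alpha1

// ScheduledScalingSpec is a rough sketch, NOT the final design.
type ScheduledScalingSpec struct {
	// Schedule is "when to scale up" and "how long to scale up".
	Schedule Schedule `json:"schedule"`
	// TargetRefs points to the Tortoise that should be pre-scaled.
	TargetRefs TargetRefs `json:"targetRefs"`
	// Strategy describes how much capacity should be guaranteed while the schedule is active.
	Strategy Strategy `json:"strategy"`
}

// Schedule defines when the pre-scaling starts and when it ends,
// so that everything can go back to normal afterward.
type Schedule struct {
	// StartAt is an RFC3339 timestamp at which the pre-scaling begins.
	StartAt string `json:"startAt"`
	// FinishAt is an RFC3339 timestamp at which the pre-scaling ends.
	FinishAt string `json:"finishAt"`
}

// TargetRefs refers to the Tortoise to be pre-scaled.
type TargetRefs struct {
	TortoiseName string `json:"tortoiseName"`
}

// Strategy defines the minimum capacity kept during the scheduled window.
type Strategy struct {
	// MinimumMinReplicas is the minReplicas that the target should keep
	// while this ScheduledScaling is active.
	MinimumMinReplicas int32 `json:"minimumMinReplicas"`
}
```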
- [ ] @randytqwjp Generate templates for the ScheduledScaling CRD and the controller via `kubebuilder create api`.
- [ ] @randytqwjp Add all the necessary fields to the ScheduledScaling CRD based on the design. Please write appropriate comments on each field; they'll be the documentation for users. No need to implement the logic yet.
- [ ] @sanposhiho Elaborate the implementation part.
@sanposhiho Could you elaborate again on why we need a separate controller for this? Implementation-wise, it seems like it would be much easier to do in the tortoise controller by just changing the recommenders field. We could still have a new CRD but handle the changes in the same controller.
That's a good question, because it leads to a core principle of Kubernetes: having separate APIs with separate controllers is a very common best practice. For example, in kube-controller-manager (where the controllers for all the built-in resources are implemented), each resource has its own controller. Take CronJob: you might think it could/should be implemented inside the Job controller, but it isn't. You can find a lot of other examples where things look like they could be implemented together, but actually aren't.
There are many different reasons, but the core one is separation of concerns. Tortoise is responsible for autoscaling/recommendation based on historical data. ScheduledScaling, on the other hand, introduces pre-scaling of resources based on the user's intent. Tightly combining these two fundamentally different concerns in the same controller increases complexity (and hence debugging difficulty) and makes the code harder to maintain in the long term. (Especially since, as you know, the existing Tortoise controller's logic is already quite complex.) So we should have a clear boundary between those two APIs, so that they collaborate only through the APIs, without depending tightly on each other.
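To make the boundary concrete, I'm imagining something along these lines: the ScheduledScaling controller reads its own resource, looks up the target Tortoise through the API, and only mutates the Tortoise object; it never touches HPA/VPA or the recommendation logic directly. This is only a sketch; the import paths and the `isActive`/`applyScheduledScaling`/`removeScheduledScaling` helpers are placeholders:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder import paths for the two API groups.
	autoscalingv1alpha1 "github.com/mercari/tortoise/api/autoscaling/v1alpha1"
	autoscalingv1beta3 "github.com/mercari/tortoise/api/v1beta3"
)

// ScheduledScalingReconciler reconciles ScheduledScaling objects.
type ScheduledScalingReconciler struct {
	client.Client
}

func (r *ScheduledScalingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ss autoscalingv1alpha1.ScheduledScaling
	if err := r.Get(ctx, req.NamespacedName, &ss); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// The only thing this controller knows about Tortoise is its API.
	var tortoise autoscalingv1beta3.Tortoise
	key := client.ObjectKey{Namespace: ss.Namespace, Name: ss.Spec.TargetRefs.TortoiseName}
	if err := r.Get(ctx, key, &tortoise); err != nil {
		return ctrl.Result{}, err
	}

	// Mutate only the Tortoise object. How the tortoise controller then
	// propagates that change to HPA/VPA is its own responsibility.
	if isActive(ss, time.Now()) { // placeholder helper
		applyScheduledScaling(&tortoise, ss.Spec.Strategy) // placeholder helper
	} else {
		removeScheduledScaling(&tortoise) // placeholder helper
	}

	return ctrl.Result{}, r.Update(ctx, &tortoise)
}
```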
This "concern separation" comes with some other benefits. One is testability: Having different APIs for each concern allows us to separate tests as well. All the tests for ScheduledScaling can be done without tortoise's logic; we'd be able to just put a ScheduledScaling and verify how Tortoise resource gets changed. Those tests don't have to check how Tortoise modify, for example, HPA actually because "when tortoise is changed, the tortoise controller would change HPA (etc) like this" would be the responsibility of tortoise controller's test, not ScheduledScaling's ones.
Another benefit is a limited blast radius. Let's say there's a bug in new logic around ScheduledScaling: if we implement it in the tortoise controller, it might break the entire tortoise controller and cause a much wider impact in the cluster. Also, every time we change something in either Tortoise or ScheduledScaling, we'd have to check that both still work correctly after the change.
Also, "the future extensibility" would be yet another benefit. Even though the current ScheduledScaling is very simple, which might look easier to you to implement the current ScheduledScaling into Tortoise controller (honestly, I don't think so with even the current ScheduledScaling though), its features would get bigger and bigger for long run. For example, what if you want to expand targetRefs to have labelSelector so that a single ScheduledScaling can refer to multiple tortoises. For example, what if you want to add hpaName to targetRefs so that users who don't use Tortoise, but only use HPA can still utilize ScheduledScaling. I can think of multiple scenarios that would complicate the implementation further if we put it into the tortoise controller.
Finally, "performance". We don't have any scale testing for tortoise controller though (we should have!), it already queries a lot of stuff to kube-apiserver. So, I guess the reconciliation is not very fast even at the moment. Then, like I discussed with.. you? or someone when I was still in Mercari, if we want to implement auto-emergency mode -ish stuff, we'd probably have to query metrics (via datadog etc), which would degrade the reconcile latency further more. Putting ScheduledScaling would be additional overhead there unnecessarily.
Okay, I've already explained the benefits at too much length, so let's look at the downside as well for balance. I do know that having another controller means duplicating some fundamental plumbing around reconciliation, e2e testing, etc. That's what it is; that's the cost many controllers in this world pay... and that's the only cost you have to pay. It's a fair deal to pay it up front and get all the benefits above.