Scheduled autoscaling scales down to default instance_min_count around midnight
Using the following recurring schedule our application scales up at 00:00 on the 19th day of the month. We want to keep running with that same amount of instances until the end of the 27th day of the month. And we want to do that each month.
But what happens is that every night, at around 00:06 we have autoscaling events. We think this problem is due to the 00:00-23:59 start_time/end_time configuration which is round down to the minute. So, basically there is one minute between 23:59 and 0:00 where the autoscaler "thinks" there is no recurring schedule. If not mistaken, 00:06 makes sense due to due cool down/breach settings.
Adding seconds, like 23:59:59 isn't allowed according to the API. There also doesn't seem to be an option to configure it like "from the 19th day 00:00 until the 27th day 23:59.
Any suggestions?
{
"instance_min_count": 28,
"instance_max_count": 36,
"scaling_rules": [
{
"metric_type": "throughput",
"breach_duration_secs": 300,
"threshold": 60,
"operator": "<",
"cool_down_secs": 300,
"adjustment": "-1"
},
{
"metric_type": "throughput",
"breach_duration_secs": 60,
"threshold": 90,
"operator": ">=",
"cool_down_secs": 60,
"adjustment": "+1"
}
],
"schedules": {
"timezone": "Europe/Amsterdam",
"recurring_schedule": [
{
"start_time": "00:00",
"end_time": "23:59",
"days_of_month": [
19,
20,
21,
22,
23,
24,
25,
26,
27
],
"instance_min_count": 36,
"instance_max_count": 36
}
]
}
}
Hi martinvisser, you've identified an intriguing corner case. To aide the investigation, can you please send us the complete scaling histories of the days 19 and 20. Kind regards, Susanne
These are the events from the last days:
2024-11-21T00:00:44.00+0100 audit.app.process.ready web index: 35, cell_id: 2d776afa-ce2e-4f06-bafd-e79a4a7c5a45, instance: 2ef3ab10-ba1f-4d29-6c36-ae7f
2024-11-21T00:00:01.00+0100 audit.app.process.scale instances: 36
2024-11-20T23:59:53.00+0100 audit.app.process.scale instances: 35
2024-11-20T00:00:45.00+0100 audit.app.process.ready web index: 35, cell_id: 48e574ba-10a7-4d35-8236-95aec1673ac1, instance: ffff8f4d-c933-405c-4cbf-3869
2024-11-20T00:00:01.00+0100 audit.app.process.scale instances: 36
2024-11-19T23:59:53.00+0100 audit.app.process.scale instances: 35
@martinvisser could you please share the relevant output of cf autoscaling-history that'd be awesome :)
https://github.com/cloudfoundry/app-autoscaler-cli-plugin?tab=readme-ov-file#examples-5
This is from this night:
Scaling Type Status Instance Changes Time Action Error
scheduled succeeded 35->36 2024-11-22T00:00:00+01:00 +1 instance(s) because limited by min instances 36
dynamic succeeded 36->35 2024-11-21T23:59:52+01:00 -1 instance(s) because throughput < 60rps for 300 seconds
Thanks, that is very helpful to ensure that our assumptions are aligned with the observed behaviour on your side.
As said, you identified a very nice corner case, congratulations 👼 👍.
@silvestre any insights that you'd like to share when it comes to the priority of improving here?
Haha, thanks, I guess 🤣
Thank you for taking the time to bring this to our attention.
We truly appreciate your detailed observations and feedback. You are absolutely correct in identifying the missing functionality.
At present, our direct users do not heavily rely on the schedules feature, which means we are unable to prioritize this issue immediately. However, we genuinely value community contributions, and if you or anyone else is interested in addressing this, we warmly welcome any pull requests. Rest assured, we have noted this issue and hope to revisit it when circumstances allow.
Thank you again for your understanding and support!
I was wondering if it would be enough to include seconds in the start_time/end_time fields, which are now not supported by the policy validator (policy_json.schema.json also has a regex without support for seconds).
If so, I can suggest a pull request to allow both formats. What I don't know about the plugin is what kind of impact this might have on DateTimeRange.startDateTime/DateTimeRange.endDateTime other than it having seconds added to the time.Time object.
Unfortunately, taking seconds into consideration requires more effort than just modifying the for policy validation. As @silvestre mentioned, I guess if more people express interest in this improvement, we will consider picking it up and working on it.
I understand, changes need to be made in the policy validator to allow another date time format, but in the end it is stored as time.Time object. Therefore, to me it seems the time parsing needs to be a bit more lenient, but nothing else seems to need changing. Of course, I could be (very) wrong about that assumption.
I took the liberty to create the PR, so I still hope you would be so kind to have a look even though I understand the comments above.
Unfortunately, it's not that simple. Here are just some examples so that you get a feeling:
-
https://github.com/cloudfoundry/app-autoscaler-release/blob/035a11cbbd35cdda519b48929af56579240b2003/src/scheduler/src/main/java/org/cloudfoundry/autoscaler/scheduler/util/ScheduleJobHelper.java#L117
-
https://github.com/cloudfoundry/app-autoscaler-release/blob/035a11cbbd35cdda519b48929af56579240b2003/src/scheduler/src/main/java/org/cloudfoundry/autoscaler/scheduler/util/DateHelper.java#L15
-
and of course tests need to be adjusted/added
-
allowing seconds to be added still comes the with the fact that miliseconds are not taken into account which we are not 100% sure how the system will behave in that case
Thanks for the information, I indeed missed that.