app-autoscaler-release icon indicating copy to clipboard operation
app-autoscaler-release copied to clipboard

Scheduled autoscaling scales down to default instance_min_count around midnight

Open martinvisser opened this issue 1 year ago • 12 comments

Using the following recurring schedule our application scales up at 00:00 on the 19th day of the month. We want to keep running with that same amount of instances until the end of the 27th day of the month. And we want to do that each month.

But what happens is that every night, at around 00:06 we have autoscaling events. We think this problem is due to the 00:00-23:59 start_time/end_time configuration which is round down to the minute. So, basically there is one minute between 23:59 and 0:00 where the autoscaler "thinks" there is no recurring schedule. If not mistaken, 00:06 makes sense due to due cool down/breach settings.

Adding seconds, like 23:59:59 isn't allowed according to the API. There also doesn't seem to be an option to configure it like "from the 19th day 00:00 until the 27th day 23:59.

Any suggestions?

{
    "instance_min_count": 28,
    "instance_max_count": 36,
    "scaling_rules": [
        {
            "metric_type": "throughput",
            "breach_duration_secs": 300,
            "threshold": 60,
            "operator": "<",
            "cool_down_secs": 300,
            "adjustment": "-1"
        },
        {
            "metric_type": "throughput",
            "breach_duration_secs": 60,
            "threshold": 90,
            "operator": ">=",
            "cool_down_secs": 60,
            "adjustment": "+1"
        }
    ],
    "schedules": {
        "timezone": "Europe/Amsterdam",
        "recurring_schedule": [
            {
                "start_time": "00:00",
                "end_time": "23:59",
                "days_of_month": [
                    19,
                    20,
                    21,
                    22,
                    23,
                    24,
                    25,
                    26,
                    27
                ],
                "instance_min_count": 36,
                "instance_max_count": 36
            }
        ]
    }
}

martinvisser avatar Nov 20 '24 12:11 martinvisser

Hi martinvisser, you've identified an intriguing corner case. To aide the investigation, can you please send us the complete scaling histories of the days 19 and 20. Kind regards, Susanne

salzmannsusan avatar Nov 21 '24 08:11 salzmannsusan

These are the events from the last days:

2024-11-21T00:00:44.00+0100   audit.app.process.ready   web                                                                       index: 35, cell_id: 2d776afa-ce2e-4f06-bafd-e79a4a7c5a45, instance: 2ef3ab10-ba1f-4d29-6c36-ae7f
2024-11-21T00:00:01.00+0100   audit.app.process.scale                                                                             instances: 36
2024-11-20T23:59:53.00+0100   audit.app.process.scale                                                                             instances: 35
2024-11-20T00:00:45.00+0100   audit.app.process.ready   web                                                                       index: 35, cell_id: 48e574ba-10a7-4d35-8236-95aec1673ac1, instance: ffff8f4d-c933-405c-4cbf-3869
2024-11-20T00:00:01.00+0100   audit.app.process.scale                                                                             instances: 36
2024-11-19T23:59:53.00+0100   audit.app.process.scale                                                                             instances: 35

martinvisser avatar Nov 21 '24 08:11 martinvisser

@martinvisser could you please share the relevant output of cf autoscaling-history that'd be awesome :)

https://github.com/cloudfoundry/app-autoscaler-cli-plugin?tab=readme-ov-file#examples-5

geigerj0 avatar Nov 22 '24 07:11 geigerj0

This is from this night:

Scaling Type     	Status        	Instance Changes     	Time                          	Action                                                        	Error
scheduled        	succeeded     	35->36               	2024-11-22T00:00:00+01:00     	+1 instance(s) because limited by min instances 36
dynamic          	succeeded     	36->35               	2024-11-21T23:59:52+01:00     	-1 instance(s) because throughput < 60rps for 300 seconds

martinvisser avatar Nov 22 '24 07:11 martinvisser

Thanks, that is very helpful to ensure that our assumptions are aligned with the observed behaviour on your side.

As said, you identified a very nice corner case, congratulations 👼 👍.

@silvestre any insights that you'd like to share when it comes to the priority of improving here?

geigerj0 avatar Nov 22 '24 07:11 geigerj0

Haha, thanks, I guess 🤣

martinvisser avatar Nov 22 '24 07:11 martinvisser

Thank you for taking the time to bring this to our attention.

We truly appreciate your detailed observations and feedback. You are absolutely correct in identifying the missing functionality.

At present, our direct users do not heavily rely on the schedules feature, which means we are unable to prioritize this issue immediately. However, we genuinely value community contributions, and if you or anyone else is interested in addressing this, we warmly welcome any pull requests. Rest assured, we have noted this issue and hope to revisit it when circumstances allow.

Thank you again for your understanding and support!

silvestre avatar Nov 26 '24 17:11 silvestre

I was wondering if it would be enough to include seconds in the start_time/end_time fields, which are now not supported by the policy validator (policy_json.schema.json also has a regex without support for seconds).

If so, I can suggest a pull request to allow both formats. What I don't know about the plugin is what kind of impact this might have on DateTimeRange.startDateTime/DateTimeRange.endDateTime other than it having seconds added to the time.Time object.

martinvisser avatar Feb 18 '25 10:02 martinvisser

Unfortunately, taking seconds into consideration requires more effort than just modifying the for policy validation. As @silvestre mentioned, I guess if more people express interest in this improvement, we will consider picking it up and working on it.

geigerj0 avatar Feb 20 '25 16:02 geigerj0

I understand, changes need to be made in the policy validator to allow another date time format, but in the end it is stored as time.Time object. Therefore, to me it seems the time parsing needs to be a bit more lenient, but nothing else seems to need changing. Of course, I could be (very) wrong about that assumption. I took the liberty to create the PR, so I still hope you would be so kind to have a look even though I understand the comments above.

martinvisser avatar Feb 21 '25 08:02 martinvisser

Unfortunately, it's not that simple. Here are just some examples so that you get a feeling:

  • https://github.com/cloudfoundry/app-autoscaler-release/blob/035a11cbbd35cdda519b48929af56579240b2003/src/scheduler/src/main/java/org/cloudfoundry/autoscaler/scheduler/util/ScheduleJobHelper.java#L117

  • https://github.com/cloudfoundry/app-autoscaler-release/blob/035a11cbbd35cdda519b48929af56579240b2003/src/scheduler/src/main/java/org/cloudfoundry/autoscaler/scheduler/util/DateHelper.java#L15

  • and of course tests need to be adjusted/added

  • allowing seconds to be added still comes the with the fact that miliseconds are not taken into account which we are not 100% sure how the system will behave in that case

geigerj0 avatar Feb 21 '25 11:02 geigerj0

Thanks for the information, I indeed missed that.

martinvisser avatar Feb 21 '25 12:02 martinvisser