deploykit [Question] Using autoscaling API with infrakit.

Hi,

I'm looking at Infrakit for managing our own instances. If I have to create 1000 or more instances with Infrakit, the infrakit-instance plugin makes at least 1000 API requests. Because GCE and AWS rate-limit their APIs, those requests get throttled and have to be retried before they succeed. I think it would be better to use the cloud providers' autoscaling APIs. What do you think about using Infrakit for large-scale instance management? In the current plugin design, one Instance.Provision call provisions exactly one instance.

Thanks, MS

Dec 11 '16 11:12 anarcher

Hi @anarcher! In my opinion, having Instance.Provision provision only one instance keeps the plugin's responsibility clear. So an Aws-AutoScaleGroup-instance-plugin, or autoscaling configuration in the Terraform layer, would be a better way to solve the API rate limit problem.

Dec 11 '16 15:12 YujiOshima

Hi @YujiOshima,

Thanks for the reply. :-) With the current instance plugin interface, an AWS-ASG-Instance-plugin would need a buffer or queue of provision requests to reduce the number of AWS API requests (e.g. set DesiredSize once after some duration). Is that right?
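A minimal sketch of the buffering idea in this comment, assuming a hypothetical `desiredSizeSetter` callback in place of the real provider call. None of these names exist in Infrakit or the AWS SDK; they only illustrate how many one-instance requests could collapse into one size update.

```go
package main

import (
	"fmt"
	"sync"
)

// desiredSizeSetter is a hypothetical stand-in for the provider call that
// sets a scaling group's desired size. It is not a real SDK interface.
type desiredSizeSetter func(size int) error

// coalescer buffers individual provision requests and applies them together.
type coalescer struct {
	mu      sync.Mutex
	current int // size last applied to the scaling group
	pending int // provision requests buffered since the last flush
	set     desiredSizeSetter
}

// Provision only records the request; no provider API call is made here.
func (c *coalescer) Provision() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending++
}

// Flush applies all buffered requests with a single provider call.
// A caller would invoke it on a timer (the "after some duration" part).
func (c *coalescer) Flush() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.pending == 0 {
		return nil
	}
	target := c.current + c.pending
	if err := c.set(target); err != nil {
		return err // pending is kept, so the next flush tries again
	}
	c.current, c.pending = target, 0
	return nil
}

func main() {
	apiCalls := 0
	c := &coalescer{set: func(size int) error { apiCalls++; return nil }}
	for i := 0; i < 1000; i++ {
		c.Provision()
	}
	if err := c.Flush(); err != nil {
		fmt.Println("flush failed:", err)
	}
	fmt.Println("size:", c.current, "provider calls:", apiCalls)
}
```

In this toy run, 1000 buffered requests result in a single provider call. The sketch does not return instance IDs, which is one reason it does not fit the one-instance Provision semantics cleanly.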

Thanks, MS

Dec 11 '16 16:12 anarcher

@anarcher @YujiOshima

The semantics of Instance.Provision is to provision one instance at a time. Using an instance plugin that wraps an AWS ASG (autoscaling group) would be confusing because there will now be two controllers -- the running InfraKit default group plugin that uses this instance plugin and the platform's scaling group controller.
It's possible to provide an alternate implementation of the Group plugin backed by an AWS ASG. This however would have some differences in how volumes are managed. There isn't a reference implementation yet, but we are thinking of creating one for AWS ASG, so any help or interest in this is welcome.
Because we'd like to keep writing Instance plugins simple, buffering/queueing Instance.Provision calls and handling retries for retry-able errors really should be implemented in the default Group plugin. Currently the scaler code doesn't take these conditions into account, but adding proper handling of errors and retries there will hopefully solve the problem for all platforms and keep the instance plugin implementations simple.
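A rough sketch of what retry handling around a single-instance provision call could look like. `ProvisionFunc` and `ErrRateLimited` are hypothetical names used only for illustration; this is not the scaler code and not what any Infrakit release implements.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrRateLimited is a hypothetical sentinel for a retry-able provider error.
var ErrRateLimited = errors.New("rate limited")

// ProvisionFunc stands in for one Instance.Provision call (simplified signature).
type ProvisionFunc func() (string, error)

// provisionWithRetry retries only on ErrRateLimited, doubling the wait
// between attempts. Any other error is returned immediately.
func provisionWithRetry(p ProvisionFunc, maxAttempts int, base time.Duration) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	wait := base
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		id, err := p()
		if err == nil {
			return id, nil
		}
		if !errors.Is(err, ErrRateLimited) {
			return "", err // not retry-able
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return "", fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// A fake instance plugin that is throttled twice before succeeding.
	flaky := func() (string, error) {
		calls++
		if calls < 3 {
			return "", ErrRateLimited
		}
		return "instance-1", nil
	}
	id, err := provisionWithRetry(flaky, 5, time.Millisecond)
	fmt.Println(id, err, calls)
}
```

Keeping this loop in the group controller means the instance plugin only has to report whether an error is retry-able, which matches the goal of keeping instance plugins simple.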

Thoughts?

Dec 11 '16 23:12 chungers

Thanks for the detailed explanation @chungers :-)

In my experience, when managing instances with an ASG, I did not need any complicated volume management (if I needed volumes, I simply tied the volume life cycle to the instance life cycle).

As for complex volume management, I think one way is to manage container-specific volumes with a Docker volume driver.

Do you think the AutoScaling-Group plugin would have a different structure that cannot be used with the existing Instance plugin? (But could it be used with a flavor plugin?)

My first thought (before your reply) was that if the interface were Instance.Provision(specs []Spec) ([]*ID, error), an ASG-Instance-Plugin might not be bad. But after you mentioned it, I realized that the ASG-Instance-Plugin and the Group plugin would basically be two group controllers.

Another question: if I write a scheduler for multi-group handling, should it be an external module that builds and sends the entire configuration JSON to Group.Commit? (e.g. AllocationMethod.Size together with the Instance plugin configuration? I would like to change only AllocationMethod.Size.)

+1 to the default Group plugin buffering/queueing instance provision requests :-)

Dec 12 '16 05:12 anarcher

Thank you @chungers! I agree that two controllers would be confusing and that implementing autoscaling in a group plugin is better. But I also wonder whether an autoscaling group plugin would need a specific instance plugin.

+1 to the default Group plugin buffering/queueing instance creation and handling errors and retries.

Dec 12 '16 06:12 YujiOshima

A big thank you to @YujiOshima for #332. This PR addresses the issue of rate limited provider API calls and makes the default group plugin useable for managing large groups of instances.

@anarcher - for multi-group handling... have a look at https://github.com/docker/infrakit/blob/master/pkg/manager/spec.go#L11 Note that we have a spec that allows groups to be specified in a single JSON. Currently the manager exposes a Group API but I think we need to introduce a new 'manager API' that accepts a global JSON config that can include groups and resources (see #290). In terms of direction for future, I think this single entity will become the endpoint for handling of multiple groups, instead of overloading the current group and instance API. This single entity (manager) will also be responsible for starting plugins based on the user specification (similar to Docker Compose) and then dispatch the group specs to each group plugin (via their commit method). So the manager acts like a coordinator for a set of group plugins that are referenced in the global JSON spec.

With this multi-group handling, we can also provide a group plugin that is backed by the provider's scaling group (e.g. AWS ASG). This way, it's possible to mix-and-match groups that are managed by different controllers (InfraKit group or ASG) without the confusing case of two controllers fighting to manage a single group.
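To make the coordinator idea concrete, here is a small sketch of a manager that reads one global JSON document and hands each group to the plugin it names. The `GlobalSpec` shape, the `GroupPlugin` interface, and the plugin names are assumptions for illustration; the actual schema is the one in pkg/manager/spec.go and may look quite different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GlobalSpec is a hypothetical shape for the single JSON document.
type GlobalSpec struct {
	Groups map[string]GroupSpec `json:"Groups"`
}

// GroupSpec names the group plugin to use and carries its opaque properties.
type GroupSpec struct {
	Plugin     string          `json:"Plugin"`
	Properties json.RawMessage `json:"Properties"`
}

// GroupPlugin is a simplified stand-in for a group plugin's commit method.
type GroupPlugin interface {
	Commit(groupID string, properties json.RawMessage) error
}

// dispatch parses the global spec and commits each group to its plugin.
func dispatch(doc []byte, plugins map[string]GroupPlugin) error {
	var spec GlobalSpec
	if err := json.Unmarshal(doc, &spec); err != nil {
		return fmt.Errorf("parse global spec: %w", err)
	}
	for id, g := range spec.Groups {
		p, ok := plugins[g.Plugin]
		if !ok {
			return fmt.Errorf("group %q: unknown plugin %q", id, g.Plugin)
		}
		if err := p.Commit(id, g.Properties); err != nil {
			return fmt.Errorf("group %q: %w", id, err)
		}
	}
	return nil
}

// recorder is a fake plugin that remembers which groups it was given.
type recorder struct{ committed []string }

func (r *recorder) Commit(groupID string, _ json.RawMessage) error {
	r.committed = append(r.committed, groupID)
	return nil
}

func main() {
	doc := []byte(`{"Groups": {
		"workers": {"Plugin": "group-default", "Properties": {"Size": 5}},
		"cattle":  {"Plugin": "group-asg",     "Properties": {"Size": 100}}
	}}`)
	infrakit, asg := &recorder{}, &recorder{}
	err := dispatch(doc, map[string]GroupPlugin{"group-default": infrakit, "group-asg": asg})
	fmt.Println(err, infrakit.committed, asg.committed)
}
```

Each group is owned by exactly one plugin, so an InfraKit-managed group and an ASG-backed group can sit in the same document without two controllers acting on the same instances.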

Thoughts?

Dec 12 '16 22:12 chungers

I agree with your idea of multi-group handling with a manager, and I like it. I also think that the Group Plugin API is not enough and a Manager API is needed. If I change only one part of the settings, it is inconvenient to pass the entire configuration. (A good interface might apply only the changes made, like a git commit changeset.)

Thank you for your kind reply @chungers

Dec 14 '16 05:12 anarcher

I have some more questions...

Instance creation incurs costs. If there is a configuration issue or some other problem, the instance runs for a while, but Infrakit will endlessly recreate it because flavor.Health decides the instance is "Unhealthy". That is a useless cost. Is there any way to prevent this? (e.g. a max retry count? A circuit breaker?) On AWS, instances are billed per hour, and partial hours are billed as full hours.
Does the Infrakit Group have a status? (e.g. Converged bool is not enough IMO).
And What do you think about Group Commit ID? And Logging with commitID?
And Metric about Infrakit
IMO, Commit ID , metric, Group status are useful for infrastructure monitoring and alerting.

ps) I'm sorry to bother you if my questions are meaningless. :)

Dec 19 '16 11:12 anarcher

I have some ideas about the points you've raised. They are all very insightful and valuable feedback. Thank you @anarcher

Instance creation incurs costs. If there is a configuration issue or some other problem, the instance runs for a while, but Infrakit will endlessly recreate it because flavor.Health decides the instance is "Unhealthy". That is a useless cost. Is there any way to prevent this?

Some ideas here:

Introduce a cap to the max group size in the schema. This may be easy to implement and can be found in many scaling group implementations. Still, if the max is set too high, unnecessary costs will still be incurred.
Improve the health check feedback loop with retries, etc. as you suggested. This won't help you in the case of bad configurations where the instances are set up incorrectly (thus always failing health checks) or when there are problems with flavor (always returning error).
Add throttling / backoff in the group controller. The group controller checks for convergence periodically and in each period, new instances are created if reality doesn't match desired state (e.g. size of group). By comparing the number of instances we provisioned in the last period vs what we need to provision in this time period, we should be able to determine if we are in a run-away situation. Obviously the provider's API throttling would have to be taken into account. Still, this seems like a pretty good place to investigate and come up with improvement here. This will make the group plugin more robust and will hopefully be immune to potential problems with instance / flavor plugins or bad configurations. Thoughts??
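A sketch of how the size cap and the period-over-period comparison could fit together. The `runawayGuard` type and its fields are hypothetical knobs, not existing InfraKit settings, and the heuristic is deliberately crude: it ignores provider API throttling and assumes `existing` already counts instances that are still starting up.

```go
package main

import "fmt"

// runawayGuard decides how many instances the group controller may create
// in the current period.
type runawayGuard struct {
	maxSize     int // hard cap on group size
	maxStrikes  int // suspicious periods tolerated before creation stops
	lastCreated int // instances requested in the previous period
	strikes     int // consecutive periods where the gap did not shrink
}

// tripped reports whether the guard has stopped instance creation.
func (g *runawayGuard) tripped() bool { return g.strikes >= g.maxStrikes }

// plan returns how many instances to create given desired and existing counts.
func (g *runawayGuard) plan(desired, existing int) int {
	if desired > g.maxSize {
		desired = g.maxSize
	}
	need := desired - existing
	if need <= 0 {
		g.lastCreated, g.strikes = 0, 0 // converged: clear all history
		return 0
	}
	// We created instances last period and still need at least as many:
	// they are not sticking (bad config, failing health checks, ...).
	if g.lastCreated > 0 && need >= g.lastCreated {
		g.strikes++
	} else {
		g.strikes = 0
	}
	if g.tripped() {
		return 0 // stop spending until the gap shrinks or the group converges
	}
	g.lastCreated = need
	return need
}

func main() {
	g := &runawayGuard{maxSize: 10, maxStrikes: 2}
	// Simulate instances that never become healthy: existing stays at 0.
	for period := 1; period <= 4; period++ {
		fmt.Printf("period %d: create %d\n", period, g.plan(3, 0))
	}
}
```

In this simulation the guard allows two rounds of creation and then returns 0 until the group makes progress. A legitimate scale-up right after a provisioning round can also earn a strike, so `maxStrikes` would need to be tuned against the controller's polling interval.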

I think this topic here is important -- can you open an issue and we can discuss there?

Does the Infrakit Group have a status? (e.g. Converged bool is not enough IMO).

What sort of information are you looking for? Like more detail counts and plans?

And What do you think about Group Commit ID? And Logging with commitID? And Metric about Infrakit

How do you plan to use commit ID? Can you give more information?

Dec 19 '16 20:12 chungers

Does the Infrakit Group have a status? (e.g. Converged bool is not enough IMO). What sort of information are you looking for? Like more detail counts and plans?

IMO, some kind of state machine could be helpful, e.g. Converged, Unconverged, Permanent failure? Detailed counts would help too. I don't have a detailed idea yet.. :-(
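One possible reading of the three states named above, as a sketch only (the middle state is taken to mean "not yet converged"). The `GroupStatus` type, the `next` function, and the failure threshold are all assumptions; Infrakit does not define them.

```go
package main

import "fmt"

// GroupStatus is a hypothetical richer replacement for a Converged bool.
type GroupStatus int

const (
	Converged GroupStatus = iota
	Unconverged
	PermanentFailure
)

func (s GroupStatus) String() string {
	switch s {
	case Converged:
		return "Converged"
	case Unconverged:
		return "Unconverged"
	case PermanentFailure:
		return "PermanentFailure"
	}
	return "Unknown"
}

// next derives the status after one reconcile pass from simple counts.
// PermanentFailure is sticky: in this sketch only a fresh commit, which
// would start again from Unconverged, leaves it.
func next(prev GroupStatus, desired, healthy, failedAttempts, maxFailed int) GroupStatus {
	switch {
	case prev == PermanentFailure:
		return PermanentFailure
	case healthy == desired:
		return Converged
	case failedAttempts >= maxFailed:
		return PermanentFailure
	default:
		return Unconverged
	}
}

func main() {
	s := Unconverged
	s = next(s, 3, 1, 0, 5)
	fmt.Println(s)
	s = next(s, 3, 1, 5, 5)
	fmt.Println(s)
}
```

The counts passed to `next` (desired, healthy, failed attempts) are also the kind of detail that could be exposed alongside the state for monitoring and alerting.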

And What do you think about Group Commit ID? And Logging with commitID? And Metric about Infrakit How do you plan to use commit ID? Can you give more information?

Some random ID or an OpenTracing-style ID (span ID?). I don't have a detailed idea about it. T_T

Thanks for the kind reply.

Dec 25 '16 13:12 anarcher
