AutoSpotting
Optional feature: terminate on-demand instance before starting replacement spot
Issue type
- Feature Idea
Summary
In some specific use cases, you might not want to have more instances in your cluster than the Desired number in the ASG. Currently you get Desired+1 for a certain amount of time, since AutoSpotting launches the replacement spot instance before terminating the on-demand instance. It would be nice if this behaviour were selectable.
Thanks for the proposal.
It's next to impossible to get the replacement timing perfect, so you either get capacity overlaps or capacity drops. In practice the drops are typically longer and riskier than the overlaps, which is why the initial implementation was hard-coded to overlapping capacity.
I guess the main use case when the overlaps are harmful is for stateful instances, such as those that need to attach a persistent EBS volume at boot time.
But even for those I think there is a good enough workaround: repeatedly attempt to attach the volume from the running spot instance, which will only succeed once the volume is detached from the previous on-demand instance.
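The retry-attach workaround could be sketched as follows. This is a minimal illustration, not part of AutoSpotting itself; the volume ID, instance ID, device name, and retry parameters are all placeholders:

```python
import time

def attach_with_retry(attach_fn, max_attempts=30, delay=10):
    """Call attach_fn until it stops raising, or give up.

    attach_fn is expected to raise while the volume is still attached
    to the old on-demand instance (as boto3's attach_volume does for
    an in-use volume) and to return normally once the volume is free.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attach_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # volume never became free within the retry budget
            time.sleep(delay)

# With boto3 (IDs and device name are placeholders):
# import boto3
# ec2 = boto3.client("ec2")
# attach_with_retry(lambda: ec2.attach_volume(
#     VolumeId="vol-0123456789abcdef0",
#     InstanceId="i-0123456789abcdef0",
#     Device="/dev/xvdf"))
```

Run from the spot instance's boot script, this blocks until the on-demand instance releases the volume, so the hand-off needs no coordination beyond the detach itself.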
I am open to this feature as long as you can come up with a use case that has no such workaround. The implementation should be relatively easy, but I would rather not build it before there's a real use case that can't be accommodated otherwise.
It's clearly an optimization feature, so as such I don't think there will be a scenario where it absolutely can't be done otherwise. But that's the point of an optimization.
An alternative or additional feature might be to find a way to communicate which on-demand instance will be destroyed to the new spot instance (using tags?). That way, the new spot instance can proactively do some work on the cluster (cleaning / detaching / draining / modifying configuration) to prepare for the on-demand instance's removal and prevent overlaps. I could probably work with that.
I don't think we should pass this information; it can be determined easily.
The algorithm actually assumes that the original on-demand instance may no longer exist, so it will terminate any of the on-demand instances it can find at the moment in the same availability zone.
This termination logic could be easily skipped when so configured, and delegated to the spot instance which would have to implement it as it sees fit. The only constraint is the availability zone, in order to avoid rebalancing actions.
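The instance-side selection could stay as simple as the current logic: pick any on-demand instance in the same availability zone. A minimal sketch, assuming the instance descriptions come from EC2's DescribeInstances response (where spot instances report `InstanceLifecycle` as `"spot"` and on-demand instances have no such field); the function name is hypothetical:

```python
def pick_on_demand_victim(instances, az):
    """Return the ID of any on-demand instance in the given AZ, or None.

    instances: list of EC2 instance descriptions as returned by
    DescribeInstances; on-demand instances have no InstanceLifecycle
    key, while spot instances report InstanceLifecycle == "spot".
    """
    for inst in instances:
        in_same_az = inst.get("Placement", {}).get("AvailabilityZone") == az
        is_on_demand = "InstanceLifecycle" not in inst
        if in_same_az and is_on_demand:
            return inst["InstanceId"]
    return None

# The spot instance would then terminate the selected instance once the
# application hand-off is done, e.g. with boto3:
# boto3.client("ec2").terminate_instances(InstanceIds=[victim])
```

Keeping the AZ constraint in this instance-side logic matters for the reason stated above: terminating in a different AZ would trigger the ASG's rebalancing actions.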
So you're saying that you would delegate the choice of which on-demand instance to terminate, and the actual termination, to the new spot instance? That could work, I guess.
Yes, exactly.
When configured in this mode Autospotting would only launch the spot instances, it would no longer handle the attach/detach actions, all this logic would need to be done from the instances, which can coordinate to achieve a seamless transition, with constant capacity.
Although having to re-implement and duplicate the selection/termination logic in every booted spot instance feels like a burden, it could work. +1 then.
If you end up having a REST API somewhere for AutoSpotting, as you talk about in another ticket, it would make this much easier: just hit the correct URL when ready, specifying which on-demand instance to terminate (the ID of which you probably got from a previous call to that REST API).
I guess the logic could be extracted under a library and used to build a dedicated tool.
The REST API would work as well but may not be perfect. The best place to implement this is on the instances because they are aware of the state of the application from both the spot and on demand instances, in real-time.
This is no longer an issue: the current implementation terminates instances to trigger their replacement with Spot and then replaces them within seconds of booting up, without having instances running outside the cluster.
@wegel I'd love to have a chat with you regarding other issues and feedback you may have. Let me know if you're interested.