
Deletion options don't work well with long-running instances?

Open • brikis98 opened this issue 5 years ago • 1 comment

Imagine you have an instance with Name tag foo running in an Auto Scaling Group (ASG), and you're using ec2-snapper to take a snapshot of it once per day. Each time that instance crashes or is redeployed, the ASG terminates the old instance and launches a new one, with the same Name tag, to replace it.

If you try to clean up old snapshots for this instance with ec2-snapper delete --ami-name=foo, you'll run into one of the following scenarios:

  1. Simple case: The ASG has just terminated and replaced the instance, so for a short while there are two instances (one terminated, one running) with the same Name tag. The lookup by name matches both, which leads to an error.
  2. More complicated case: Let's say you had instance ID i-12345 running in the ASG for 2 weeks. You took a snapshot each day, so you now have 14 snapshots. The delete command is configured to delete snapshots older than 15 days, so nothing has been deleted yet. On day 15, instance i-12345 crashes and the ASG automatically replaces it with a new instance with ID i-67890. A few hours later, the delete command runs, and its first step is to look up the instance ID by name. The lookup for foo returns the new ID i-67890. The command then tries to clean up all snapshots for i-67890, but since that ID is new, it finds no snapshots for it and, unfortunately, never cleans up any of the snapshots for the old ID i-12345. Over time, this bug leaves behind lots of snapshots for older instance IDs.
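Both failure modes can be reproduced with a small model. This is a hypothetical, in-memory sketch, assuming the delete flow works the way it is described above (resolve the Name tag to exactly one instance ID, then filter snapshots by that ID); `Instance`, `Snapshot`, `lookup_instance_id`, and `expired_via_instance_id` are illustrative names, not ec2-snapper's actual code, and no AWS calls are made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for EC2 instances and snapshots;
# names and fields are illustrative, not ec2-snapper's actual types.
@dataclass
class Instance:
    instance_id: str
    name: str   # value of the instance's Name tag
    state: str  # "running", "terminated", ...

@dataclass
class Snapshot:
    image_id: str
    source_instance_id: str
    created: datetime

def lookup_instance_id(instances, name):
    """Resolve a Name tag to a single instance ID, ignoring instance state."""
    matches = [i.instance_id for i in instances if i.name == name]
    if len(matches) != 1:
        # Case 1: a just-terminated and a running instance share the Name tag.
        raise ValueError(f"expected 1 instance named {name!r}, found {len(matches)}")
    return matches[0]

def expired_via_instance_id(instances, snapshots, name, now, older_than):
    """Name -> current instance ID -> only that ID's expired snapshots."""
    instance_id = lookup_instance_id(instances, name)
    return [s.image_id for s in snapshots
            if s.source_instance_id == instance_id and now - s.created > older_than]

# Case 2: 14 daily snapshots of i-12345 (days 0..13), then the ASG swaps
# in i-67890 and the delete job runs a few hours into day 15.
start = datetime(2019, 7, 1)
snapshots = [Snapshot(f"ami-{d:02d}", "i-12345", start + timedelta(days=d))
             for d in range(14)]
now = start + timedelta(days=15, hours=3)
instances = [Instance("i-67890", "foo", "running")]

stranded = [s.image_id for s in snapshots if now - s.created > timedelta(days=15)]
deleted = expired_via_instance_id(instances, snapshots, "foo", now, timedelta(days=15))
print(stranded)  # ['ami-00']: old enough to delete
print(deleted)   # []: the lookup only sees i-67890
```

In case 1, passing both the terminated i-12345 and the running i-67890 (both named foo) to `lookup_instance_id` raises the error before any snapshot is considered.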

Perhaps we should tag the actual snapshots with the instance name and look up snapshots by name directly, instead of going through the instance ID intermediary?
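A rough sketch of that proposal, under stated assumptions: the instance's name is copied onto each snapshot as a tag at creation time, and the delete step selects snapshots by that tag plus age, with no instance lookup at all. The tag key `ec2-snapper-instance-name` and the helpers `create_snapshot` and `expired_by_name_tag` are hypothetical. Against real EC2, the tag could be written with the CreateTags API and the selection pushed server-side with a `tag:<key>` filter on DescribeImages; the model below stays in memory so it can run anywhere.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical tag key; any dedicated key that won't collide would do.
INSTANCE_NAME_TAG = "ec2-snapper-instance-name"

@dataclass
class Snapshot:
    image_id: str
    source_instance_id: str
    created: datetime
    tags: dict = field(default_factory=dict)

def create_snapshot(image_id, instance_id, instance_name, created):
    # Record the instance's name on the snapshot itself at creation time.
    return Snapshot(image_id, instance_id, created, {INSTANCE_NAME_TAG: instance_name})

def expired_by_name_tag(snapshots, name, now, older_than):
    """Select snapshots by their own tag; instance IDs are never consulted."""
    return [s.image_id for s in snapshots
            if s.tags.get(INSTANCE_NAME_TAG) == name and now - s.created > older_than]

start = datetime(2019, 7, 1)
# 14 daily snapshots of the original instance i-12345 (days 0..13)...
snapshots = [create_snapshot(f"ami-{d:02d}", "i-12345", "foo", start + timedelta(days=d))
             for d in range(14)]
# ...one of the replacement instance i-67890, and an old one for another name.
snapshots.append(create_snapshot("ami-new", "i-67890", "foo",
                                 start + timedelta(days=15, hours=1)))
snapshots.append(create_snapshot("ami-bar", "i-99999", "bar", start))

now = start + timedelta(days=15, hours=3)
deleted = expired_by_name_tag(snapshots, "foo", now, timedelta(days=15))
print(deleted)  # ['ami-00']
```

Because the selection keys off a tag stored on the snapshot, it no longer matters how many instances currently carry the Name tag, or whether any do, so neither of the two cases above can occur in this model.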

brikis98 • Jul 22 '19 11:07

Perhaps we should tag the actual snapshots with the instance name and look up snapshots by name directly, instead of going through the instance ID intermediary?

Yep, I think that's the right approach. This humble tool was written years ago for the use case of backing up a single, presumably persistent server. It was never really designed for the ASG use case, and the approach you propose sounds solid. PRs welcome!

josh-padnick • Jul 30 '19 22:07