
RFC 64: Asset garbage collection

Open kaizencc opened this issue 4 years ago • 7 comments

This is a request for comments about Asset Garbage Collection. See #64 for additional details.

APIs are signed off by @njlynch.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

kaizencc avatar Sep 03 '21 21:09 kaizencc

See https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/README.md for a working construct that does asset garbage collection.
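For anyone wanting to try it, usage is roughly the following, assuming the `ToolkitCleaner` construct name suggested by that README (check the linked docs for the authoritative API and props):

```ts
// Rough usage sketch of the cloudstructs ToolkitCleaner construct; see the
// README linked above for the actual API and configuration options.
import { App, Stack } from 'aws-cdk-lib';
import { ToolkitCleaner } from 'cloudstructs';

const app = new App();
const stack = new Stack(app, 'ToolkitCleanerStack');
new ToolkitCleaner(stack, 'ToolkitCleaner');
```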

jogold avatar Feb 01 '22 16:02 jogold

@jogold - thank you for this construct. I tried it out on my personal CDK project and it worked great! I think this is an excellent proof-of-concept and would love to see it eventually integrated into the official CDK project.

blimmer avatar Feb 04 '22 18:02 blimmer

@jogold thanks for the POC for asset garbage collection! This RFC is something that we have on our radar for later this year and when the time comes I'd be happy to iterate on this and get it into the CDK.

kaizencc avatar Mar 16 '22 20:03 kaizencc

hello @kaizencc / others, I've been following this for a long time and thought about pinging to find out if this is now being worked on, since you mentioned it would happen later in the year and we are now nearing the end of the year.

ustulation avatar Oct 24 '22 15:10 ustulation

The problem I'm currently facing is that various compliance tools flag insecure images in ECR; it turns out they are pretty similar to what the automatic ECR scans report (especially the ones labelled High severity). They are all images from the past which are no longer in use, as newer versions were pushed and used over time. However, since they all linger around, the compliance check fails because it doesn't know that we aren't using the older ones.

This of course turns out to be a big problem for orgs which need to pass such compliance checks, so we end up cross-referencing the CFn templates and manually deleting the old images that are no longer in use.
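To illustrate that manual workaround, here's a sketch of the cross-reference-and-delete step for one ECR repo, assuming the AWS SDK for JavaScript v3 (`repoName` and the `inUse` set, built by scanning the current templates first, are placeholders):

```ts
// Delete ECR images whose tags are not referenced by any current CFn template.
// CDK tags asset images with the asset hash, so matching tags against the
// collected hashes is the cross-reference.
import { ECRClient, paginateDescribeImages, BatchDeleteImageCommand } from '@aws-sdk/client-ecr';

async function deleteStaleImages(ecr: ECRClient, repoName: string, inUse: Set<string>) {
  for await (const page of paginateDescribeImages({ client: ecr }, { repositoryName: repoName })) {
    for (const img of page.imageDetails ?? []) {
      const referenced = (img.imageTags ?? []).some(tag => inUse.has(tag));
      if (!referenced && img.imageDigest) {
        await ecr.send(new BatchDeleteImageCommand({
          repositoryName: repoName,
          imageIds: [{ imageDigest: img.imageDigest }],
        }));
      }
    }
  }
}
```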

ustulation avatar Oct 24 '22 15:10 ustulation

Assuming the problems mentioned here are real and not due to my lack of knowledge (please correct me if so; I might just be overthinking all this), here's an algorithm that could work in practice:

  1. List all the stacks in the given env (account+region).
  2. For each stack, get the template body, and immediately afterwards
  3. get the stack status.
  4. If the status is one of {UPDATE_IN_PROGRESS, UPDATE_FAILED, UPDATE_ROLLBACK_IN_PROGRESS}, abort the garbage collection as it's not safe; re-run the entire cycle at some later point, or whenever the user chooses.
  5. Once all stack templates are collected, extract all the asset hashes (the usual [a-f0-9]{64}, assuming they are always lower-cased).
  6. Check the CDK staging bucket and ECR repo for all assets whose keys/tags contain any of the collected hashes, and leave those alone.
  7. For the rest, which aren't referenced by any template, delete them if they are older than X.

In the absence of a way to atomically get both the template and the stack status, this should work in practice, though it has theoretical edge cases.
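A minimal sketch of steps 1 through 5, assuming the AWS SDK for JavaScript v3 (`@aws-sdk/client-cloudformation`); error handling is elided and the helper name is mine:

```ts
import {
  CloudFormationClient,
  paginateListStacks,
  GetTemplateCommand,
  DescribeStacksCommand,
} from '@aws-sdk/client-cloudformation';

// Statuses under which the fetched template cannot be trusted (step 4).
const UNSAFE = new Set(['UPDATE_IN_PROGRESS', 'UPDATE_FAILED', 'UPDATE_ROLLBACK_IN_PROGRESS']);
const HASH_RE = /[a-f0-9]{64}/g; // CDK asset hashes (step 5)

async function collectReferencedHashes(cfn: CloudFormationClient): Promise<Set<string>> {
  const hashes = new Set<string>();
  // Step 1: list all stacks in the env (skip deleted ones).
  for await (const page of paginateListStacks({ client: cfn }, {})) {
    for (const summary of page.StackSummaries ?? []) {
      if (summary.StackStatus === 'DELETE_COMPLETE') continue;
      const name = summary.StackName!;
      // Step 2: get the template body, immediately followed by...
      const template = await cfn.send(new GetTemplateCommand({ StackName: name }));
      // Step 3: ...the stack status.
      const desc = await cfn.send(new DescribeStacksCommand({ StackName: name }));
      const status = desc.Stacks?.[0]?.StackStatus ?? '';
      // Step 4: abort the whole run if the template may yet be rolled back.
      if (UNSAFE.has(status)) {
        throw new Error(`Stack ${name} is ${status}; aborting garbage collection`);
      }
      // Step 5: extract every asset hash referenced by the template.
      for (const h of template.TemplateBody?.match(HASH_RE) ?? []) {
        hashes.add(h);
      }
    }
  }
  return hashes;
}
```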

The most interesting part here is the transition from step 2 to step 3. The only time we have a problem is when the template we got was the one currently being applied (an UPDATE_IN_PROGRESS). CFn can roll that back to a template we did not manage to collect, so we can't trust the one we did collect and must abort. This can produce a false positive for aborting: the stack may have been stable at step 2 and only entered an update between steps 2 and 3. In that case we could actually have used the template we got and carried on, but since there is no way of knowing that, we just assume whatever step 3 reports was also the state when step 2 happened. So we play it safe and abort.

UPDATE_ROLLBACK_IN_PROGRESS at step 3 is also not safe. The stack could already have been in UPDATE_ROLLBACK_IN_PROGRESS at step 2, in which case it would have been safe: the rollback template is what we got and what CFn will eventually settle on. But there is no way of knowing this. In the bad case, the stack was in UPDATE_IN_PROGRESS at step 2 and moved to UPDATE_ROLLBACK_IN_PROGRESS by step 3: the template has changed, the one you hold is no longer valid, and you could end up deleting assets referenced by the template being rolled back to (timestamps don't help; those assets might be a year old, for example, and you simply must not delete them). Since the transition from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_IN_PROGRESS happens quickly, depending on network latency etc., there is a realistic chance of hitting this race in practice between steps 2 and 3.

The same goes for UPDATE_FAILED: the UPDATE_IN_PROGRESS -> UPDATE_FAILED transition is fairly quick (even quicker than the one above). No use risking it, so abort.

Edge case:

The rest should be OK in practice. There is a theoretical chance that the latency between steps 2 and 3 was large enough that the stack went all the way from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE in between. You would then interpret the template you got in step 2 as belonging to the UPDATE_ROLLBACK_COMPLETE state you observed in step 3, and since that is a stable state (CFn will not apply any further updates by itself without the user prompting it), you would "consider" a template that CFn had actually discarded during the rollback. In practice, though, the time difference between steps 2 and 3 is milliseconds, or at worst a second or two. I've never seen a transition from UPDATE_IN_PROGRESS/UPDATE_FAILED (the actual state during step 2) through UPDATE_ROLLBACK_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE (the actual state during step 3) happen fast enough for this to be a problem; I had to put a sleep between steps 2 and 3 to simulate it.


All other cases, such as missing a stack because it was still CREATE_IN_PROGRESS after step 1, or assets that were uploaded just after you finished steps 2 and 3 for a stack that gets updated while you run step 7, can be solved by a timestamp comparison. For each environment, allow for the longest interval your pipeline takes between the asset-upload stage and the CFn changeset-execute stage in that env, and only delete assets older than this interval at the time the GC workflow runs for that env. The interval will be larger for later stages in the pipeline (or you can just set a blanket expiry of X days plus whatever time it normally takes the pipeline to reach the last stage).
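Here is a sketch of steps 6 and 7 for the staging bucket, including the age cutoff just described (the ECR side is analogous via DescribeImages/BatchDeleteImage); the bucket name and `maxAgeMs` are placeholders:

```ts
import { S3Client, paginateListObjectsV2, DeleteObjectCommand } from '@aws-sdk/client-s3';

async function deleteUnreferencedAssets(
  s3: S3Client,
  bucket: string,          // the CDK staging bucket for this env
  referenced: Set<string>, // hashes collected from all stack templates
  maxAgeMs: number,        // X: the pipeline's upload-to-deploy window, plus margin
) {
  const cutoff = Date.now() - maxAgeMs;
  for await (const page of paginateListObjectsV2({ client: s3 }, { Bucket: bucket })) {
    for (const obj of page.Contents ?? []) {
      const hash = obj.Key?.match(/[a-f0-9]{64}/)?.[0];
      // Step 6: leave referenced assets alone.
      if (hash && referenced.has(hash)) continue;
      // Step 7: delete unreferenced assets, but only if older than the cutoff,
      // which also covers stacks/assets that appeared after the scan.
      if ((obj.LastModified?.getTime() ?? Infinity) < cutoff) {
        await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: obj.Key! }));
      }
    }
  }
}
```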


This doesn't deal with the IMPORT_* family of statuses, because I've never used them personally, but I believe they should follow the same logic as the UPDATE_* statuses above.


Alternatives

  1. If there were a CFn API call to (atomically) get both the template and the status, you wouldn't have to handle some of the cases above explicitly. You would still need to either abort or poll-and-wait in the case of UPDATE_IN_PROGRESS/UPDATE_FAILED, because you would have got a template that CFn will soon discard, rolling back to one you don't have.
  2. If CFn had an API call that returned either one template (the one at which CFn "stopped" after applying, i.e. a stable state) or two templates (the one CFn is currently applying plus the one it would roll back to if the update failed), you wouldn't have to deal with this mess at all: just use both templates when available and safely delete assets not referenced by any collected template (from all stacks in the env). This would be awesome! You would still need the timestamp check for assets belonging to stacks you didn't know about during your scan etc., but that's always going to be the case unless there's an (impossible) global lock on the account.
  3. Have a different bootstrap qualifier for each pipelined project (see the sketch below). That way every stage can have an additional stack, depending on all the other stacks of the project in that stage, whose sole job is to run the GC workflow after every deployment. It will only run once the upstream stacks are updated/created successfully and are stable. Alternatively, have a post-deployment action per stage in the pipeline that does this. In either case, the pipeline locks the deployment to the stage until all actions/updates/rollbacks complete, so you don't have to care about stack status. Further, since projects are separated by bootstrap qualifier, each project gets its own dedicated bucket/ECR repo per stage, and stacks from other projects don't interfere. This would be a clean solution too, I guess. Asset duplication would be a drawback: e.g. the source code of the log-retention lambda etc. is usually shared within the same env, but would now live in per-project staging buckets. That may not be all that bad, though.
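For alternative 3, a minimal sketch of wiring a per-project qualifier into a stack (the qualifier value is illustrative and must match what was passed to cdk bootstrap):

```ts
// Give each pipelined project its own bootstrap qualifier, and thereby its own
// staging bucket and ECR repo. Bootstrap the env with a matching qualifier first:
//   cdk bootstrap --qualifier projgc1 aws://ACCOUNT/REGION
import { App, Stack, DefaultStackSynthesizer } from 'aws-cdk-lib';

const app = new App();
new Stack(app, 'MyProjectStack', {
  synthesizer: new DefaultStackSynthesizer({
    qualifier: 'projgc1', // illustrative value
  }),
});
```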

ustulation avatar Nov 04 '22 17:11 ustulation

I'm currently implementing this algorithm for my project, with an additional check that retries N times (and then aborts) for every stack where the time difference between the start of step 2 and the end of step 3 above is more than 2 seconds, to safeguard against the edge case listed above.
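Something along these lines; the fetch callback stands in for steps 2 and 3 from the algorithm above:

```ts
// Retry steps 2-3 for a stack whenever the round trip took more than 2s, since
// a slow round trip widens the race window between GetTemplate and the status
// check; abort after maxAttempts.
async function fetchWithLatencyGuard<T>(
  fetchTemplateAndStatus: () => Promise<T>, // performs steps 2 and 3 for one stack
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const start = Date.now();
    const result = await fetchTemplateAndStatus();
    if (Date.now() - start <= 2_000) return result; // fast enough to trust
    // Otherwise the status may describe a different point in time than the
    // template we fetched; try again.
  }
  throw new Error('Latency budget exceeded on every attempt; aborting garbage collection');
}
```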

ustulation avatar Nov 04 '22 18:11 ustulation