Failing svc deploy caused by nested stack changeset limit
Hi 👋🏻 We're facing an issue where our Copilot CLI-managed Fargate deployment (a load-balanced service) fails with:
> copilot svc deploy --diff-yes --name $COPILOT_API_SVC --env $COPILOT_ENV --force
✘ deploy service api to environment staging: deploy service: wait for creation of change set copilot-995e5c10-aea9-43a1-b3fb-fd9182f6a95e for stack staging-api: ResourceNotReady: failed waiting for successful resource state: ChangeSet limit exceeded for stack arn:aws:cloudformation:us-east-1:000000000000:stack/staging-api-AddonsStack-115OY9YZK7MZZ/cbc47f70-f082-11ec-bcd3-0e3807434699: Resource creation cancelled
The issue is that there are a lot of failed changesets for the nested addons stack, each with the status reason "The submitted information didn't contain changes. Submit different information to create a change set.". The nested addons stack rarely has a reason to change, since it only contains storage and some roles, but the buildup of failed changesets seems to push us over a CloudFormation quota (I think). So far, the only pointer I have found as to why this happens is this aws-cli issue: https://github.com/aws/aws-cli/issues/4534.
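To see how many of these empty changesets have piled up on the nested stack, a read-only check like the following Python sketch can help. It assumes a boto3 CloudFormation client (`boto3.client("cloudformation")`) is passed in as `client` and uses only `list_change_sets`, following `NextToken` across pages. The helper names are hypothetical, and the sketch deliberately does not try to delete anything.

```python
from typing import Any, Dict, List

# Substring of the StatusReason reported on the empty changesets in this thread.
NO_CHANGES_REASON = "didn't contain changes"


def list_all_change_sets(client: Any, stack_name: str) -> List[Dict[str, Any]]:
    """Collect every change set summary for a stack (name or ARN), following NextToken."""
    summaries: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"StackName": stack_name}
    while True:
        page = client.list_change_sets(**kwargs)
        summaries.extend(page.get("Summaries", []))
        token = page.get("NextToken")
        if not token:
            return summaries
        kwargs["NextToken"] = token


def empty_failed_change_sets(summaries: List[Dict[str, Any]]) -> List[str]:
    """Return names of FAILED change sets whose status reason says nothing changed."""
    return [
        s["ChangeSetName"]
        for s in summaries
        if s.get("Status") == "FAILED"
        and NO_CHANGES_REASON in s.get("StatusReason", "")
    ]
```

For example, `empty_failed_change_sets(list_all_change_sets(boto3.client("cloudformation"), "<nested stack ARN>"))` would list the names of the empty changesets on the addons stack.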
So far I have not found a way to delete the offending changesets to unblock the release, because the delete commands fail with:
An error occurred (ValidationError) when calling the DeleteChangeSet operation: Nested change set must be deleted from root change set
Apparently, as pointed out in https://github.com/aws/aws-cli/issues/4534#issuecomment-1532149964, the only way to remove the changesets is to create a change set from the root stack that updates the nested resource. However, I am unsure how to do this safely without creating further problems with how Copilot tracks the resources.
Can you please advise on the best way to recover the stacks to a healthy state? Since the nested addons stack will rarely contain any changes, the failed changesets seem likely to build up indefinitely. Is there a way to automatically clean up the failed changesets that don't contain changes?
Thanks 🙏🏻
Hi @raethlo! Apologies for the trouble 😞. I went ahead and tested this, and also found myself trapped in a loop:
- Tried to delete the nested stack's change set ➡️ "Nested change set must be deleted from root change set" (same as yours).
- Tried to delete the root change set ➡️ "Cannot delete ChangeSet in execution status EXECUTE_COMPLETE".
This means there is no way to delete the nested stack's change set if the root change set happens to be in a good state.
Unfortunately 😞, the only workaround I can think of right now is what you've linked: create a change set that actually does update the nested stack. For example, you can update the Tags property of a resource by adding a dummy tag, and then remove it later on. Be sure to double-check the resource's page in the CloudFormation template reference to make sure that updating its Tags property does not require "Replacement". From what I see and understand, it should be "No interruption" for the majority of AWS resources, but please double-check to make sure.
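Besides the docs, a change set itself reports whether CloudFormation plans to replace anything: each `ResourceChange` in a DescribeChangeSet response carries a `Replacement` field whose value is "True", "False" or "Conditional". The Python sketch below assumes you already have that response as a dict (for example from boto3's `describe_change_set`); `replaced_resources` is a hypothetical helper name. It only inspects one page of `Changes` (it ignores `NextToken`), and for a nested stack the nested change set has its own ID (the `ChangeSetId` on the nested stack's `ResourceChange`) that would need to be described separately.

```python
from typing import Any, Dict, List


def replaced_resources(change_set: Dict[str, Any]) -> List[str]:
    """Return logical IDs of resources that this change set may replace.

    `change_set` is assumed to be a DescribeChangeSet response in dict form.
    "Conditional" is treated as risky too, since replacement is then possible.
    """
    risky: List[str] = []
    for change in change_set.get("Changes", []):
        resource_change = change.get("ResourceChange", {})
        if resource_change.get("Replacement") in ("True", "Conditional"):
            risky.append(resource_change.get("LogicalResourceId", "<unknown>"))
    return risky
```

An empty result suggests the tag-only change is safe to execute; anything listed deserves a closer look before executing.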
> is there a way to automatically clean up the failed changesets that don't contain changes?
From what I tested above, because of the error "Cannot delete ChangeSet in execution status EXECUTE_COMPLETE", I don't see a good way for Copilot to handle this either: Copilot could attempt to delete old change sets, but those attempts would fail with that same error. I am reaching out to the CloudFormation team to understand this issue better. They may also know a better workaround than mine; I'll update this thread if we find one. Apologies for the issue 😞!!
Hey @Lou1415926, thanks for looking into this 🙏🏻 I'll try to create a change that updates the nested stack and will circle back to post whether it worked. Curious to see if the CloudFormation team has a better workaround.
@raethlo yeah let me know how it goes! I tested the workaround myself yesterday, and it was successful in my case.
I've discussed the issue with the CloudFormation engineers, and the workaround we reached is similar to what I suggested above. Instead of altering the Tags property of a resource, you could also try adding a temporary stack-level tag on the parent stack as the only change. This change will be propagated to the nested stacks.
@Lou1415926 what ended up working for us after some trial & error was adding a dynamic resource tag to the deploy.
copilot svc deploy --name api --env staging --resource-tags 'release-version=main-8ed091d-7576227310' --force
This unblocked CloudFormation and also cleaned up the built-up failed changesets. I don't know how common the issue is (it seems odd to me that it wasn't reported before, so it might be caused by something on our end), but if there is no downside to doing so, Copilot could automatically tag managed resources on deploy to avoid hitting the limit.
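A deploy script can build such a per-deploy tag so that every CI run gets a fresh value. The Python sketch below mirrors the `release-version=<branch>-<short sha>-<run id>` shape of the tag used above; the environment variable names `BRANCH`, `COMMIT_SHA` and `RUN_ID` are hypothetical placeholders for whatever your CI system exposes, and the helper names are illustrative. The command uses only the `copilot svc deploy` flags that already appear in this thread.

```python
from typing import List, Mapping


def release_tag(env: Mapping[str, str]) -> str:
    """Build a tag that differs on every deploy.

    BRANCH, COMMIT_SHA and RUN_ID are hypothetical CI variable names.
    """
    branch = env.get("BRANCH", "local")
    short_sha = env.get("COMMIT_SHA", "unknown")[:7]
    run_id = env.get("RUN_ID", "0")
    return f"release-version={branch}-{short_sha}-{run_id}"


def deploy_args(service: str, env_name: str, tag: str) -> List[str]:
    """Argument list for `copilot svc deploy` with the dynamic resource tag."""
    return [
        "copilot", "svc", "deploy",
        "--name", service,
        "--env", env_name,
        "--resource-tags", tag,
        "--force",
    ]
```

From a deploy script, `subprocess.run(deploy_args("api", "staging", release_tag(os.environ)), check=True)` would run the command. Because stack-level tags propagate to taggable resources, each deploy then also updates those tags; based on the report above, that is what keeps the nested stack from producing empty changesets, but it is worth confirming the extra updates are acceptable for your resources.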
Anyways, thanks for the help 🙏🏻
This issue is stale because it has been open 60 days with no response activity. Remove the stale label, add a comment, or this will be closed in 14 days.
This issue is closed due to inactivity. Feel free to reopen the issue if you have any further questions!