cloudformation-coverage-roadmap
AWS::ElasticSearch::Domain - support in-place upgrades
- Title -> AWS::ElasticSearch::Domain
- Scope of request -> Support in-place version upgrades. Currently attempting to change the version causes replacement of the cluster.
- Expected behavior -> Apply an update - like in the console
- Category tag (optional) -> Analytics
@danieljamesscott just confirming that this is an existing property of an existing resource that you'd like to behave differently?
Yes - like for RDS version upgrades. When I change the version in the definition for RDS, the instance is upgraded. When I change the Elasticsearch version, the domain is replaced.
When the ElasticsearchVersion property of AWS::ElasticSearch::Domain changes and the change is a supported upgrade (e.g. from 6.7 to 6.8), CloudFormation should upgrade the cluster with an UpgradeElasticsearchDomain API call instead of a CreateElasticsearchDomain API call.
A proper error message must be given when the upgrade is not supported.
This must also work with 'named' clusters.
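The behaviour requested above could be sketched roughly as follows. This is only an illustration of the decision, not how CloudFormation implements it: the `Action` enum, the `plan_version_change` function and the `COMPATIBLE_TARGETS` table are hypothetical names, and the table contents are example values (a real implementation would ask the service, e.g. via GetCompatibleElasticsearchVersions).

```python
from enum import Enum


class Action(Enum):
    NO_CHANGE = "no_change"
    UPGRADE_IN_PLACE = "upgrade_in_place"
    REJECT = "reject"


# Hypothetical compatibility table, for illustration only.
# The real list of upgrade targets would come from the ES service.
COMPATIBLE_TARGETS = {
    "6.7": ["6.8"],
    "6.8": ["7.1"],
}


def plan_version_change(current, requested, compatible=COMPATIBLE_TARGETS):
    """Decide what a version change in the template should trigger."""
    if requested == current:
        return Action.NO_CHANGE, "version unchanged"
    if requested in compatible.get(current, []):
        # Supported path: upgrade the existing domain instead of replacing it.
        return (Action.UPGRADE_IN_PLACE,
                f"call UpgradeElasticsearchDomain with TargetVersion={requested}")
    # Unsupported path: fail with a clear message, never replace silently.
    return Action.REJECT, f"upgrade from {current} to {requested} is not supported"


for current, requested in [("6.7", "6.8"), ("6.8", "6.8"), ("6.8", "5.6")]:
    action, detail = plan_version_change(current, requested)
    print(current, "->", requested, ":", action.value, "-", detail)
```

Nothing in this decision depends on whether the domain has an explicit name, which is why it should apply to 'named' clusters as well.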
Agreed. We will be stuck at this version until this is supported, as we do not want to recreate the cluster. The ES service fully supports this; CFN just needs to make the correct API call and add the logic.
I see this is being worked on, but an alternative approach could also be helpful: I don't mind manually upgrading ES in the console, and then setting the new version in my template. When a new version is found in an update, but the version matches the current domain version, do not take any action.
This would help with situations where domain upgrades can take multiple hours.
For those looking for the documentation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-upgradeelasticsearchdomain
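For reference, a minimal sketch of where that attribute sits on the resource, built as a Python dict and printed as a JSON template fragment. The logical ID `SearchDomain`, the helper `domain_resource` and the version value are placeholders, not taken from this thread; `EnableVersionUpgrade` is the setting described in the linked UpdatePolicy documentation.

```python
import json


def domain_resource(version, enable_version_upgrade=True):
    """Build a template fragment for an Elasticsearch domain resource."""
    return {
        "Type": "AWS::Elasticsearch::Domain",
        "Properties": {"ElasticsearchVersion": version},
        # UpdatePolicy is a resource attribute, a sibling of Properties.
        "UpdatePolicy": {"EnableVersionUpgrade": enable_version_upgrade},
    }


# "SearchDomain" is a placeholder logical ID.
template = {"Resources": {"SearchDomain": domain_resource("6.8")}}
print(json.dumps(template, indent=2))
```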
I think there's an issue with the amount of time taken to update the cluster - I'm not sure how CF works, but the ES resource in CF needs to send a 'still going' notification back to CF. After an hour, the CF stack times out with "Updating Elasticsearch Domain did not stabilize." My cluster still shows "Upgrade processing" and now my stack is inconsistent with the state of the cluster. :(
@danieljamesscott This is exactly what I was worried about in my comment above: an upgrade can take 4 hours. Edit: Thanks for being brave and testing for the rest of us.
urgh ...
Yes, but RDS seems to handle it just fine - although it's a slightly different mechanism, there is no UpdatePolicy. It may all work out OK eventually, if I re-apply the change, once the upgrade has finished. (As always, I applied this in our test account before trying it on anything important... ;) )
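If you want to wait for the upgrade to finish before re-applying the change, a rough sketch of polling the domain outside CloudFormation might look like this. It assumes a boto3-style `es` client whose `describe_elasticsearch_domain` response carries `UpgradeProcessing` and `ElasticsearchVersion` under `DomainStatus`; `wait_for_upgrade`, `FakeEsClient` and the domain name are hypothetical, and the fake client exists only so the example runs without AWS access.

```python
import time


def wait_for_upgrade(client, domain_name, poll_seconds=60,
                     max_wait_seconds=24 * 3600, sleep=time.sleep):
    """Poll until the domain stops reporting an upgrade in progress.

    Returns the domain's version once UpgradeProcessing is false.
    """
    waited = 0
    while True:
        status = client.describe_elasticsearch_domain(
            DomainName=domain_name)["DomainStatus"]
        if not status.get("UpgradeProcessing", False):
            return status["ElasticsearchVersion"]
        if waited >= max_wait_seconds:
            raise TimeoutError(f"{domain_name} still upgrading after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeEsClient:
    """Stand-in for boto3.client("es"); reports 'processing' for N polls."""

    def __init__(self, polls_until_done, final_version):
        self.remaining = polls_until_done
        self.final_version = final_version

    def describe_elasticsearch_domain(self, DomainName):
        processing = self.remaining > 0
        self.remaining -= 1
        version = "6.7" if processing else self.final_version
        return {"DomainStatus": {"ElasticsearchVersion": version,
                                 "UpgradeProcessing": processing}}


# The sleep function is injected so the demo does not actually wait.
version = wait_for_upgrade(FakeEsClient(3, "6.8"), "my-test-domain",
                           sleep=lambda seconds: None)
print(version)  # 6.8
```

The default cap of 24 hours is an arbitrary choice for the sketch; pick whatever fits the upgrade durations you actually see.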
I don't think it's particularly important, but the cluster upgrade has just completed - took 20 hours.
Ahh, wonderful:
Failed to submit upgrade. Upgrade from 6.8 to 6.8 not supported. (Service: AWSElasticsearch; Status Code: 400; Error Code: BaseException; Request ID: 534b4933-10fb-11ea-821f-3fe3ace9e9ea)
Is there any update on this - I need the features of newer versions of ES.
As far as we've been able to investigate, the work from CloudFormation's side on this is complete. We reopened the issue to let people tell us if there are any show-stopping bugs. It doesn't appear that this is the case, other than the upgrades taking a long time. Let us know if we need to reopen this, but we'll close it for now.
@luiseduardocolon - I think the implementation is wrong, as timeouts will be very common unless you have a cluster of 1 or 2 nodes, and that's not a normal use case. I think the wait condition needs to be removed: when the version is bumped, CFN should check that the upgrade has started but not wait for it to report completion.
I'm not sure whether it's CF or ES that needs to fix it, but something is badly wrong with this. Any cluster upgrade which takes longer than the CF timeout (1hr) results in a broken stack.
Also, it would be a nice workaround if CF could accept an update to the version which checks the actual cluster version, and accepts the change if the cluster is already upgraded to that version.
As it stands, I don't see how this can be seen as 'complete'. Any 'production' cluster will surely take longer than 1hr to upgrade and end up in a bad state.
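The workaround suggested above (accept the new version if the cluster is already there) could look roughly like this. It is a sketch of the requested behaviour, not of what CloudFormation does today. It assumes a boto3-style `es` client with `describe_elasticsearch_domain` and `upgrade_elasticsearch_domain`; `apply_template_version`, `FakeEsClient` and the domain name are hypothetical, and the fake client is only there so the example runs without AWS access.

```python
def apply_template_version(client, domain_name, template_version):
    """Submit an upgrade only if the live domain is not already at the version."""
    status = client.describe_elasticsearch_domain(
        DomainName=domain_name)["DomainStatus"]
    if status["ElasticsearchVersion"] == template_version:
        # Already upgraded (e.g. manually in the console): nothing to do.
        return "no-op"
    client.upgrade_elasticsearch_domain(
        DomainName=domain_name, TargetVersion=template_version)
    return "upgrade-submitted"


class FakeEsClient:
    """Stand-in for boto3.client("es") that records upgrade requests."""

    def __init__(self, version):
        self.version = version
        self.upgrade_calls = []

    def describe_elasticsearch_domain(self, DomainName):
        return {"DomainStatus": {"ElasticsearchVersion": self.version}}

    def upgrade_elasticsearch_domain(self, DomainName, TargetVersion):
        self.upgrade_calls.append((DomainName, TargetVersion))
        return {"DomainName": DomainName, "TargetVersion": TargetVersion}


client = FakeEsClient("6.8")
print(apply_template_version(client, "my-test-domain", "6.8"))  # no-op
print(apply_template_version(client, "my-test-domain", "7.1"))  # upgrade-submitted
```

With a check like this, re-applying a template after a timed-out upgrade would not hit the "Upgrade from 6.8 to 6.8 not supported" error, because no upgrade request is sent when the versions already match.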
@luiseduardocolon I and others have not used this feature yet because the thread made it obvious the work was not complete. Please give us strong assurances that the issues experienced by others have been resolved and then maybe I will attempt it on my own cluster.
Reopening to investigate further.
As it turns out, a bug on the ElasticSearch (ES) side was discovered that caused the slow operations. The bug was fixed by the ES team. We have received guidance that we don't need to increase our timeouts since these operations should not take this long in the first place, and that existing timeouts (1 hr) are appropriate for these upgrade operations. I strongly recommend that you retry your in-place upgrades again (if possible) so you can validate that this has been resolved.
I ran a cluster upgrade yesterday and I still saw this issue. The cluster upgrade took ~1h10m, barely exceeding the timeout limit for cfn. And this was for a dev cluster, which didn’t contain many documents, nor was it scaled to the size of our prod fleet. I’m ok with ES taking long; I just want to be able to do a subsequent cfn stack deployment and not have it fail because it cannot upgrade to the same version.
I'm testing the upgrade today, and while it ran correctly on a first domain, on other, larger domains I still ran into "Updating Elasticsearch Domain did not stabilize."
After the 1hr timeout, it triggered a CloudFormation rollback (which of course also fails, with "Caution: version Downgrade is not supported during update rollback progress. Inconsistency could exist between stack template and the actual resource.")
Also a warning note for anyone who may be tempted to try: my CloudFormation stack is now stuck, as any subsequent update fails with "Failed to submit upgrade. Upgrade from 6.8 to 6.8 not supported. (Service: AWSElasticsearch; Status Code: 400; Error Code: BaseException; Request ID: ...... )"
@imgaray , @axelpavageau - have you opened support tickets for these? Let me know if you can share more info - like which region you are seeing this in, for example. Just wondering if the deployment of the fix encountered any problems.
@luiseduardocolon I have. It's been assigned and escalated but I don't have an answer to share yet. My issue is in eu-west-1. Let me know if there's anything else I can share to help. (note : I'm also available on the AWS Developers' slack if needed)
Would it be possible to support updates when the current version is the same as the supplied version? I'm concerned about applying this to any of my production ES clusters in case it fails.
@luiseduardocolon I ran into the same problem in both our production and development environments, and I opened a support ticket for them. My issue is in eu-central-1. Let me know if any other information is needed; I will be very happy to help. @axelpavageau Please share the answer when you make any progress. I have been stuck with this bug for almost 3 months now, thanks!
@akhiljain100 AWS support advised me to revert my CloudFormation change (in my case "downgrading" from 6.8 to 6.2). This doesn't change the fact that my ES domains are no longer in sync with their CloudFormation templates, nor the fact that for now I can't upgrade them anymore. However, it allows me to deploy other (non-ES) changes without encountering rollbacks.
Thanks @axelpavageau Does that mean that your resources are now no longer managed by Cloud Formation? What is the benefit of having downgraded?
@akhiljain100 My understanding is that CloudFormation only attempts to upgrade the domain version if it differs between revisions of your stack, so the main benefit of this option is that it lets your stack transition to UPDATE_COMPLETE, which unblocks the deployment of other resources that may be in the stack.
The negative side effect is drift between your ES domain and its CloudFormation definition (i.e. the real version is higher than the one defined in the template), so I wouldn't risk updating any attribute on that domain until this problem gets fixed.
@imgaray is right.
Another word of caution: I'm still waiting to hear back from support, but any subsequent update to my CloudFormation stack now takes 3 hours to complete. I fear this is a side effect of the initial ES issue.
2020-03-04 10:20:53 UTC+0100 mymainstack UPDATE_IN_PROGRESS User Initiated
2020-03-04 10:26:16 UTC+0100 substack UPDATE_COMPLETE -
2020-03-04 13:26:18 UTC+0100 mymainstack UPDATE_COMPLETE_CLEANUP_IN_PROGRESS -
2020-03-04 13:27:16 UTC+0100 mymainstack UPDATE_COMPLETE -
By the way, we are still actively chasing this issue...although I don't have a time-to-resolve yet, I am escalating this to several stakeholders internally. Please update this thread with your latest observations.