quickstart-microsoft-sql

Failure to create Stack - keep hitting Waitcondition timeouts

Open aelsnzhvr opened this issue 4 years ago • 8 comments

We have run this Quick Start a number of times and keep getting failures from timeouts on the WaitCondition. We tried SQL Server 2019 with the option to create a new VPC and a clean setup, but we cannot get the SQL stack to create successfully.

The main stack fails on the SQLStack with:

The following resource(s) failed to create: [SSMWaitCondition]

When you review that stack you get the event failure as:

SSMWaitCondition - Received FAILURE signal with UniqueId: ....ec2 instance id
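For anyone retracing this step outside the console, here is a minimal sketch of pulling the failed events from a stack with boto3. It assumes boto3 is installed and AWS credentials are configured; the function names are made up for illustration, and the stack name you pass in is whatever your nested SQL stack is called.

```python
def failed_events(events):
    """Return (logical id, status, reason) for every event whose status ends in _FAILED."""
    return [
        (e["LogicalResourceId"], e["ResourceStatus"], e.get("ResourceStatusReason", ""))
        for e in events
        if e["ResourceStatus"].endswith("_FAILED")
    ]


def fetch_stack_events(stack_name):
    """Collect all events for one stack (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so failed_events() can be used without it

    cfn = boto3.client("cloudformation")
    events = []
    paginator = cfn.get_paginator("describe_stack_events")
    for page in paginator.paginate(StackName=stack_name):
        events.extend(page["StackEvents"])
    return events


def print_failures(stack_name):
    """Print one line per failed resource in the given stack."""
    for logical_id, status, reason in failed_events(fetch_stack_events(stack_name)):
        print(f"{logical_id}: {status}: {reason}")
```

Calling print_failures with the name or ID of the nested SQL stack should list SSMWaitCondition along with the reason CloudFormation recorded for it.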

In CloudWatch we found a log group with a log stream containing <instance-id>/aws-runPowerShellScript/stderr, specific to the EC2 instance mentioned above.

Below is a quick summary of the output: [image]
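Since the screenshot is hard to search, here is a rough sketch of dumping the same stderr stream as text with boto3. The log group name is not given anywhere in this thread, so it is left as a parameter; the function names are hypothetical, and the match on the stream name assumes it contains the instance ID and ends with aws-runPowerShellScript/stderr as described above.

```python
STDERR_SUFFIX = "aws-runPowerShellScript/stderr"


def stderr_streams(stream_names, instance_id):
    """Keep only the stderr streams that belong to the given instance."""
    return [
        name
        for name in stream_names
        if instance_id in name and name.endswith(STDERR_SUFFIX)
    ]


def print_stderr(log_group, instance_id):
    """Print stderr messages for one instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so stderr_streams() can be used without it

    logs = boto3.client("logs")
    names = []
    paginator = logs.get_paginator("describe_log_streams")
    for page in paginator.paginate(logGroupName=log_group):
        names.extend(stream["logStreamName"] for stream in page["logStreams"])

    for name in stderr_streams(names, instance_id):
        # Only the first page of events is read here; long streams need nextForwardToken.
        response = logs.get_log_events(
            logGroupName=log_group, logStreamName=name, startFromHead=True
        )
        for event in response["events"]:
            print(event["message"])
```

Pasting that text into the issue instead of an image would make the PowerShell error easier to diagnose.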

In short, we just want to confirm that the Quick Start still works with the AMI images and scripts specified.

We tried various options with 2019 and cannot get a working stack. After a few hours things time out and rollback begins.

Any advice on best places to help track down what is happening would be much appreciated.

Tried with 2017 and that did not work either.

Got timeout errors again in the Systems Manager Automation scripts:

[image]

aelsnzhvr, Aug 08 '21 00:08

Here is a link to the Quickstart on Amazon.

dbkranes, Aug 08 '21 00:08

https://aws-quickstart.s3.amazonaws.com/quickstart-microsoft-sql/templates/sql-main.template.yaml

dbkranes, Aug 08 '21 00:08

I reached out to Datavail, the AWS Advanced Partner listed on the AWS Quick Start, by both email and phone, and got no follow-up, assistance, or resolution. Even using only the basic template defaults for SQL Server with Always On replication, and even with bigger EC2 instances, deployments with SQL 2017 or 2019 usually hit some sort of error and won't complete.

The best I was able to accomplish with the AWS SQL AG template was to turn off rollback before the initial creation: on the second page (Configure stack options), expand Stack creation options and check Disabled under Rollback on failure. This will at least allow you to manually pick up where the template automation failed. It doesn't fix any of the issues with the Quick Start's SQL template or underlying scripts, but it gets you to a state that has most of the AWS networking and Windows OS part completed.
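If you launch from a script rather than the console, the same setting is the DisableRollback flag on create_stack. A minimal sketch with boto3, using the template URL posted earlier in this thread. The helper names are hypothetical, the parameter keys depend on the template and are not listed here, and the capabilities list is an assumption (it acknowledges IAM resources and nested-stack macros, which this template may or may not need).

```python
TEMPLATE_URL = (
    "https://aws-quickstart.s3.amazonaws.com/"
    "quickstart-microsoft-sql/templates/sql-main.template.yaml"
)


def create_stack_args(stack_name, parameters):
    """Build create_stack keyword arguments with rollback on failure disabled."""
    return {
        "StackName": stack_name,
        "TemplateURL": TEMPLATE_URL,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: acknowledging all three capabilities is acceptable for this template.
        "Capabilities": [
            "CAPABILITY_IAM",
            "CAPABILITY_NAMED_IAM",
            "CAPABILITY_AUTO_EXPAND",
        ],
        # Equivalent to selecting Disabled under "Rollback on failure" in the console.
        "DisableRollback": True,
    }


def create_stack(stack_name, parameters):
    """Launch the stack and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so create_stack_args() can be used without it

    cfn = boto3.client("cloudformation")
    return cfn.create_stack(**create_stack_args(stack_name, parameters))["StackId"]
```

With rollback disabled, the failed resources stay in place, so remember to delete the stack yourself when you are done inspecting it.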

Good luck

cfendrick, Aug 13 '21 16:08

The error listed above is a PowerShell error indicating networking issues. Did you deploy into an existing VPC, or did you have a new VPC deployed? Which Active Directory options did you choose? I am currently looking into this error and will provide updates and a potential resolution. However, according to this Microsoft KB article, the issue seems to be due to privileges.

virtlima, Nov 08 '21 16:11

I'm experiencing the same error message when I try to deploy using the SQL Quick Start. I'm deploying into an existing VPC, using a third AZ for the witness, with a self-managed AD that was deployed using the AD Quick Start.

mdancy-sev1, Dec 02 '21 03:12

Can you try testing the code in the develop branch? I am currently working to get approvals to merge it into main, but would appreciate feedback on whether it helps resolve the issues.

virtlima, Dec 02 '21 18:12

Got it. Testing now.

mdancy-sev1, Dec 02 '21 18:12

[image: Capture]

That was the first error I encountered when using the develop branch.

I then rebuilt and encountered the original error. This could be due to my executing 'winrm quickconfig -q', as found here: https://social.technet.microsoft.com/wiki/contents/articles/13458.windows-server-troubleshooting-cau-cluster-connectivity-problems.aspx

On the third deployment the cluster deployed, and the original issue is displayed:

PowerShell DSC resource DSC_SqlAG failed to execute Set-TargetResource functionality with error message: System.InvalidOperationException: Failed to create the availability group 'SQLAG1' on the instance 'MSSQLSERVER'. ---> System.Data.SqlClient.SqlException: Failed to bring availability group 'SQLAG1' online. The operation timed out. If this is a Windows Server Failover Clustering (WSFC) availability group, verify that the local WSFC node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again. Failed to create availability group 'SQLAG1'. The operation encountered SQL Server error 41131 and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry CREATE AVAILABILITY GROUP command.
   at Microsoft.SqlServer.Management.Common.ConnectionManager.ExecuteTSql(ExecuteTSqlAction action, Object execObject, DataSet fillDataSet, Boolean catchException)
   at Microsoft.SqlServer.Management.Common.ServerConnection.ExecuteNonQuery(String sqlCommand, ExecutionTypes executionType, Boolean retry)
   --- End of inner exception stack trace ---
    + CategoryInfo : InvalidOperation: (:) [], CimException
    + FullyQualifiedErrorId : ProviderOperationExecutionFailure
    + PSComputerName : WSFCNode1

The SendConfigurationApply function did not succeed.
    + CategoryInfo : NotSpecified: (root/Microsoft/...gurationManager:String) [], CimException
    + FullyQualifiedErrorId : MI RESULT 1
    + PSComputerName : WSFCNode1

mdancy-sev1, Dec 02 '21 20:12