terraform-provider-netapp-cloudmanager

504 error during deployment or destroying resources

Open bryanheo opened this issue 3 years ago • 25 comments

Hello

We are deploying NetApp CVO in AWS through Terraform, and we sometimes get a 504 error during deployment, as shown below, even though the resources are actually created successfully in AWS. Because of the error, the TF state file is not updated, and we have to redeploy (destroying the existing AWS resources with CloudFormation and redeploying through Terraform Enterprise). If we redeploy, it works fine. The error also sometimes happens when we destroy TF resources. Is this a known issue, or is it something you can investigate?

504 error during the deployment (screenshot)

504 error during destroying TF resources

Error: code: 504, message: 
│ 
│   with module.usw2.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

Regards Moon

bryanheo avatar Aug 10 '22 08:08 bryanheo

We have not seen this kind of issue before. Could you send the playbook (.tf) file you are using, so that we can try to reproduce it on our end?

suhasbshekar avatar Aug 12 '22 19:08 suhasbshekar

@suhasbshekar the error does not always happen, but it sometimes occurs along with other error messages like those below. In addition, when we deploy a CVO HA cluster, it always takes 35 minutes. Is that normal?

Could you let me know the safe way to upload the files so that you can investigate it?

Error 1

╷
│ Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused
│ 
│ 
╵

Error 2

│ Error: Post "https://cloudmanager.cloud.netapp.com/occm/api/aws/ha/working-environments": dial tcp: lookup cloudmanager.cloud.netapp.com on 127.0.0.1:53: read udp 127.0.0.1:54913->127.0.0.1:53: read: connection refused
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 

Error 3

╷
│ Error: code: 500, message: {"message":"Server Fault","causeMessage":"ConnectException: Connection refused (Connection refused)"}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 4

 Error: code: 400, message: Failure received for messageId JDxc6CJu with context . Failure message: occm: Name or service not known
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 5

╷
│ Error: code: 400, message: Failure received for messageId Va9yIR5c with context . Failure message: {"message":"Connection refused: occm/10.5.20.4:80","cause":null,"stackTrace":[{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":96,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":82,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"recover","fileName":"Try.scala","lineNumber":233,"className":"scala.util.Failure","nativeMethod":false},{"methodName":"run","fileName":"Promise.scala","lineNumber":450,"className":"scala.concurrent.impl.Promise$Transformation","nativeMethod":false},{"methodName":"processBatch","fileName":"BatchingExecutor.scala","lineNumber":55,"className":"akka.dispatch.BatchingExecutor$AbstractBatch","nativeMethod":false},{"methodName":"$anonfun$run$1","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"apply","fileName":"JFunction0$mcV$sp.scala","lineNumber":18,"className":"scala.runtime.java8.JFunction0$mcV$sp","nativeMethod":false},{"methodName":"withBlockContext","fileName":"BlockContext.scala","lineNumber":94,"className":"scala.concurrent.BlockContext$","nativeMethod":false},{"methodName":"run","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"run","fileName":"AbstractDispatcher.scala","lineNumber":47,"className":"akka.dispatch.TaskInvocation","nativeMethod":false},{"methodName":"exec","fileName":"ForkJoinExecutorConfigurator.scala","lineNumber":47,"className":"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask","nativeMethod":false},{"methodName":"doExec","f
ileName":"ForkJoinTask.java","lineNumber":289,"className":"java.util.concurrent.ForkJoinTask","nativeMethod":false},{"methodName":"runTask","fileName":"ForkJoinPool.java","lineNumber":1056,"className":"java.util.concurrent.ForkJoinPool$WorkQueue","nativeMethod":false},{"methodName":"runWorker","fileName":"ForkJoinPool.java","lineNumber":1692,"className":"java.util.concurrent.ForkJoinPool","nativeMethod":false},{"methodName":"run","fileName":"ForkJoinWorkerThread.java","lineNumber":175,"className":"java.util.concurrent.ForkJoinWorkerThread","nativeMethod":false}],"localizedMessage":"Connection refused: occm/10.5.20.4:80","suppressed":[]}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

bryanheo avatar Aug 15 '22 14:08 bryanheo

Yes, it can sometimes take 35 minutes or more, though we test with the demo version or simple inputs; it depends on the complexity of the various inputs used.

suhasbshekar avatar Aug 16 '22 16:08 suhasbshekar

It can reach 35 minutes for HA. Is the issue reproducible? The 504? In that specific case it seems that your connector was restarted due to health failures.

edarzi avatar Aug 17 '22 06:08 edarzi

@edarzi the 504 error happens while the mediator is being created. I am trying to debug the issue, but the Cloud Manager timeline does not show the error, and the CVO clusters are created successfully after the error. In order to update the TF state file, I have to destroy the CVOs via CloudFormation and redeploy through TF again. Is there any way to investigate this? How can I check whether the connector was restarted during the deployment?


bryanheo avatar Aug 17 '22 10:08 bryanheo

Could you let us know how to import netapp-cloudmanager_cvo_aws into the TF state file as well?

bryanheo avatar Aug 17 '22 10:08 bryanheo

@edarzi @suhasbshekar as requested, I have created a NetApp support case (2009274344) and uploaded the playbook file to the case. We are using the connector policy as guided by NetApp (https://docs.netapp.com/us-en/cloud-manager-setup-admin/reference-permissions-aws.html). Could you have a look?

bryanheo avatar Aug 18 '22 10:08 bryanheo

Could you let us know how to import netapp-cloudmanager_cvo_aws in TF state file as well?

https://registry.terraform.io/providers/NetApp/netapp-cloudmanager/latest/docs/data-sources/cvo_aws
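Based on that data source documentation, a minimal sketch of referencing an existing CVO working environment without importing it into state; the `name` value and the `var.cloudmanager_client_id` variable here are illustrative assumptions, not values from this thread:

```hcl
# Hypothetical sketch: look up an already-deployed CVO working environment
# by name via the data source, instead of importing the resource into state.
data "netapp-cloudmanager_cvo_aws" "existing" {
  name      = "example-cvo-name"         # name of the existing CVO (assumed)
  client_id = var.cloudmanager_client_id # connector client ID (assumed variable)
}

# Other configuration can then reference attributes such as:
# data.netapp-cloudmanager_cvo_aws.existing.id
```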

edarzi avatar Aug 18 '22 12:08 edarzi

@edarzi @lonico we still have the same issue, and we are trying to import the resources rather than deleting the CVO through CloudFormation. Could we import the CVO resources with 'terraform import' rather than using a data source?

module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Creating...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Still creating... [10s elapsed]
╷
│ Error: code: 400, message: {"message":"The name netappamtnuse1pri is already used by another working environment. Please use another one.","causeMessage":"BadRequestException: The name netappamtnuse1pri is already used by another working environment. Please use another one."}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵
moonyoung.heo@C02C35ZVMD6T ap-netapp-np % terraform import module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this VsaWorkingEnvironment-xxxxx
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Importing from ID "VsaWorkingEnvironment-xxxxx"...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Import prepared!
  Prepared netapp-cloudmanager_cvo_aws for import
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Refreshing state... [id=VsaWorkingEnvironment-xxxxx]
╷
│ Error: code: 400, message: Missing X-Agent-Id header
│ 
│ 
╵

bryanheo avatar Aug 25 '22 12:08 bryanheo

No, we don't support importing a connector. The APIs do not allow us to fetch enough information.

It would be better if Cloud Manager could provide an API to create a connector, rather than us having to go through both the Cloud Provider APIs and the Cloud Manager APIs; this introduces a level of complexity.

lonico avatar Aug 25 '22 17:08 lonico

@lonico @edarzi @suhasbshekar the issue keeps happening from both Terraform Enterprise and a local laptop. I cannot see any error on the Cloud Manager timeline. The CVOs are deployed successfully in AWS when the error occurs, but I have to redeploy them due to the inconsistent TF state file. Do you have any methods to find out why the 504 error happens?


bryanheo avatar Sep 01 '22 08:09 bryanheo

@bryanheo Since it looks like a Cloud Manager issue, I would suggest you open a case to track this issue.

@suhasbshekar @edarzi Should we retry on such an error? How many times? Can we be more specific about the context?

lonico avatar Sep 01 '22 14:09 lonico

@lonico Thank you for your suggestion. I am not sure whether this issue is related to Cloud Manager, because I did not get the 504 error when I deployed CVO through Cloud Manager manually. Anyway, as you suggested, I will create a case on the NetApp support site.

bryanheo avatar Sep 02 '22 11:09 bryanheo

Will need some more details in order to track and debug. Ping me at [email protected]

edarzi avatar Sep 02 '22 11:09 edarzi

@edarzi Thank you for your reply. As mentioned earlier, I have uploaded our entire TF code on NetApp support case (2009274344) and could you have a look? If you cannot access the case, please let me know

bryanheo avatar Sep 02 '22 21:09 bryanheo

I will need logs from the connector


edarzi avatar Sep 03 '22 10:09 edarzi

@edarzi could you let me know how to get the logs from the connector? Could we use AutoSupport?

bryanheo avatar Sep 05 '22 09:09 bryanheo

You can download the AutoSupport file from the Cloud Manager UI and send it to my mail, please. You can also send me the service manager log from /opt/application/netapp/cloudmanager/log/service-manager.log

edarzi avatar Sep 05 '22 10:09 edarzi

@edarzi Any update on this? We're attempting to add a retry, but without understanding the root cause we don't know whether a retry would help, or how many times / how long we should retry.

lonico avatar Sep 08 '22 16:09 lonico

@edarzi I have sent an email with the AutoSupport file from the Cloud Manager UI, but the file is about 30MB and has been rejected by your mail server. Could you let me know where to upload the 30MB file? (The NetApp support ticket does not allow the AutoSupport 7z file either.) In addition, I do not know how to get /opt/application/netapp/cloudmanager/log/service-manager.log. Could you let me know how to get the log file?


bryanheo avatar Sep 08 '22 16:09 bryanheo

We released 22.9.0 yesterday (9/8). It adds retries on 504 errors. Can you see if it helps?

lonico avatar Sep 09 '22 14:09 lonico

@lonico I have deployed NetApp CVO clusters several times with 22.9.0 and have not seen the 504 error so far. It looks better than the previous version. I will let you know if we see the error again.

bryanheo avatar Sep 12 '22 14:09 bryanheo

That's great news. As you know, we added a retry on 504. You can see it in the logs by setting TF_LOG to DEBUG or TRACE. I'm curious to see whether it always works on the first retry (which would indicate some sort of transient issue) or whether we need to retry several times.

lonico avatar Sep 12 '22 14:09 lonico

Hi @lonico

I'm Gabor with NetApp Tech Support and have been working with the customer on this issue.

@bryanheo as discussed, for me to investigate from the Cloud Manager end, we need logging verbosity enabled in Cloud Manager. This might allow us to see how long it takes Cloud Manager to process the requests, and we can proactively enhance the software to work better with Terraform.

Once done, simply trigger a Cloud Manager AutoSupport and I will review it.

laagabi avatar Sep 26 '22 14:09 laagabi

Hi @lonico I thought the issue had been resolved, but it has happened again. As mentioned above, the NetApp AWS resources were created successfully, but with the 504 error the Terraform state was not updated. In other words, we have to redeploy the cluster. Could you investigate it?


bryanheo avatar Nov 04 '22 09:11 bryanheo

Hi @bryanheo, sorry for the very delayed response. Please let us know if this issue still persists.

suhasbshekar avatar Jan 31 '25 19:01 suhasbshekar

We have added exponential backoff logic for the 504 issue in release v25.2.0.

suhasbshekar avatar Mar 05 '25 19:03 suhasbshekar