
Dial proxy connection refused with '--core' option

Open · SteveJ-IT opened this issue 7 months ago • 7 comments

Hi,

We are using this provider in our Terraform CI/CD and communicate with the unexposed ArgoCD API via Kubernetes using the 'core = true' option in the provider configuration. However, when a Terraform apply or destroy involves more than one change, e.g. a new project and a role token, or two new applications, the provider only makes one change (or none when destroying) and fails on a random port with a connection refused error.

Applying a single change works. I've used Terraform's '-target' option to deploy two changes individually, and that works, but a subsequent destroy tries to remove more than one resource and fails with the same error.

Is there any reason why the provider fails to communicate on the second and subsequent changes, when the first change works fine?

Thanks

Terraform Version, ArgoCD Provider Version and ArgoCD Version

Terraform version: v1.11.3
ArgoCD provider version: 7.5.2
ArgoCD version: v2.14.8+a7178be

Affected Resource(s)

From our usage this affects resources of two different types, e.g. an application and a project, or an application and a repository. If you deploy just two applications the issue is not present.

Provider Configuration

provider "argocd" {
  core = true
}

Debug Output

The primary error:

transport: error while dialing: error dial proxy: dial tcp [::1]:59353: connect: connection refused

Full log

2025-04-03T14:54:59.657+0100 [ERROR] provider.terraform-provider-argocd_v7.5.2: Response contains error diagnostic: diagnostic_severity=ERROR tf_proto_version=6.8 tf_provider_addr=registry.terraform.io/argoproj-labs/argocd tf_req_id=42509877-c469-f025-acde-27b1e209a65e tf_resource_type=argocd_project_token @module=sdk.proto diagnostic_detail="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: error dial proxy: dial tcp [::1]:59353: connect: connection refused\"" diagnostic_summary="failed to read project platform" tf_rpc=ReadResource @caller=github.com/hashicorp/[email protected]/tfprotov6/internal/diag/diagnostics.go:58 timestamp="2025-04-03T14:54:59.657+0100"
2025-04-03T14:54:59.657+0100 [TRACE] provider.terraform-provider-argocd_v7.5.2: Served request: tf_provider_addr=registry.terraform.io/argoproj-labs/argocd tf_rpc=ReadResource @caller=github.com/hashicorp/[email protected]/tfprotov6/tf6server/server.go:803 @module=sdk.proto tf_proto_version=6.8 tf_req_id=42509877-c469-f025-acde-27b1e209a65e tf_resource_type=argocd_project_token timestamp="2025-04-03T14:54:59.657+0100"
2025-04-03T14:54:59.657+0100 [ERROR] vertex "module.argocd_project[\"platform\"].module.argocd_project_token[\"test\"].argocd_project_token.instance (orphan)" error: failed to read project platform
2025-04-03T14:54:59.657+0100 [TRACE] vertex "module.argocd_project[\"platform\"].module.argocd_project_token[\"test\"].argocd_project_token.instance (orphan)": visit complete, with errors
2025-04-03T14:54:59.657+0100 [TRACE] dag/walk: upstream of "root" errored, so skipping
2025-04-03T14:54:59.657+0100 [TRACE] vertex "module.argocd_project.module.argocd_project_token.argocd_project_token.instance (expand)": dynamic subgraph encountered errors: failed to read project platform
2025-04-03T14:54:59.657+0100 [ERROR] vertex "module.argocd_project.module.argocd_project_token.argocd_project_token.instance (expand)" error: failed to read project platform

Steps to Reproduce

  1. Start ArgoCD inside a Kubernetes cluster, e.g. Rancher or minikube
  2. Set the ~/.kube/config context's namespace to 'argocd'
  3. Use the same provider config where 'core = true' is set
  4. Try to apply more than one ArgoCD resource at once, e.g. two applications. It fails on the second one.

Expected Behavior

Terraform applies / destroys without error.

Actual Behavior

Terraform fails: on apply, only one change is made; on destroy, no changes are made.

Important Factoids

ArgoCD is running inside Rancher

References

N/A

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

SteveJ-IT avatar Apr 03 '25 15:04 SteveJ-IT

Let me look into this tomorrow. We might have the same issue in our CI/CD, at least I see something very similar in our tests.

the-technat avatar Apr 03 '25 16:04 the-technat

I also had the same issue. It would be good to see this fixed so we can continue using the Argo Terraform provider instead of having to fall back to the Argo CLI.

vazkarvishal avatar Apr 04 '25 08:04 vazkarvishal

I can reproduce the issue locally with ease, using the following snippet:

terraform {
  required_providers {
    argocd = {
      source  = "argoproj-labs/argocd"
      version = "7.5.2"
    }
  }
}

provider "argocd" {
  core = true
}

resource "argocd_project" "x" {
  metadata {
    name      = "x"
    namespace = "argocd"
  }

  spec {
    description = "Project X"

    source_namespaces = ["argocd"]
    source_repos      = ["*"]

    destination {
      server    = "https://kubernetes.default.svc"
      namespace = "*"
    }
    cluster_resource_whitelist {
      group = "*"
      kind  = "*"
    }
  }
}

resource "argocd_application" "example" {
  metadata {
    name      = "guestbook"
    namespace = "argocd"
  }

  cascade = false
  wait    = false

  spec {
    project = "default"

    destination {
      server    = "https://kubernetes.default.svc"
      namespace = "foo"
    }

    source {
      repo_url        = "https://github.com/argoproj/argocd-example-apps.git"
      path            = "guestbook"
      target_revision = "master"
    }

    sync_policy {
      automated {
        prune       = true
        self_heal   = true
        allow_empty = true
      }
      sync_options = ["Validate=false", "CreateNamespace=true"]
    }
  }
}

I tested with the acceptance testing environment (make testacc_prepare_env).

Interestingly, the error doesn't seem to appear when I create a repo & project, only when creating a project & app. Other combinations haven't been tested so far.

the-technat avatar Apr 04 '25 10:04 the-technat

@the-technat Thanks for providing a code sample to demonstrate the issue. I hadn't got round to this, as a lot of our code is written with Terraform generics where I'm encountering the issue, and you've saved me the time of writing a standalone example.

SteveJ-IT avatar Apr 04 '25 18:04 SteveJ-IT

Glad that was already helpful. I just haven't had time to fully dig into the code and analyze where exactly the issue appears. So far, looking at the debugger, it seems to appear in the Read and Create functions of the application resource, but the bug doesn't seem to be triggered in every case. If I slowly step through the code line by line the create might eventually succeed 🙈.

the-technat avatar Apr 05 '25 10:04 the-technat

Just an update: I've found that deploying applications is fine for the most part, but in one example I have a repository and an application deployed, and any subsequent terraform plan returns

╷
│ Error: failed to read application foobar
│ 
│   with module.application["foobar"].argocd_application.instance,
│   on helpers/argocd_application/main.tf line 1, in resource "argocd_application" "instance":
│    1: resource "argocd_application" "instance" {
│ 
│ rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout

We're also finding that if you create two repositories and then try to remove both at the same time, the operation fails and leaves Terraform with a state file referencing a repository that no longer exists in ArgoCD.

Plan: 0 to add, 0 to change, 2 to destroy.

Changes to Outputs:
  ~ argocd_repositories   = (sensitive value)
module.repository["nginx-stable"].argocd_repository.instance: Destroying... [id=https://helm.nginx.com/stable]
module.repository["test"].argocd_repository.instance: Destroying... [id=ssh://[email protected]:2222/admin/test.git]
module.repository["test"].argocd_repository.instance: Destruction complete after 2s

│ Error: failed to delete repository https://helm.nginx.com/stable
│ 
│ rpc error: code = Unknown desc = Timed out waiting for settings cache to sync

At this point the nginx-stable repository had actually been deleted from ArgoCD, but Terraform never received notice that it was.

terraform apply --auto-approve
module.repository["nginx-stable"].argocd_repository.instance: Refreshing state... [id=https://helm.nginx.com/stable]

Planning failed. Terraform encountered an error while generating this plan.

│ Error: failed to read repository https://helm.nginx.com/stable
│ 
│   with module.repository["nginx-stable"].argocd_repository.instance,
│   on helpers/argocd_repository/main.tf line 1, in resource "argocd_repository" "instance":
│    1: resource "argocd_repository" "instance" {
│ 
│ rpc error: code = PermissionDenied desc = permission denied

Terraform fails trying to read information on a repository that no longer exists. Not sure if these are related to the original issue.

As a temporary workaround we are managing Projects, Repositories and Applications in ArgoCD via the declarative setup using the Kubernetes provider, and using this Argo provider for the remaining resources, which seems to work well.
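
For illustration, a minimal sketch of that declarative workaround, assuming the hashicorp/kubernetes provider's kubernetes_manifest resource (the project name and spec values are placeholders, not our actual config):

```hcl
# Illustrative only: manage an AppProject via the Kubernetes provider
# instead of the ArgoCD provider, bypassing the local api-server entirely.
resource "kubernetes_manifest" "argocd_project" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "AppProject"
    metadata = {
      name      = "platform" # placeholder
      namespace = "argocd"
    }
    spec = {
      sourceRepos = ["*"]
      destinations = [
        {
          server    = "https://kubernetes.default.svc"
          namespace = "*"
        }
      ]
    }
  }
}
```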

Hopefully these reports are useful though in finding the issue.

SteveJ-IT avatar Apr 08 '25 09:04 SteveJ-IT

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 09 '25 13:06 github-actions[bot]

We are experiencing a similar issue in our CI/CD process: the resource is created successfully the first time, but when terraform plan is executed afterwards we receive an error like the following:

╷
│ Error: failed to read application metrics-server
│
│   with argocd_application.metrics_server,
│   on metrics-server.tf line 1, in resource "argocd_application" "metrics_server":
│    1: resource "argocd_application" "metrics_server" {
│
│ rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: error dial proxy: dial tcp
│ 127.0.0.1:38511: connect: connection refused" 

jpatavahi avatar Jun 13 '25 20:06 jpatavahi

@SteveJ-IT your permission denied error sounds more like #416 or a related issue, whereas the other ones could be related to this one. I'm assuming you are also using the provider with --core or port-forwarding?

the-technat avatar Jun 16 '25 06:06 the-technat

@jpatavahi your error also sounds similar to what we experience a lot in our CI env. That is surely related to the argocd-server pod being gone and the provider not being smart enough to quickly pick a new pod to port-forward to, thus failing with connection timeouts. I assume that in CI there is a higher chance of pods or deployments restarting, so the error appears more frequently. We also see this behavior if we update Argo CD in the same TF run in which we update applications deployed with this provider.

I'm assuming that this is an isolated problem not related to the --core option. Thus I'll open a separate issue for this.

the-technat avatar Jun 16 '25 06:06 the-technat

I think I found an interesting pattern. In core mode the provider starts a local Argo CD api-server instance for every resource's CRUD functions and uses it to communicate (the api-server in turn modifies K8s objects behind the scenes). These api-server processes seem to live only as long as one resource takes to deploy, but I suspect the connection string for them is a shared variable. That would explain the strange behavior where one resource is created immediately while the other gets stuck: once the first resource is created, its local process is terminated and the other resource's request is effectively lost (because the connection string referencing the local process is shared, the second resource reused the api-server already started by the first).
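
The suspected shared-variable pattern can be sketched in Go (purely illustrative; the variable and function names are invented and are not the provider's actual code):

```go
package main

import (
	"fmt"
	"net"
)

// sharedProxyAddr stands in for the suspected shared connection string:
// every resource's CRUD call reads and writes this one variable.
var sharedProxyAddr string

// startLocalAPIServer mimics a resource spinning up its short-lived local
// api-server on an ephemeral port and recording the address globally.
func startLocalAPIServer() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	sharedProxyAddr = ln.Addr().String()
	return ln, nil
}

// dialShared mimics a second resource reusing the recorded address
// instead of starting a server of its own.
func dialShared() error {
	conn, err := net.Dial("tcp", sharedProxyAddr)
	if err == nil {
		conn.Close()
	}
	return err
}

func main() {
	// Resource A: start its api-server, finish its work, tear it down.
	ln, err := startLocalAPIServer()
	if err != nil {
		panic(err)
	}
	ln.Close()

	// Resource B: dials the now-stale shared address and fails, matching
	// the "dial tcp ... connect: connection refused" errors in the logs.
	if err := dialShared(); err != nil {
		fmt.Println("dial error:", err)
	}
}
```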

@onematchfox maybe you can elaborate/verify whether that's actually the case (my provider knowledge somewhat ends here)? If so, we'd probably need to move this local api-server process somewhere central so it's only started once, or try to decouple these processes even more?

the-technat avatar Jun 16 '25 16:06 the-technat

Re-tested this on 7.11.2 and the error is still present when creating/updating/deleting more than one application in one go using 'core':

Plan: 2 to add, 0 to change, 2 to destroy.
module.application["test-one"].argocd_application.instance: Destroying... [id=test-one:argocd]
module.application["test-two"].argocd_application.instance: Destroying... [id=test-two:argocd]
module.application["test-one"].argocd_application.instance: Destruction complete after 1s
module.application["test-two"].argocd_application.instance: Destruction complete after 1s
module.application["test-one"].argocd_application.instance: Creating...
module.application["test-two"].argocd_application.instance: Creating...

╷
│ Error: failed to list existing applications when creating application test-one
│ 
│   with module.application["test-one"].argocd_application.instance,
│   on helpers/argocd_application/main.tf line 46, in resource "argocd_application" "instance":
│   46: resource "argocd_application" "instance" {
│ 
│ rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: error dial proxy: dial tcp [::1]:50929: connect:
│ connection refused"
╵
╷
│ Error: failed to list existing applications when creating application test-two
│ 
│   with module.application["test-two"].argocd_application.instance,
│   on helpers/argocd_application/main.tf line 46, in resource "argocd_application" "instance":
│   46: resource "argocd_application" "instance" {
│ 
│ rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: error dial proxy: dial tcp [::1]:50929: connect:
│ connection refused"

SteveJ-IT avatar Nov 04 '25 08:11 SteveJ-IT