`server-acl-init-cleanup` returns error of [dial tcp 172.20.0.1:443: connect: connection refused]

Open shixuyue opened this issue 3 years ago • 3 comments

Question

server-acl-init-cleanup job returns error (connect: connection refused):

2022-07-27T06:12:33.892Z [INFO]  waiting for job "consul-server-acl-init" to complete successfully
Error getting job "consul-resource-manager-server-acl-init": Get "https://172.20.0.1:443/apis/batch/v1/namespaces/<myNS>/jobs/consul-server-acl-init": dial tcp 172.20.0.1:443: connect: connection refused

The init job itself completes successfully:

2022-07-27T06:07:08.848Z [ERROR] Failure: creating agent policy - PUT /v1/acl/policy: err="Put "http://consul-server-0.consul-server.<ns>.svc:8500/v1/acl/policy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2022-07-27T06:07:08.848Z [INFO]  Retrying in 1s
2022-07-27T06:07:10.637Z [INFO]  Success: creating agent policy - PUT /v1/acl/policy
2022-07-27T06:07:10.649Z [INFO]  Success: creating server token for consul-server-0.consul-server.<ns>.svc - PUT /v1/acl/token
2022-07-27T06:07:10.654Z [INFO]  Success: updating server token for consul-server-0.consul-server.<ns>.svc - PUT /v1/agent/token/agent
2022-07-27T06:07:10.672Z [INFO]  Success: calling /agent/self to get datacenter
2022-07-27T06:07:10.672Z [INFO]  Current datacenter: datacenter=<dc> primaryDC=<dc>
2022-07-27T06:07:10.690Z [INFO]  Success: getting consul-auth-method ServiceAccount
2022-07-27T06:07:10.693Z [INFO]  Success: getting consul-auth-method-token-fv7vt Secret
2022-07-27T06:07:10.702Z [INFO]  Success: creating auth method consul-k8s-component-auth-method
2022-07-27T06:07:10.702Z [INFO]  server-acl-init completed successfully

CLI Commands (consul-k8s, consul-k8s-control-plane, helm)

delete-completed-job

Helm Configuration

global:
    acls:
        manageSystemACLs: true
        bootstrapToken:
            secretName: <consul-name>-master-token
            secretKey: token
    name: <consul-name>
    datacenter: <dc>
    domain: <consul-name>
dns:
    enabled: false
ui:
    enabled: true
client:
    enabled: false
server:
    replicas: 1
    priorityClassName: infrastructure-apps
    service:
        annotations: |
            "consul.hashicorp.com/service-ignore": "true"
    enabled: true
    storageClass: ebs-gp3
    resources:
        limits:
            memory: "2Gi"
            cpu: "500m"
        requests:
            memory: "500Mi"
            cpu: "200m"

Logs

Shown above. Let me know if you need more information.

Current understanding and Expected behavior

The cleanup job should remove the init job so that Helm knows the install completed successfully. However, because the cleanup job errors out, Helm never receives a success signal and eventually times out.

Environment details

The image I am using: hashicorp/consul-k8s-control-plane:0.46.0
The helm command I am using: helm upgrade --install --create-namespace --namespace <ns> consul hashicorp/consul

Additional Context

I suspect that the post-install container starts running before the istio-proxy container has finished starting up. The code here doesn't seem to have any retry logic to handle a situation like this: https://github.com/hashicorp/consul-k8s/blob/7423b106e50d420e08751e4c1e9de809983d336f/control-plane/subcommand/delete-completed-job/command.go#L124-L127 It returns exit code 1 immediately.

Also, <ns> is an Istio-injected namespace. I only use the Consul KV feature at the moment, so clients are not required and one server is sufficient.

shixuyue, Jul 27 '22 06:07

I can more or less confirm that my guess is correct: I created a new namespace without istio-proxy injection, and everything works fine. To fix this, we can either add retry logic instead of exiting with code 1 immediately, or pass annotations through values.yaml to these jobs so that istio-proxy is not injected, which is not currently supported. Can someone take a look? I am happy to take this task.

shixuyue, Jul 27 '22 15:07

I can work around this by updating the Istio values: setting values.global.proxy.holdApplicationUntilProxyStarts=true is enough. HOWEVER, the server-acl-init job does not shut down its istio-proxy sidecar container, so the cleanup job will never remove the init job. But this is an Istio problem rather than a Consul one. I still feel it's better to have an additional annotations section in values.yaml, so that in the future people can choose not to have istio-proxy injected.

shixuyue, Jul 27 '22 18:07
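
For readers unfamiliar with the Istio setting mentioned in the comment above, here is a minimal sketch of where it could be set. It assumes Istio is installed through an IstioOperator resource, which the thread does not state; with a Helm-based Istio install, the same global.proxy.holdApplicationUntilProxyStarts path would go in the Istio chart values instead.

```yaml
# Sketch only: assumes Istio is installed via an IstioOperator resource.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
    values:
        global:
            proxy:
                # Hold application containers until the sidecar proxy is ready,
                # so early calls to the Kubernetes API are not refused.
                holdApplicationUntilProxyStarts: true
```

As the comment notes, this only fixes startup ordering. The sidecar still keeps running after the job's main container exits, so the job may never be marked complete.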

Hi @shixuyue, yes, that sounds like what could be happening. We'd be happy to review a PR that adds annotation support for the ACL cleanup job!

lkysow, Aug 18 '22 15:08

Closing as the acl-init annotation is now implemented through this PR: https://github.com/hashicorp/consul-k8s/pull/2525

david-yu, Jul 26 '23 18:07
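
With annotation support in place, opting the ACL jobs out of sidecar injection might look roughly like the sketch below. The key name global.acls.annotations is an assumption that the thread does not confirm, so check it against the values.yaml of the chart version in use. The multi-line string form mirrors server.service.annotations from the configuration in the original report, and sidecar.istio.io/inject is Istio's per-pod injection opt-out.

```yaml
global:
    acls:
        manageSystemACLs: true
        # Assumed key name; confirm against the chart's values.yaml.
        annotations: |
            "sidecar.istio.io/inject": "false"
```

Without an injected sidecar, the jobs avoid both problems described in this thread: the startup race and the job never completing.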