boundary icon indicating copy to clipboard operation
boundary copied to clipboard

Worker cannot connect to Controller while upgrade to 0.8.0

Open incubator4 opened this issue 2 years ago • 5 comments

Previously My company has been using Boundary as a secure connection tool for a long time since version 0.6.2 We use aws kms and postgresql for Dependency, and deploy controlller and a few workers in different regions as K8S Deployment in EKS Cluster. It workers well for now. Because there is a new release version, I decide to upgrade our infrastructure to latest release version 0.8.0.

Describe the bug I follow the Document Upgrade and Database Migration . Backup database -> scale controller deployment to replicas: 0 -> run the migration Job -> Upgrade the Controller pod image from 0.6.2 to 0.8.0

~Actually, I use to allocated less resources to the controller, then the plugin load faild again and again.~ ~Then I found similar problem in #1813 , and I allocate more resource to controller and the problem was solved~ Maybe the error message can be more friendly in feature.

It seems works well until now.But workers cannot connect to controller with following log.

Try to connect

{
  "id": "hQu6GKChpn",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "system",
  "data": {
    "version": "v0.1",
    "op": "worker.(Worker).createClientConn",
    "data": {
      "address": "<hidden-controller-address>:9201",
      "msg": "connected to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:15.386411333Z"
}

Failure log

It keeps reporting errors.

{
  "id": "wYzqTUlkSy",
  "source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing unable to dial to controller: dial tcp: lookup <available address> on 172.20.0.10:53: no such host\"",
    "error_fields": {},
    "id": "e_7X4lj1saof",
    "version": "v0.1",
    "op": "worker.(Worker).sendWorkerStatus",
    "info": {
      "msg": "error making status request to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2022-05-07T20:10:18.889134769Z"
}

Workers cannot connect to controller use both of version 0.6.2 and 0.8.0,then I rollback upgrade and restore database.

To Reproduce Steps to reproduce the behavior:

  1. Upgrade controller from 0.6.2 to 0.8.0
  2. See error

incubator4 avatar May 08 '22 14:05 incubator4

FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.

That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.

This could potentially be related to #2072 which could happen if your are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.

justenwalker avatar May 11 '22 20:05 justenwalker

Thanks for raising this @justenwalker - I won't rule out #2072 being related here, but just to be sure since this does look DNS related, can you exec into the worker container and run a nslookup or telnet to the IP it's unable to connect to?

malnick avatar May 11 '22 21:05 malnick

@incubator4 raised the issue, so they'd have to try this. Just added comment to the other issue since I encountered this problem because of loadbalancer health checks; so it seems plausibly related.

justenwalker avatar May 11 '22 21:05 justenwalker

FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- dns lookup is failing to find it.

That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.

This could potentially be related to #2072 which could happen if your are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.

In fact, is an aws external ELB with public address (just like boundary-controller01.mycompany.net). And I use liveness/readiness tcp check with api port with 9200, it might be some dns error. I've seen issue #2072 ,it was similar with my another issue #2062 , i thought both of these might be one question( 0.7.6 works but 0.8.0 not)

incubator4 avatar May 12 '22 08:05 incubator4

Hi there -- has this been addressed in later releases for you?

jefferai avatar Sep 06 '22 13:09 jefferai