boundary
Worker cannot connect to Controller while upgrading to 0.8.0
Previously
My company has been using Boundary as a secure connection tool for a long time, since version 0.6.2.
We use AWS KMS and PostgreSQL as dependencies, and deploy the controller and a few workers in different regions as K8S Deployments in an EKS cluster.
It has worked well so far. Because there is a new release, I decided to upgrade our infrastructure to the latest version, 0.8.0.
Describe the bug
I followed the Upgrade and Database Migration documentation.
Back up the database -> scale the controller Deployment to replicas: 0 -> run the migration Job -> upgrade the controller pod image from 0.6.2 to 0.8.0
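For anyone following along, the sequence above can be sketched as an ordered list of commands. This is only a rough sketch, not the exact commands I ran: the namespace, Deployment, container, Job name, manifest path and database URL are all hypothetical placeholders, and the final scale-up step is an assumption (the steps above do not mention it, but the Deployment has to come back from replicas: 0 somehow). Nothing is executed; the function only builds and prints the commands.

```python
import shlex


def upgrade_commands(namespace, deployment, container, image,
                     job_name, job_manifest, db_url, backup_file):
    """Return the upgrade steps, in order, as argv lists (nothing is run)."""
    ns = ["kubectl", "-n", namespace]
    return [
        # 1. Back up the database before touching anything.
        ["pg_dump", "--dbname", db_url, "--file", backup_file],
        # 2. Stop all controllers so none runs during the migration.
        ns + ["scale", f"deployment/{deployment}", "--replicas=0"],
        # 3. Run the migration Job (hypothetical manifest) and wait for it.
        ns + ["apply", "-f", job_manifest],
        ns + ["wait", "--for=condition=complete", f"job/{job_name}"],
        # 4. Point the controller Deployment at the new image.
        ns + ["set", "image", f"deployment/{deployment}", f"{container}={image}"],
        # 5. Assumption: bring the controller back up afterwards.
        ns + ["scale", f"deployment/{deployment}", "--replicas=1"],
    ]


if __name__ == "__main__":
    # Every value below is a placeholder, not taken from a real cluster.
    steps = upgrade_commands(
        namespace="boundary",
        deployment="boundary-controller",
        container="boundary",
        image="hashicorp/boundary:0.8.0",
        job_name="boundary-migrate",
        job_manifest="migrate-job.yaml",
        db_url="postgresql://user@db-host/boundary",
        backup_file="boundary-backup.sql",
    )
    for argv in steps:
        print(shlex.join(argv))
```

Printing the commands instead of running them makes it easy to review the order before anything touches the database.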
~Actually, I used to allocate fewer resources to the controller, and then the plugin load failed again and again.~ ~Then I found a similar problem in #1813, allocated more resources to the controller, and the problem was solved.~ Maybe the error message could be friendlier in the future.
It seemed to work well up to this point. But the workers cannot connect to the controller, and produce the following log.
Try to connect
{
"id": "hQu6GKChpn",
"source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
"specversion": "1.0",
"type": "system",
"data": {
"version": "v0.1",
"op": "worker.(Worker).createClientConn",
"data": {
"address": "<hidden-controller-address>:9201",
"msg": "connected to controller"
}
},
"datacontentype": "application/cloudevents",
"time": "2022-05-07T20:10:15.386411333Z"
}
Failure log
It keeps reporting errors.
{
"id": "wYzqTUlkSy",
"source": "https://hashicorp.com/boundary/canary-hk-eks-1_boundary-worker-6f4fdd8db5-x8pxg",
"specversion": "1.0",
"type": "error",
"data": {
"error": "rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing unable to dial to controller: dial tcp: lookup <available address> on 172.20.0.10:53: no such host\"",
"error_fields": {},
"id": "e_7X4lj1saof",
"version": "v0.1",
"op": "worker.(Worker).sendWorkerStatus",
"info": {
"msg": "error making status request to controller"
}
},
"datacontentype": "application/cloudevents",
"time": "2022-05-07T20:10:18.889134769Z"
}
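To make it easier to see which name fails and which resolver answers, here is a minimal sketch that pulls those fields out of an error event like the one above. It assumes the event is a single JSON object and that the message follows the "lookup HOST on RESOLVER:PORT: no such host" wording seen in the log; the function and variable names are my own, not part of Boundary.

```python
import json
import re

# Matches the DNS failure wording seen in the worker's error event.
LOOKUP_RE = re.compile(
    r"lookup (?P<host>.+?) on (?P<resolver>\S+?):(?P<port>\d+): no such host"
)


def dns_failure(event_json):
    """Return host/resolver details from an error event, or None."""
    event = json.loads(event_json)
    if event.get("type") != "error":
        return None
    data = event.get("data", {})
    match = LOOKUP_RE.search(data.get("error", ""))
    if match is None:
        return None
    return {
        "host": match["host"],
        "resolver": match["resolver"],
        "port": int(match["port"]),
        "op": data.get("op"),
    }


# Trimmed copy of the failure event, built with json.dumps to avoid
# hand-escaping the nested quotes.
sample = json.dumps({
    "type": "error",
    "data": {
        "error": 'rpc error: code = Unavailable desc = last connection error: '
                 'connection error: desc = "transport: Error while dialing '
                 'unable to dial to controller: dial tcp: lookup '
                 '<available address> on 172.20.0.10:53: no such host"',
        "op": "worker.(Worker).sendWorkerStatus",
    },
})

if __name__ == "__main__":
    print(dns_failure(sample))
```

In my case the resolver is 172.20.0.10:53 and the host is the (redacted) controller address.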
Workers on both 0.6.2 and 0.8.0 could not connect to the controller, so I rolled back the upgrade and restored the database.
To Reproduce
Steps to reproduce the behavior:
- Upgrade controller from 0.6.2 to 0.8.0
- See error
FWIW; tcp: lookup <available address> on 172.20.0.10:53: no such host looks like a DNS issue. Whatever <available address> is (like boundary-controller01.mycompany.net) -- DNS lookup is failing to find it.
That may not be the root issue though; if you are using a DNS-based loadbalancer in front of your controllers and your health checks are failing that may show up as that since there are no healthy hosts.
This could potentially be related to #2072, which could happen if you are using port 9201 as your health check endpoint and the load balancer is hitting it with an unexpected packet body - causing it to crash/stop listening.
Thanks for raising this @justenwalker - I won't rule out #2072 being related here, but just to be sure, since this does look DNS related, can you exec into the worker container and run an nslookup or telnet against the address it's unable to connect to?
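If nslookup or telnet is not available in the container, roughly the same two checks (resolve the name, then open a TCP connection) can be sketched with Python's standard library. This assumes a Python interpreter is reachable from the same network namespace as the worker, which may not hold for the stock image; the function name is hypothetical, and 9201 is the port shown in the worker log.

```python
import socket
import sys


def check_controller(host, port=9201, timeout=3.0):
    """Return (resolved_addresses, tcp_ok, error_text) for host:port."""
    # Step 1: DNS lookup, the part that fails with "no such host".
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], False, f"DNS lookup failed: {exc}"
    addresses = sorted({info[4][0] for info in infos})

    # Step 2: plain TCP connect, similar to what telnet would do.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return addresses, False, f"TCP connect failed: {exc}"
    return addresses, True, ""


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check.py <controller-host> [port]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else 9201
    print(check_controller(sys.argv[1], target_port))
```

An empty address list points at DNS; a populated list with a failed connect points at the listener or something in between, such as a load balancer with no healthy targets.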
@incubator4 raised the issue, so they'd have to try this. Just added comment to the other issue since I encountered this problem because of loadbalancer health checks; so it seems plausibly related.
In fact,boundary-controller01.mycompany.net
).
And I use liveness/readiness TCP checks on the API port, 9200; it might be some DNS error.
I've seen issue #2072; it is similar to my other issue, #2062. I think both might be the same problem (0.7.6 works but 0.8.0 does not).
Hi there -- has this been addressed in later releases for you?