kubectl exec/port-forward don't play nice when using corporate DNS

grenzr opened this issue on Nov 05 '18 · 3 comments

What happened:

My kubo test cluster in Azure looks like this:

Instance                                     Process State  AZ  IPs           VM CID                                                                             VM Type        Active
master/a7637fc0-0e2f-45dd-8e01-00d7e2e51496  running        z1  10.255.10.8   agent_id:6259000d-d684-4025-9c79-778c8350376d;resource_group_name:banana-env-cfcr  small          true
worker/088a2160-2eb9-4a2a-9d31-63ca4ee73982  running        z1  10.255.10.7   agent_id:e9053429-7cef-4844-926a-cdf12de6e90d;resource_group_name:banana-env-cfcr  small-highmem  true
worker/e738cf13-120b-4c07-87c3-c3e86078b013  running        z3  10.255.10.10  agent_id:b8d042ec-5037-4a5b-9d78-23427a06cf15;resource_group_name:banana-env-cfcr  small-highmem  true
worker/ef7d509b-502c-411b-8c89-8f3d55cb5259  running        z2  10.255.10.9   agent_id:8e410ac7-69a4-4311-a0d7-8b4fba44e15e;resource_group_name:banana-env-cfcr  small-highmem  true

When attempting to kubectl exec or kubectl port-forward in the cluster, I am greeted with a "server misbehaving" error instead of being dropped into a shell in (or forwarded to ports on) a running container:

k exec -it kibana-logging-es-cluster-56d6dfcbdb-xx5fb /bin/bash
Error from server: error dialing backend: dial tcp: lookup e9053429-7cef-4844-926a-cdf12de6e90d on 192.168.140.69:53: server misbehaving

What you expected to happen:

Both commands should behave as usual: kubectl exec should drop me into a shell in the container, and kubectl port-forward should forward the requested ports.

How to reproduce it (as minimally and precisely as possible):

We have built the latest kubo-release from master, including a cherry-picked commit from the following PR: https://github.com/cloudfoundry-incubator/kubo-release/pull/257

In our Azure subscription, we must use the corporate DNS server (i.e. not Azure DNS) enforced on the vnet (this is corporate policy at the moment and not within my control).

@andyliuliming informs me that machines created in a vnet using Azure DNS automatically register the instance agent ID as the machine name, so DNS resolution for the above commands isn't an issue in those circumstances.

I was (and still am) quite excited that BOSH operates its own DNS inside the walls of the vnet, so I had hoped this would provide sufficient cover to resolve machines internally without having to go out to the corporate DNS. Unfortunately, it appears bosh-dns doesn't currently resolve machines by raw agent ID (there is a conversation about bosh-dns resolutions, including by raw agent ID, here: https://github.com/cloudfoundry/bosh-dns-release/issues/30).
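
To illustrate the gap, this is roughly what the two lookups look like from a worker VM. The bosh-dns listener address and the network portion of the bosh-dns name ("default") are assumptions on my part, not something I've confirmed on this cluster:

# Lookup by raw agent ID against the corporate DNS server fails (this is the
# name the API server is trying to dial in the error above):
dig +short e9053429-7cef-4844-926a-cdf12de6e90d @192.168.140.69

# bosh-dns resolves instances by instance ID plus group/network/deployment,
# not by agent ID; the network name "default" here is assumed:
dig +short 088a2160-2eb9-4a2a-9d31-63ca4ee73982.worker.default.azurecfcr.bosh @169.254.0.2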

I have been wondering whether https://bosh.io/docs/dns/#aliases might help here as a workaround, but I'd also like to get a handle on how the BOSH team feel about supporting agent ID resolution, as bosh-dns feels like the right place to solve this problem, and it would be a real advantage over other k8s deployment solutions.
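
For reference, a minimal sketch of what that aliases workaround might look like using the bosh-dns-aliases addon. The addon name, alias domain, and network below are purely illustrative, and note that this aliases bosh instance IDs rather than agent IDs, so it would still need extra glue to make the node register under a resolvable name:

releases:
- name: bosh-dns-aliases
  version: latest

addons:
- name: kubo-node-aliases                  # illustrative addon name
  jobs:
  - name: bosh-dns-aliases
    release: bosh-dns-aliases
    properties:
      aliases:
      - domain: '_.worker.cfcr.internal'   # illustrative alias domain; '_' expands per instance
        targets:
        - query: '_'
          instance_group: worker
          deployment: azurecfcr
          network: default                 # assumed network name
          domain: bosh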

Anything else we need to know?:

Environment:

  • Deployment Info (bosh -d <deployment> deployment):
Name       Release(s)         Stemcell(s)                                    Config(s)          Team(s)
azurecfcr  bosh-dns/1.10.0    bosh-azure-hyperv-ubuntu-xenial-go_agent/0000  3 cloud/azurecfcr  -
           bpm/0.13.0                                                        9 cloud/default
           cfcr-etcd/1.5.0                                                   2 runtime/dns
           docker/32.1.0
           kubo/0.23.0+dev.4 <-- current master + cherry-picked commit outlined above

1 deployments

The stemcell here is the latest xenial build from the latest master branch of bosh-linux-stemcell-builder.

  • Environment Info (bosh -e <environment> environment):
Name      bosh-banana-env
UUID      f6cf854f-8767-4083-9758-f744fc2d23fb
Version   268.0.1 (00000000)
CPI       azure_cpi
Features  compiled_package_cache: disabled
          config_server: enabled
          dns: disabled
          snapshots: disabled
User      admin
  • Kubernetes version (kubectl version): 1.11.3
  • Cloud provider (e.g. aws, gcp, vsphere): azure

grenzr · Nov 05 '18 15:11

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161722421

The labels on this github issue will be updated when the story is started.

cf-gitbot · Nov 05 '18 15:11

There are three ways to fix this one, @grenzr:

  1. The customer enables DDNS registration on their DNS server for their K8s clusters (or deploys an intermediate DNS server) so that the nodes are resolvable. Each Kubo node needs to do an nsupdate for the DDNS (see the sketch after this list). This is automatic when using Azure DNS, as it is tied to the DHCP on the instance.

  2. The BOSH DNS team enables a feature where agent IDs are resolvable for BOSH nodes. That's a longer conversation.

  3. You do a BOSH DNS aliases hack like my original PR.

My thinking is #1 would be the easiest.
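
For what it's worth, a minimal sketch of the per-node DDNS registration in option 1, assuming the corporate DNS server accepts RFC 2136 dynamic updates and that the VM hostname matches the name the kubelet registers under; the zone and TTL are placeholders:

# Register this node's hostname and primary IP with the corporate DNS server
# via a dynamic update; zone and TTL below are illustrative only.
cat <<EOF | nsupdate
server 192.168.140.69
zone corp.example.com
update add $(hostname).corp.example.com 300 A $(hostname -I | awk '{print $1}')
send
EOF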

svrc · Nov 05 '18 17:11

Thanks for the feedback @svrc-pivotal, that's pretty much what I was thinking too.

I'll try 3 first, I think, as I'd prefer to keep all the deployment resolutions insulated and local to that vnet. But if that doesn't work out well, I've always got option 1 in my back pocket, which I know will work, as I have already done it during a spike of a different k8s deployment engine.

Perhaps I'll put my 2c worth into cloudfoundry/bosh-dns-release#30 as well and see where it goes, but it's good that the DNS aliases functionality can already be used now.

grenzr · Nov 06 '18 16:11