Document Route53 and Google DNS troubleshooting
Problem Statement
Configuring DNS correctly can be challenging. How DNS works with kops is already documented, but there is a lot of tribal knowledge that we could document as well. We should also mention that kops gossip is an option for not using DNS at all.
Purpose
Document how to diagnose and troubleshoot DNS configuration with Route53 and Google Cloud DNS.
Components involved
DNS / Cloud
- Domain Registrars
- DNS Providers - Route53 and Google Cloud
kops ecosystem
- kops - creates placeholder DNS entries with the 203.0.113.123 IP address
- protokube - configures etcd DNS records
- dns-controller k8s deployment - configures cluster API endpoint DNS record
Validating Provider setup
At the end of the day, "dig ns subdomain.example.com" has to work. The Route53 or Google Cloud zone IDs may need to be used with kops.
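One way to check this, sketched in shell: compare the name servers that public DNS returns for the subdomain with the ones listed on the hosted zone. The helper names normalize_ns and same_ns_set and the ZONE_ID placeholder are illustrative assumptions, not part of kops, and the commented dig and aws calls need network access and a real hosted zone.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of kops).

# Print one name server per line: lowercased, trailing dot removed, sorted, deduplicated.
normalize_ns() {
  # $1 is left unquoted on purpose so spaces, tabs, and newlines all split the list.
  printf '%s\n' $1 | tr 'A-Z' 'a-z' | sed -e 's/\.$//' -e '/^$/d' | sort -u
}

# Succeed when both whitespace-separated lists name the same set of servers.
same_ns_set() {
  [ "$(normalize_ns "$1")" = "$(normalize_ns "$2")" ]
}

# Usage (needs network access; ZONE_ID is a placeholder for your hosted zone ID):
#   delegated="$(dig +short NS subdomain.example.com)"
#   zone="$(aws route53 get-hosted-zone --id "$ZONE_ID" \
#             --query 'DelegationSet.NameServers' --output text)"
#   same_ns_set "$delegated" "$zone" && echo "delegation matches" || echo "name servers differ"
```

If the two sets differ, the subdomain is delegated to a different zone than the one kops is writing records into.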
How kops does DNS
- flags on kops to set domain
- the user runs kops and Route53 domain entries are created
- the master node(s) are created and the protokube container is started
- protokube creates DNS records for etcd
- once the k8s cluster master(s) are stable, the dns-controller deployment is started
- dns-controller updates the API endpoint DNS record
Diagnosis Tools
- Cloud consoles - Route53 and Google Cloud
- dig
- logs from protokube
- logs from dns-controller
- aws cli
- gcloud cli
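Most of the tools above boil down to a handful of read-only commands. A minimal sketch, assuming the dns-controller deployment runs in the kube-system namespace under the name dns-controller (verify this on your cluster); the function names, the DRY_RUN switch, and the zone arguments are illustrative. protokube logs are read on the master itself and are not covered here.

```shell
#!/usr/bin/env bash
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# dns_diagnostics DNS_ZONE CLUSTER_NAME ROUTE53_ZONE_ID [GCLOUD_ZONE_NAME]
# All four arguments are placeholders for your own values.
dns_diagnostics() {
  local dns_zone="$1" cluster="$2" zone_id="$3" gcloud_zone="${4:-}"
  run dig +short NS "$dns_zone"           # is the zone delegated at all?
  run dig +short A "api.$cluster"         # placeholder or real API address?
  run aws route53 list-resource-record-sets --hosted-zone-id "$zone_id"
  if [ -n "$gcloud_zone" ]; then
    run gcloud dns record-sets list --zone "$gcloud_zone"
  fi
  run kubectl -n kube-system logs deployment/dns-controller
}

# Usage: preview the commands without touching the network.
#   DRY_RUN=1 dns_diagnostics k8s.domain.com dev.k8s.domain.com Z0000EXAMPLE
```

Running with DRY_RUN=1 first makes it easy to paste the exact commands into an issue when asking for help.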
kops current documentation
aws tutorial https://github.com/kubernetes/kops/blob/master/docs/aws.md#configure-dns
dns-controller documentation https://github.com/kubernetes/kops/blob/9c1ef822ab9766091491826bcdea162261bc3bdd/dns-controller/README.md
creating a sub-domain https://github.com/kubernetes/kops/blob/master/docs/creating_subdomain.md
External documentation
- http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/CreatingNewSubdomain.html
- https://cloud.google.com/dns/quickstart
- https://cloud.google.com/appengine/docs/standard/python/mapping-custom-domains
- https://aws.amazon.com/route53/faqs/
Related Issues / Comments / PRs
- https://github.com/kubernetes/kops/issues/762#issuecomment-257365957
- https://github.com/kubernetes/kops/issues/1230
- https://github.com/kubernetes/kops/issues/3273
- https://github.com/kubernetes/kops/issues/1386
Lastly a PR that got closed
https://github.com/justinsb/kops/blob/a09edc22d6b3a070f828e2e69ac9d4bde0cfe534/docs/tour/dns.md
Gaps
We do not have ANY documentation for Google Cloud DNS.
/area dns /area documentation
Problem: I was trying to set up kubernetes using kops on AWS and got the following error:
unexpected error during validation: unable to resolve Kubernetes cluster API URL dns: lookup example.com on 127.0.0.53:53: server misbehaving
My A records started out as 203.0.113.123, the kops placeholder record. After ~10 minutes all of my A records had been updated except for the first two:
Route53 IP addresses:
api.example.com: 203.0.113.123
api.internal.example.com: 203.0.113.123
etcd-a.internal.example.com: 172.20.39.255
etcd-b.internal.example.com: 172.20.68.11
...
Root problem: Route53 configured my Hosted zones and Registered domains with different name servers.
Solution: Change the four name servers under Registered domains to match the name servers of your root domain under Hosted zones, e.g., ns-25.awsdns-1.com, ns-12.awsdns-09.co.uk, ns-12.awsdns-09.net, ns-67.awsdns-88.org.
The change takes about 3 hours to propagate. You can watch the progress at https://dnschecker.org/ or from the command line with "dig ns example.com". Once it's done, delete the cluster and create it again. You will see the same error message for the first 10 minutes, but after that the A records for api.example.com and api.internal.example.com will update.
Note: You will also see this error even if you have the correct name servers. If you have the correct name servers, wait 10-15 min and it will work.
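To see at a glance which records are still waiting to be updated, the record list can be filtered for the placeholder address. A small sketch; the function name placeholder_records is made up for illustration, and the input is assumed to be one "name: ip" pair per line.

```shell
#!/usr/bin/env bash
# Placeholder address that the records hold before the real IPs are written.
PLACEHOLDER_IP="203.0.113.123"

# Read "name: ip" lines on stdin and print the names still set to the placeholder.
placeholder_records() {
  awk -v ip="$PLACEHOLDER_IP" '{ sub(/:$/, "", $1) } $2 == ip { print $1 }'
}

# Usage:
#   printf '%s\n' \
#     "api.example.com: 203.0.113.123" \
#     "etcd-a.internal.example.com: 172.20.39.255" | placeholder_records
#   # prints: api.example.com
```

An empty result means every record has moved off the placeholder; if names are still listed after 10-15 minutes, check the name servers first.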
I have nearly the same problem, but I couldn't find a solution and I'm not sure what is causing it.
Here are the steps that I used:
- I created a cluster using the following command:
kops create cluster \
--cloud=aws \
--node-count=2 \
--node-size=t2.medium \
--zones=eu-west-1a \
--master-size=t2.medium \
--master-zones=eu-west-1a \
--dns-zone=k8s.domain.com \
--name=dev.k8s.domain.com \
--topology=private \
--networking=weave \
--cloud-labels="Env=Dev" \
--state=s3://domain-dev-kops-state-store \
--ssh-public-key=~/.ssh/id_rsa.pub \
--yes
- I created 2 hosted zones on AWS Route53:
domain.com ---> parent domain
k8s.domain.com ---> child domain (the cluster domain), with the NS records from the subdomain added to the parent domain
The cluster is up, but for any resource within the cluster that is exposed as a LoadBalancer type, the URL given by the cluster, something like
a5df3af45733411e99464025095a7ed8-300168090.eu-west-1.elb.amazonaws.com
doesn't work and returns page not found.
I have tried the above steps multiple times without success. Any advice?