
Allowing skipping HTTP01 and DNS01 self-check on a per-solver basis

Open xmassx opened this issue 5 years ago • 102 comments

Is your feature request related to a problem? Please describe. Kind of intersects with #863: in NAT'd networks, cert-manager can't successfully self-validate ACME challenges, because local k8s providers refuse to set up hairpin NAT.

Describe the solution you'd like An env variable or command-line flag to skip the local self-check and leave it up to the user.

Describe alternatives you've considered Nothing

Environment details (if applicable):

  • Kubernetes version (Any):
  • Cloud-provider/provisioner (Cheap local providers):
  • Install method (Helm):

/kind feature

xmassx avatar Jan 31 '19 19:01 xmassx

This has been discussed before and we've avoided allowing it as we need some way to ensure that the challenge has propagated.

For DNS01, options like --dns01-recursive-nameservers and --dns01-recursive-nameservers-only help users that have DNS restricted environments that use DNS01.

I wonder if we can provide some other means to allow you to complete the self check without disabling it altogether? i.e. by overriding the server that we query for challenges?

/priority awaiting-more-evidence /help

munnerz avatar Feb 07 '19 12:02 munnerz

@munnerz: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jetstack-bot avatar Feb 07 '19 12:02 jetstack-bot

Yeah, a custom server for queries definitely makes sense. You did a perfect job with the DNS01 flags, but I think this could be more confusing for HTTP01. For me, it would be OK to add a flag like --http01-external-address=10.0.0.10:80 or something similar, where the user can set an alternate service that proxies requests to the cluster's public IP. But in that scenario the user could point it at a locally created service with endpoints configured to serve the challenges, which is de facto the same as disabling the check altogether. For me, though, that behaviour is perfectly fine.
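To make the proposal concrete, here is a minimal Python sketch of an HTTP01-style self-check that dials an overridden address while still sending the real domain in the Host header. It is not cert-manager's implementation (cert-manager is written in Go), `--http01-external-address` is only the flag name suggested in this comment, and `self_check` and its parameters are hypothetical.

```python
import http.client
from typing import Optional


def self_check(domain: str, token: str, expected_key: str,
               override_addr: Optional[str] = None,
               timeout: float = 5.0) -> bool:
    """Return True if the challenge URL serves the expected key authorization.

    override_addr is a hypothetical "host:port" (hostname or IPv4) to dial
    instead of resolving the domain; the Host header still names the domain.
    """
    target = override_addr or domain
    host, _, port = target.partition(":")
    conn = http.client.HTTPConnection(host, int(port) if port else 80,
                                      timeout=timeout)
    try:
        # Passing an explicit Host header stops http.client from adding its own.
        conn.request("GET", f"/.well-known/acme-challenge/{token}",
                     headers={"Host": domain})
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace").strip()
        return resp.status == 200 and body == expected_key
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections and malformed replies all count as failure.
        return False
    finally:
        conn.close()
```

As the comment points out, nothing stops the override from pointing at a service that never sees external traffic, so a check like this only proves the solver is serving the token, not that the ACME server can reach it.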

xmassx avatar Feb 09 '19 07:02 xmassx

I would like to disable the self-check, too: we have a k8s cluster with different inbound gateways and NAT, and we can't hairpin the external DNS name to the correct internal IP in every scenario for every domain name. As a result, internal checks against the external IP time out, while requests from certbot's servers can read the challenge without problems.

proligde avatar Feb 13 '19 17:02 proligde

We are also having this issue. While the HTTP01 self-check failure can be worked around with hairpin NAT or an external split-horizon DNS, in some cases this can be a real pain (such as when bootstrapping a system; see the Rancher 2.0 HA install, where you need an external LB).

datagen24 avatar Mar 24 '19 19:03 datagen24

I have the same issue: I cannot use cert-manager because of the self-check tests. My router does not support hairpinning.

ghost avatar May 01 '19 19:05 ghost

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

retest-bot avatar Jul 30 '19 20:07 retest-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle rotten /remove-lifecycle stale

retest-bot avatar Aug 29 '19 20:08 retest-bot

this would be a great feature. we are experiencing self check failures due to our DNS policy of only allowing internal DNS servers for internal lookups. the self check is the only thing preventing the challenge from completing.

branttaylor avatar Sep 23 '19 19:09 branttaylor

/remove-lifecycle rotten

branttaylor avatar Sep 23 '19 19:09 branttaylor

+1

lhzbxx avatar Oct 05 '19 05:10 lhzbxx

+1

lucax88x avatar Oct 14 '19 12:10 lucax88x

+1

schaerli avatar Oct 14 '19 12:10 schaerli

Given we now better handle backing off when an Order fails, I think we could consider adding this as an option on the ACME solver.

Logically, it seems it'd make sense to make this an option that applies to both DNS01 and HTTP01 solvers.

If someone wants to give implementing this a go, please drop a comment here first so we can firm up the design details 😄

/cc @JoshVanL

munnerz avatar Oct 16 '19 13:10 munnerz

/area acme /area api

munnerz avatar Oct 16 '19 13:10 munnerz

In our case, the node running Nginx Ingress controller somehow is able to visit the HTTP01 endpoint.

So we used podAffinity to schedule cert-manager on the same node and it solves the issue.

yujinyan avatar Nov 05 '19 03:11 yujinyan

Any progress on this? We are running kubernetes in a cloud provider that does not support hairpinning. Without this feature we couldn't deploy cert-manager successfully.

savas127 avatar Dec 10 '19 14:12 savas127

The root problem is in Kubernetes networking when you use a LoadBalancer provided by the hosting provider. I use DigitalOcean. Kubernetes does not route in-cluster traffic through the LB's public interface, so the PROXY protocol header (or SSL, if you set it up outside Kubernetes) is never added. I use PROXY protocol: once I enable it and update Nginx to handle it, everything works, except that cert-manager fails because it tries to connect to the public domain name from inside the cluster. It works from my computer, since I am outside and the LB adds the needed headers, but not from within the cluster.

Cert-manager is not at fault here, but if we could add some switches to instruct the validator to use the PROXY protocol, instead of disabling validation for that domain, it would help some of us a lot.

For curl if I do (from inside the cluster):

curl -I https://myhost.domain.com

it fails.

If I do (from inside the cluster):

curl -I https://myhost.domain.com --haproxy-protocol

it works.

Check this: https://github.com/jetstack/cert-manager/issues/863
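For readers unfamiliar with the curl flag used in this comment: `--haproxy-protocol` makes curl send a PROXY protocol v1 line before anything else on the connection. The Python sketch below shows roughly what that looks like on the wire over plain HTTP; the function names are hypothetical, and the `https` case in the comment would additionally wrap the socket in TLS after the header is sent.

```python
import socket


def proxy_v1_header(src_ip: str, src_port: int,
                    dst_ip: str, dst_port: int) -> bytes:
    """Build a PROXY protocol v1 line: family, source/destination IPs, then ports."""
    family = "TCP6" if ":" in src_ip else "TCP4"
    line = f"PROXY {family} {src_ip} {dst_ip} {src_port} {dst_port}\r\n"
    return line.encode("ascii")


def http_get_with_proxy_header(host: str, port: int, path: str,
                               hostname: str, timeout: float = 5.0) -> bytes:
    """Send the PROXY line, then a plain HTTP/1.0 GET, and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        src_ip, src_port = sock.getsockname()[:2]
        dst_ip, dst_port = sock.getpeername()[:2]
        # The PROXY line must be the very first bytes on the connection.
        sock.sendall(proxy_v1_header(src_ip, src_port, dst_ip, dst_port))
        sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {hostname}\r\n\r\n".encode("ascii"))
        chunks = []
        while (data := sock.recv(4096)):
            chunks.append(data)
    return b"".join(chunks)
```

A listener configured to expect this header rejects connections that lack it, which matches the behaviour described here: requests arriving through the LB carry the header, while requests from inside the cluster do not.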

MichaelOrtho avatar Dec 17 '19 01:12 MichaelOrtho

I was informed by the DigitalOcean team that there is a fix for this behavior. They added an annotation for the nginx-ingress controller Service that forces Kubernetes to use a domain name for the load balancer instead of its public IP. That tricks Kubernetes into thinking the address is not "ours", so it routes the traffic out through the LB.

https://github.com/digitalocean/digitalocean-cloud-controller-manager/blob/master/docs/controllers/services/examples/README.md#accessing-pods-over-a-managed-load-balancer-from-inside-the-cluster

This is it (I just added this one):

kind: Service
apiVersion: v1
metadata: 
  name: nginx-ingress-controller
  annotations: 
    service.beta.kubernetes.io/do-loadbalancer-hostname: "hello.example.com"

MichaelOrtho avatar Dec 18 '19 14:12 MichaelOrtho

Hello, I want to bump this issue. My home cluster is behind NAT, and hairpinning is not possible with my current router. From outside, the ingress ports are fully available and working, but from inside they are not. I get this error:

Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://domain/.well-known/acme-challenge/ACME': Get http://domain/.well-known/acme-challenge/ACME: dial tcp 109.173.40.107:80: connect: connection timed out

The link is reachable via the internal address (for example, if I test from my PC).

Is there any way to specify an address for the self-check, or to just disable the self-check?

WhitePhoera avatar Jan 17 '20 12:01 WhitePhoera

It would also be nice if this could be disabled on a per-Certificate(Request) basis.

I have an issue with MetalLB + externalTrafficPolicy: Local where the cert-manager validator cannot access the solver since it's running on a different node than the "proxy" forwarding the requests to the solver.

Any thoughts on this?

MatthiasLohr avatar Feb 14 '20 11:02 MatthiasLohr

I have the same issue as @MatthiasLohr. I recently introduced MetalLB to our cluster and I wasn't expecting certificate requests to stop working.

Does anyone know any workarounds for this?

Note: I'd prefer to keep the self-check; it feels like a good thing to have. Maybe we could specify an IP address or Kubernetes Service that should be used instead? This would work for me, for example:

curl -H "Host: master.my-site.com.stage.example.com" nginx-external.ingress.svc.cluster.local/.well-known/acme-challenge/UQEly9jJVXURz9ggFx_6Ckrc4OKT0uBBMUr-3oDsvDA

But that assumes that all my certificate requests for that solver go through the same Ingress controller, of course.

EDIT: I assume it's overly complex (and something we don't want to do here) to look at the IP address and see if matches a loadBalancerIP in the cluster, and if it is, use the clusterIP instead?
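The lookup floated in the EDIT could be as small as the following Python sketch. The mapping from load-balancer IP to cluster IP is hypothetical input here; in a real cluster it would have to be built from Service objects, which this sketch does not attempt, and `pick_dial_ip` is an illustrative name only.

```python
import ipaddress
from typing import Mapping


def pick_dial_ip(resolved_ip: str, lb_to_cluster_ip: Mapping[str, str]) -> str:
    """Return the cluster IP if resolved_ip is a known load-balancer IP.

    Any other address is returned unchanged, so the self-check would dial
    external hosts exactly as before.
    """
    # Normalize so that equivalent spellings of the same address compare equal.
    normalized = str(ipaddress.ip_address(resolved_ip))
    return lb_to_cluster_ip.get(normalized, normalized)
```

The hard part is not the lookup but keeping the mapping correct, which is probably where the "overly complex" concern comes from.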

anton-johansson avatar Mar 02 '20 10:03 anton-johansson

Anton has volunteered to put a design document together for this feature! A big thank you - it'll be great to get input on this document once it's ready from those that require this feature! 😄

/assign @anton-johansson

munnerz avatar Mar 31 '20 12:03 munnerz

What does that mean? Any ETA, when this feature will be available?

MatthiasLohr avatar Mar 31 '20 15:03 MatthiasLohr

@MatthiasLohr I'm currently working on a design document where we can decide the best solution. It'll be up shortly.

anton-johansson avatar Mar 31 '20 15:03 anton-johansson

Thank you! Would be really nice to have this feature as soon as possible, currently the last thing required for a production setup... Trying a lot of workarounds but nothing is really reliable.

MatthiasLohr avatar Mar 31 '20 15:03 MatthiasLohr

I'll do my best to get this included in the v0.15.0 release.

anton-johansson avatar Mar 31 '20 15:03 anton-johansson

Awesome, thanks! If I can help somehow, please let me know.

MatthiasLohr avatar Mar 31 '20 15:03 MatthiasLohr

@MatthiasLohr @WhitePhoera I am in exactly the same situation. To work around this, I created an internal DNS zone with an entry matching the cert and pointed it to the IP address managed by MetalLB. It's by no means a long-term solution, but at least the certificate validated.

elisiano avatar May 09 '20 20:05 elisiano

I cannot find this feature in the helm chart options of v0.15.0? @anton-johansson did you implement the feature with the latest release?

derdrdirk avatar May 23 '20 10:05 derdrdirk