traefik icon indicating copy to clipboard operation
traefik copied to clipboard

Traefik gets stuck on HTTP-01 ACME challenge

Open jacksgt opened this issue 2 years ago • 18 comments

Welcome!

  • [X] Yes, I've searched similar issues on GitHub and didn't find any.
  • [X] Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What did you do?

I'm setting up a HTTP-01 based certificate resolver with Let's Encrypt for Traefik. I'm using Traefik v2.6.1.

Unfortunately, external endpoints are reporting that they cannot connect to the server due a timeout, see for example https://letsdebug.net/u9k.de/917912 I have made sure this is not a firewall issue and not an IPv6 issue. The timeout ONLY occurs when hitting the example.com/.well-known/acme-challenge/example123 route, not any other routes (e.g. example.com and example.com/.well-known/ return immediately).

What did you see instead?

When I'm cranking up the logs to DEBUG and manually fetching the challenge endpoint, I can see the following logs lines:

curl -vv -4 http://u9k.de/.well-known/acme-challenge/foobar
*   Trying 65.108.188.114:80...
* Connected to u9k.de (65.108.188.114) port 80 (#0)
> GET /.well-known/acme-challenge/foobar HTTP/1.1
> Host: u9k.de
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Date: Fri, 18 Feb 2022 07:55:36 GMT
< Content-Length: 0
< 
* Connection #0 to host u9k.de left intact
time="2022-02-18T07:45:26Z" level=debug msg="Retrieving the ACME challenge for u9k.de (token \"foobar\")..." providerName=acme
time="2022-02-18T07:46:18Z" level=error msg="Cannot retrieve the ACME challenge for u9k.de (token \"foobar\"): cannot find challenge for token \"foobar\" (u9k.de)" providerName=acme

Note the time between these two lines: it takes Traefik ~45 seconds to reply to this request.

It seems like something is holding an extensive lock in challenge_http.go and the getTokenValue function (which is called from ServeHTTP) is getting stuck there: https://github.com/traefik/traefik/blob/764bf59d4dfff2187cafa319e2b127c7e29fb3d5/pkg/provider/acme/challenge_http.go#L111

What version of Traefik are you using?

level=info msg="Traefik version 2.6.1 built on 2022-02-14T16:50:25Z"

What is your environment & configuration?

I'm using the official Traefik Helm chart to deploy Traefik on Kubernetes. Since I don't think the chart is relevant for this question, I'm posting the Traefik configuration the chart generates - note that these ports are internal and 8000/8443 are port-forwarded from 80/443 externally.

        - --global.checknewversion
        - --global.sendanonymoususage
        - --entryPoints.metrics.address=:9100/tcp
        - --entryPoints.traefik.address=:9000/tcp
        - --entryPoints.web.address=:8000/tcp
        - --entryPoints.websecure.address=:8443/tcp
        - --api.dashboard=true
        - --ping=true
        - --metrics.prometheus=true
        - --metrics.prometheus.entrypoint=metrics
        - --providers.kubernetescrd
        - --providers.kubernetesingress
        - --entrypoints.websecure.http.tls=true
        - [email protected]
        - --certificatesResolvers.le.acme.storage=/data/acme.json
        - --certificatesResolvers.le.acme.httpChallenge.entryPoint=web
        - --certificatesResolvers.le.acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory
        - --log.level=DEBUG
        - --providers.kubernetesingress.throttleDuration=15s
        - --providers.kubernetesIngress.allowEmptyServices=true

If applicable, please paste the log output in DEBUG level

Log output:

time="2022-02-18T07:54:52Z" level=debug msg="Unable to split host and port: address u9k.de: missing port in address. Fallback to request host." providerName=acme
time="2022-02-18T07:54:52Z" level=debug msg="Retrieving the ACME challenge for u9k.de (token \"foobar\")..." providerName=acme
time="2022-02-18T07:54:52Z" level=error msg="Error getting challenge for token retrying in 603.851468ms" providerName=acme
time="2022-02-18T07:54:53Z" level=error msg="Error getting challenge for token retrying in 759.621068ms" providerName=acme
time="2022-02-18T07:54:53Z" level=error msg="Error getting challenge for token retrying in 1.437936942s" providerName=acme
time="2022-02-18T07:54:55Z" level=error msg="Error getting challenge for token retrying in 2.485258498s" providerName=acme
time="2022-02-18T07:54:57Z" level=error msg="Error getting challenge for token retrying in 1.902903522s" providerName=acme
time="2022-02-18T07:54:59Z" level=error msg="Error getting challenge for token retrying in 2.616926306s" providerName=acme
time="2022-02-18T07:55:02Z" level=error msg="Error getting challenge for token retrying in 5.436445884s" providerName=acme
time="2022-02-18T07:55:07Z" level=error msg="Error getting challenge for token retrying in 7.213840431s" providerName=acme
time="2022-02-18T07:55:15Z" level=error msg="Error getting challenge for token retrying in 10.993923627s" providerName=acme
time="2022-02-18T07:55:26Z" level=error msg="Error getting challenge for token retrying in 10.487972011s" providerName=acme
time="2022-02-18T07:55:36Z" level=error msg="Cannot retrieve the ACME challenge for u9k.de (token \"foobar\"): cannot find challenge for token \"foobar\" (u9k.de)" providerName=acme

jacksgt avatar Feb 18 '22 08:02 jacksgt

Hello @jacksgt,

Thanks for your interest in Traefik!

The HTTP challenge requires the port of the entry point specified on the resolver challenge configuration to be 80 (the documentation is mentioning it).

This is why the challenge cannot proceed further and timed out.

For help with configuration, please join our Community Forum and reach out to us on the Traefik section.

We will close this issue accordingly but feel free to re-open it if you think that we missed something.

rtribotte avatar Feb 18 '22 08:02 rtribotte

@rtribotte Traefik is only configured to listen on 8000, because it's running in a Kubernetes pod (I'm using the official Traefik Helm chart). Sorry for not mentioning this in my original post - I have updated it.

Traefik is exposed to the outside on Port 80 - you can check this yourself with curl -i http://u9k.de.

Please reopen this issue.

jacksgt avatar Feb 18 '22 09:02 jacksgt

Hello @jacksgt,

Unfortunately, external endpoints are reporting that they cannot connect to the server due a timeout, see for example https://letsdebug.net/u9k.de/917912

As per the error report you shared, it seems that port 80 is not accessible from Let's Encrypt. It could come from a lot of things, maybe there is something wrong with the load balancing configuration.

The timeout ONLY occurs when hitting the example.com/.well-known/acme-challenge/example123 route, not any other routes (e.g. example.com and example.com/.well-known/ return immediately).

What do you mean by the timeout occurs when hitting a specific path?

Can you provide the full debug log (from Traefik startup)?

rtribotte avatar Feb 18 '22 16:02 rtribotte

Hi @rtribotte , thanks for taking a look again.

As per the error report you shared, it seems that port 80 is not accessible from Let's Encrypt. It could come from a lot of things, maybe there is something wrong with the load balancing configuration.

The port is definitelty accessible, verified with multiple external website testing tools. You can also try it yourself: http://u9k.de:80.

The timeout ONLY occurs when hitting the example.com/.well-known/acme-challenge/example123 route, not any other routes (e.g. example.com and example.com/.well-known/ return immediately).

What do you mean by the timeout occurs when hitting a specific path?

  • u9k.de/ => returns immediately (in my case redirect to HTTPS)
  • u9k.de/.well-known => returns immediately (404)
  • u9k.de/any-other-path => returns immediately
  • u9k.de/.well-known/acme-challenge/TOKEN => hangs for a long time with the log shown above. In fact it hangs so long that most clients (e.g. Let's Encrypt and Let's debug) consider it a "timeout".

Can you provide the full debug log (from Traefik startup)?

Unfortunately that's a bit difficult, because there are various other things going on in the same Traefik instance, so it's rather noisy.

jacksgt avatar Feb 18 '22 19:02 jacksgt

Getting a similar errror with the "latest" version. Got pulled automatically and I basically woke up to my site being broken... Only fix was going back to 2.6.0.

time="2022-02-22T19:23:12Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2022-02-22T19:23:35Z" level=error msg="Error renewing certificate from LE: {example.com [*.example.com]}, error: one or more domains had a problem:\n[example.com] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: During secondary validation: Incorrect TXT record \"K_NYS08yis0aQjTCckDCiwZBwi-7CnmSqcmhaqK31RI\" found at _acme-challenge.example.com\n" providerName=cfresolver.acme ACME CA="https://acme-v02.api.letsencrypt.org/directory"

hexxone avatar Feb 22 '22 20:02 hexxone

Hello @hexxone,

The issue you are facing is not the same, per the error message, it seems that you are using DNS-01 challenge and not HTTP-01 challenge.

kevinpollet avatar Feb 23 '22 08:02 kevinpollet

I think you have a workaround to this issue (which is really a k8s issue I think) described in https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/2601#issuecomment-1050406952, and a feature request to resolve this more properly via https://github.com/traefik/traefik/issues/8803.

With the workarounds, you should be able to conclude if it is or isn't a k8s issue.

consideRatio avatar Feb 25 '22 00:02 consideRatio

Hi, I add my comment to this issue because #8803 is already closed.

I have had the exact same issues as described in this and the already mentioned other issue: when traefik starts and there are certificate challenges to issue, they are being requested very early in the lifetime of the traefik process. If the requests coming from letsencrypt cannot be answered by traefik yet, the above mentioned error happens.

I have now debugged for several days, and also stumbled over these two issues. I have tried the workarounds by adding an init container that waits 20 seconds before starting the actual traefik container, and I have even rebuilt the traefik image to sleep 20 seconds before the traefik process is being started, both without any success.

After analyzing with tcpdump I had the idea that brought me the understanding: in my specific setup (k3s with k3s' included "serviceLB" and exposed traefik using a LoadBalancer service, which should be pretty much standard, as it is the default setup in K3S out of the box), the TCP port of the Loadbalancer service (80, ...) becomes responding to a TCP connect not before 7 seconds after Traefik process (PROCESS, not container, nor pod) started.

This means, that for me, and for all others using the same setup, LetsEncrypt ACME works, but those certificates that were open to be issued and traefik has been restarted (config changes, initial setup) will never be issued. This has cost me huge amount of time, and I am sure others will have the same issue.

I understand that built-in "delays" of any kind are not a beautiful solution, but they could solve a huge problem. Another possible idea would be to retry challenges which ended with the 400 error after some configurable time (once, not periodically, because of LE rate limit).

PLEASE, we really need a solution for this problem. It is common in Kubernetes and it is NOT in all setups solvable by adding an init container to wait a bit.

Thanks

sa-ChristianAnton avatar May 25 '22 20:05 sa-ChristianAnton

I came across exactly the same problem in GKE. The initcontainer workaround doesn't seem to work because apparently the LoadBalancer service or whatever is needed to forward the requests to Traefik seems to depend on the pod being ready.

I had some more luck and it works from time to time by removing the readinessProbe to get the pod to "Ready" state immediately, but again, it is failing quite often.

airadier avatar Jul 01 '22 16:07 airadier

By default, the helm chart configures the traefik Service as a load balancer, and configures the readiness probe on the pods with an initial delay of 10 seconds. Traefik attempts the ACME challenge within those 10s, so the request from LE never connects, and the LB doesn't seem to complete the connection once ready. So the "

I am able to get certificate requests working consistently by removing the readiness probe from the deployment. This causes the LB to immediately allow traffic to traefik. I'm not sure if this is best-practice, but since everything is unreachable if traefik is, I'm not sure what value it provides in this case.

I deploy the helm chart with Kustomize, and use this patch to remove the probe:

patches:
  - patch: |-
      - op: remove
        path: /spec/template/spec/containers/0/readinessProbe
    target:
      version: v1
      kind: Deployment
      name: dev-traefik # Your release name

alexjball avatar Sep 05 '22 16:09 alexjball

Well, I'm having the very same issue with plain docker-compose. No kubernetes. I have re read this thread many times, but I don't see a solution. Did I miss something?

By "same issue", I mean requests to /.well-known/acme-challenge/XXXXX timeout, all the rest works.

The docker-compose is straight from the photoprism template https://docs.photoprism.app/getting-started/proxies/traefik/

kubrickfr avatar Jan 23 '23 22:01 kubrickfr

Hey everyone, can you share your environment and steps to reproduce?

The symptoms y'all have indicate to us that it is unlikely there is a bug on the ACME HTTP Challenge. Sharing a minimal but thorough reproducible case would let us verify this and potentially help you out even if there is not a bug.

tfny avatar Jan 26 '23 15:01 tfny

Well, I'm having the very same issue with plain docker-compose.

Me too. Fresh setup, no idea what's wrong at the moment. There are no load balancers. It even happens when I do it on the machine via localhost:

frederik@server:~$ curl http://localhost/.well-known/acme-challenge/TOKEN
^C # Takes forever
frederik@server:~$ curl http://localhost/
404 page not found # Returns immediately

Traefik is running in docker compose and listening on port 80 (and 443)

Edit: Ah my problem was an active https redirect on the traefik service itself. After disabling the redirect it worked.

FrederikP avatar Feb 14 '23 18:02 FrederikP

Can confirm that Traefik needs a long time to return a request for non existing acme challenges.

The problem for me is I get spammed daily by bots that trigger random acme challenges. This results in hundreds of validation errors in my log files.

This currently makes validation errors completely pointless because I can't differentiate between real ones from Let's Encrypt and fake requests from bots.

In my opinion Traefik should not output failed validation errors for acme challenges that Traefik never initiated. These requests should just be ignored and return 404 or something like that instead of error messages.

https://www.abuseipdb.com/check/3.144.155.159 https://www.abuseipdb.com/check/54.149.111.225 https://www.abuseipdb.com/check/3.129.66.5

These are just a couple of ip adresses that spam me. Every single one of them has been reported for this exact behaviour. I currently have a list of 12 IP Adresses all belonging to Amazon that I collected over the weekend.

To reproduce this simply make a GET request to: domain.tld/.well-known/acme-challenge/some-random-token-it-doesnt-matter

Logs: Traefik logs showing the error messages

Edit: The request I made to show the log output here took 53.06s to return a 404

flopana avatar Jul 24 '23 10:07 flopana

Hello @flopana,

It seems that you are describing an issue that is not related to this one. Can you please open a new issue?

rtribotte avatar Jul 24 '23 12:07 rtribotte

@rtribotte sure

flopana avatar Jul 24 '23 13:07 flopana