[ENH] - Investigate cert-manager

Open pavithraes opened this issue 1 year ago • 11 comments

Feature description

We can look into adding cert-manager: https://cert-manager.io

Value and/or benefit

It'll allow users to get certs from various providers (besides Let's Encrypt).

Anything else?

(Originally proposed by @costrouc)

pavithraes avatar Feb 08 '24 14:02 pavithraes

We need to decide which issuers to support (other than Let's Encrypt, which is listed under ACME). Any votes?

pt247 avatar Mar 14 '24 19:03 pt247

The implementation can be broken into several parts, one per Issuer, starting with Let's Encrypt. To resolve this ticket, we should have the following:

  1. Figure out the list of issuers we need to support to begin with, besides Let's Encrypt.
  2. A written road map of how we plan to incorporate Issuers. This should involve documenting any Nebari configuration/CLI changes if needed.
  3. Individual tickets for the first few Issuers we intend to support. We should list all high-level changes and dependencies.
  4. POC with Traefik (as ingress) + cert-manager (as certificate manager) + Let's Encrypt (as issuer) + http01 (as solver).

pt247 avatar Mar 18 '24 10:03 pt247

I tried deploying nebari with the following config:

...
...
certificate:
  type: existing
#  acme_email: [email protected]
#  acme_server: https://acme-v02.api.letsencrypt.org/directory
  secret_name: tls-whoami-ingress-http

Installed the cert-manager and its CRDs using the following command:

helm install cert-manager \
    --namespace dev \
    --version v1.14.4 jetstack/cert-manager \
    --set installCRDs=true 

Installed issuer:

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: default
spec:
  acme:
    email: [email protected]
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: quansight-issuer-account-key
    solvers:
     - http01:
        ingress:
          ingressClassName: traefik
       selector:
         dnsNames:
           - 'whoami.at.quansight.com'
EOF

Created a certificate:

kubectl apply -f - <<EOF
kind: Certificate
apiVersion: cert-manager.io/v1
metadata:
  name: nlb-lab-tls-cert
  namespace: dev
spec:
  commonName: at.quansight.dev
  dnsNames:
    - at.quansight.dev
  duration: 2160h0m0s
  issuerRef:
    name: default
    kind: ClusterIssuer
  renewBefore: 360h0m0s
  secretName: tls-whoami-ingress-http
EOF
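
As a rough illustration of what the duration and renewBefore fields above imply, the Python sketch below computes how long after issuance cert-manager would attempt renewal. The helper names are hypothetical, and it assumes the issuer honours the requested duration; an ACME issuer may issue a different lifetime than the one requested.

```python
import re
from datetime import timedelta

# Seconds per unit for the simple "NhNmNs" durations used in the manifest.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_go_duration(text: str) -> timedelta:
    """Parse a Go-style duration such as '2160h0m0s' (h, m, s units only)."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything that is not made up purely of <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts))


def renewal_offset(duration: str, renew_before: str) -> timedelta:
    """Time after issuance at which renewal starts: duration minus renewBefore."""
    return parse_go_duration(duration) - parse_go_duration(renew_before)


print(renewal_offset("2160h0m0s", "360h0m0s").days)  # 75
```

With the values in the Certificate above, a 90-day certificate would be renewed 15 days before expiry, that is, 75 days after issuance.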

I can see that the challenge is generated, but it is stuck on: Waiting for HTTP-01 challenge propagation: wrong status code '404', expected '200'

Deploying Nebari with the following change for self-signed:

diff --git a/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf b/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf
index 08bb5b2..641d040 100644
--- a/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf
+++ b/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf
@@ -2,6 +2,7 @@ locals {
   default_cert = [
     "--entrypoints.websecure.http.tls.certResolver=default",
     "--entrypoints.minio.http.tls.certResolver=default",
+    "--certificatesresolvers.default.acme.httpchallenge=true",
   ]
   certificate-settings = {
     lets-encrypt = [

From the ingress I can see that we have the following challenge URL: at.quansight.dev/.well-known/acme-challenge/WP3go40E6EB-RdyD_UKJpskvT5v5mHcDkqZmkiQ2Akk but it is not accessible from the internet.

This needs to be routed to the solver pod.
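
One way to check whether the challenge path reaches the solver pod is to request it from outside the cluster. The Python sketch below is a minimal, standard-library-only probe; the function names are hypothetical and it is not part of cert-manager or Nebari. It relies on the HTTP-01 convention that the token is served over plain HTTP under /.well-known/acme-challenge/.

```python
import urllib.error
import urllib.request


def challenge_url(domain: str, token: str) -> str:
    """Build the HTTP-01 challenge URL for a domain and token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url.

    HTTP error responses such as 404 are returned as codes rather than
    raised; network-level failures (DNS, refused connection) still raise.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Example (requires network access to the deployment):
# probe(challenge_url("at.quansight.dev", "<token from the Challenge resource>"))
```

A 200 here would suggest the route to the solver is in place, while a 404 matches the "wrong status code" message reported by cert-manager. Note that urlopen follows redirects, so an HTTP-to-HTTPS redirect at the ingress can change what this probe sees.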

Adding an ingress route:

kubectl apply -f - <<EOF
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: whoami
  namespace: dev
spec:
  entryPoints:
    - /.well-known
  routes:
  - kind: Rule
    match: Host(`whoami.at.quansight.dev`)
    priority: 10
    services:
    - kind: Service
      name: cm-acme-http-solver-9lnm7
      namespace: dev
      passHostHeader: true
      port: 80
      responseForwarding:
        flushInterval: 1ms
      scheme: https
      weight: 10
  tls:
    secretName: tls-whoami-ingress-http
    options:
      name: opt
      namespace: dev
    certResolver: default
    domains:
    - main: at.quansight.dev
EOF

pt247 avatar Mar 29 '24 16:03 pt247

I found this in the Traefik docs: [Screenshot 2024-05-15 at 18 42 56]. This explains why I could not get a Traefik IngressRoute to create the certificate automatically. I will need to create a temporary endpoint using a Kubernetes Ingress for the initial certificate creation.

cc: @dcmcand

pt247 avatar May 15 '24 17:05 pt247

Steps to get the cert-manager working:

  1. Start Nebari with the following config; note secret_name: tls-ingress-http.
$ tail nebari-config.yaml
  kubernetes_version: '1.29'
  region: eu-west-1
certificate:
  type: existing
  secret_name: tls-ingress-http
dns:
  provider: cloudflare
  auto_provision: true
  2. Install cert-manager.
 helm install cert-manager --namespace dev --version v1.14.5 jetstack/cert-manager --set installCRDs=true
  3. Apply the following config.
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cert-manager-cluster-issuer
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: quansight-issuer-account-key
    solvers:
     - http01:
        ingress:
          class: traefik
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nlb-lab-tls-cert
  namespace: dev
spec:
  commonName: pt.quansight.dev
  secretName: tls-ingress-http
  issuerRef:
    name: cert-manager-cluster-issuer
    kind: ClusterIssuer
  dnsNames:
    - pt.quansight.dev
EOF

pt247 avatar May 16 '24 11:05 pt247

Certificate configuration in nebari for lets_encrypt.

certificate:
  acme_email: [email protected]
  type: lets-encrypt

For self-signed certificates:

certificate:
  type: existing
  secret_name: tls-ingress-http

After upgrading to cert-manager, we have a few options:

  1. Maintain the same contract in the Nebari config but internally use cert-manager. The following will then use Let's Encrypt via cert-manager.
certificate:
  acme_email: [email protected]
  type: lets-encrypt

The following configuration will behave exactly as it does today.

certificate:
  type: existing
  secret_name: tls-ingress-http
  2. Add another explicit certificate type.
certificate:
  type: cert-manager
  issuer: lets-encrypt
  acme_email: [email protected]
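
To make the first option more concrete, here is a rough Python sketch of how the existing certificate block could be translated into a cert-manager ClusterIssuer behind the scenes. This is not Nebari's implementation: the function name, the issuer name, the account-key secret name and the example email are all assumptions made for illustration. The ACME server URL and the traefik ingress class are taken from the configs earlier in this thread.

```python
import json

# Production Let's Encrypt directory, as used in the ClusterIssuer above.
LETS_ENCRYPT_PROD = "https://acme-v02.api.letsencrypt.org/directory"


def cluster_issuer_for(certificate: dict, ingress_class: str = "traefik"):
    """Map a Nebari-style `certificate` block to a ClusterIssuer manifest.

    Returns None for `type: existing`, where the TLS secret is supplied by
    the user and no issuer is needed. Hypothetical helper, not Nebari API.
    """
    cert_type = certificate.get("type")
    if cert_type == "existing":
        return None
    if cert_type != "lets-encrypt":
        raise ValueError(f"unsupported certificate type: {cert_type!r}")
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": "cert-manager-cluster-issuer"},
        "spec": {
            "acme": {
                "server": certificate.get("acme_server", LETS_ENCRYPT_PROD),
                "email": certificate["acme_email"],
                # Assumed secret name for the ACME account key.
                "privateKeySecretRef": {"name": "acme-account-key"},
                "solvers": [
                    {"http01": {"ingress": {"ingressClassName": ingress_class}}}
                ],
            }
        },
    }


issuer = cluster_issuer_for({"type": "lets-encrypt", "acme_email": "admin@example.com"})
print(json.dumps(issuer, indent=2))
```

The printed JSON can be piped to kubectl apply -f -, since kubectl accepts JSON manifests as well as YAML. Under this approach the user-facing config stays exactly as it is today.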

pt247 avatar May 16 '24 12:05 pt247

"Maintain the same contract in Nebari config"

This makes more sense to me, users don't need to know what underlying tool is used to generate certificates. Ideally this upgrade would have no visible change for the user.

aktech avatar May 16 '24 13:05 aktech

I'm hopeful that cert-manager will help with https://github.com/nebari-dev/nebari/issues/2478 for self-signed certificates. https://github.com/nebari-dev/nebari/pull/2479 fixes that issue for custom certs but not for self-signed certs, since I haven't figured out how to grab Traefik's default self-signed cert easily.

Adam-D-Lewis avatar May 22 '24 14:05 Adam-D-Lewis

Interesting links:
Adding multiple solvers - https://cert-manager.io/docs/configuration/acme/#adding-multiple-solver-types
MetroStar cert-manager plugin - https://github.com/MetroStar/nebari-cert-manager/blob/main/src/nebari_plugin_cert_manager_chart/template/chart/templates/certificates.yaml

pt247 avatar May 29 '24 15:05 pt247

Just adding here that when we try to use Jupyter-scheduler on the NATO deployment (2024.6.1) it fails because of the self-signed cert:

HTTPSConnectionPool(host='nebari-static.cmreclimatechange.nsf', port=443): Max retries exceeded with url: /argo/api/v1/workflows/dev (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate (_ssl.c:1006)')))

Is there any hope for self-signed certs working in the future?
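
For context on the error above: the client rejects the connection because the self-signed certificate is not in its trust store. The Python sketch below shows the general mechanism for trusting a specific certificate with the standard library. It is not how Jupyter-scheduler or the Argo client is configured; the function names and the CA file path are hypothetical.

```python
import ssl
import urllib.request
from typing import Optional


def build_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a verifying TLS context.

    With ca_file=None the system trust store is used. Passing the path to
    a PEM file adds that certificate as a trust anchor, which is what a
    self-signed deployment needs.
    """
    return ssl.create_default_context(cafile=ca_file)


def fetch_status(url: str, ca_file: Optional[str] = None, timeout: float = 10.0) -> int:
    """Open url with certificate verification enabled and return the status code."""
    with urllib.request.urlopen(url, context=build_context(ca_file), timeout=timeout) as resp:
        return resp.status


# Example (hypothetical path to the exported self-signed certificate):
# fetch_status("https://<nebari-domain>/", ca_file="/path/to/self-signed.pem")
```

Hostname checking stays enabled, so this only works if the certificate's subject alternative names cover the domain being requested.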

rsignell avatar Sep 01 '24 14:09 rsignell

The issue remains open. #2499 is closed, but it can be referenced for future work.

dcmcand avatar Oct 23 '24 12:10 dcmcand