
DNS and Let's Encrypt certificates for OCaml

Open hannesm opened this issue 2 years ago • 38 comments

Dear Madam or Sir,

with huge interest I read through some of the issues in this repository. Thanks for being open and transparent about what you'd like to achieve.

In every other issue related to migrations, I see problems with Let's Encrypt certificates and the migration of services. The underlying reason, as far as I can tell, stems from the methodology of retrieving Let's Encrypt certificates: run "certbot" locally, which requires (a) a web server on port 80, (b) some ad-hoc configuration to serve static files, and (c) DNS changes being propagated for the desired hostname(s). This means a service can only retrieve its certificate once it is actually deployed to live. This also makes moving services (without downtime) hard.

Over the years, I have worked on automation (fully open source, fully developed in OCaml as MirageOS unikernels) to push the whole Let's Encrypt interaction into DNS (a secondary server on steroids), thus decoupling the actual service deployment from the certificate provisioning.

The idea is pretty simple: both certificates and signing requests are public data anyway (they're stored in the certificate transparency log, ...). DNS is a fault-tolerant key-value store. Each CSR and certificate is embedded as a TLSA record (https://www.rfc-editor.org/rfc/rfc6698.html) in DER encoding, i.e. no base64/PEM, just the bare minimum. Thanks to DNS TSIG we also have authentication (so not everyone may upload a CSR) ;).

The mechanism is as follows: the primary DNS server sends out DNS NOTIFY whenever the zone changes. The dns-letsencrypt-secondary observes the zone(s), and whenever a fresh CSR is detected (or a soon-expiring certificate, or a CSR without a matching certificate, i.e. a key rollover), the Let's Encrypt DNS challenge is used to provision a new certificate.

The services behind it just download the certificate (dig tlsa _letsencrypt._tcp.robur.coop), and only need to have their private key distributed.

The operator can use nsupdate to upload a new certificate signing request.

If you're interested in using such a system (and running your own DNS servers - of course you can keep Gandi's as the advertised/public ones), don't hesitate to reach out. I'm happy to help figure out how to work in that area. :)

hannesm avatar Jan 16 '23 14:01 hannesm

Thank you for suggesting this @hannesm, and of course for all the hard work you've done that has made this possible at all. I'm entirely supportive of the suggestion, as the HTTP process is a real pain to manage, but I would like to see the end-to-end process deployed somewhere other than ocaml.org first to ensure it's suitably mature.

@mtelvers, would you like to have a go at this on some other domain such as realworldocaml.org, and document it to your and @hannesm's satisfaction on infra.ocaml.org? Once it's documented and demonstrable elsewhere (particularly with respect to how to edit DNS zone files and so on, presumably via git), we will then need to present it to Xavier Leroy to get his permission to make the (big) change for the ocaml.org domain. I'm also tagging @ryangibb, who is interested in matters of OCaml and DNS and may want to assist.

avsm avatar Jan 18 '23 14:01 avsm

What I failed to mention in the initial issue is that such a setup with OCaml-DNS has been used for various domains for more than 5 years; mirage.io also works this way.

Also, a note on the Let's Encrypt challenge: getting a signed certificate does not require the private key of the certificate signing request :D This is why the setup works pretty nicely.
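To illustrate that point, here is a small sketch (not from this thread; all file and host names are placeholders): a CSR carries only the public key plus a self-signature, so it can be verified, and a certificate requested, without the private key ever leaving the service's machine.

```shell
#!/bin/sh
# Sketch: a CSR contains only the public key and a self-signature, so
# whoever performs the ACME interaction needs the CSR, never the
# private key. All names below are illustrative placeholders.
set -e
dir=${TMPDIR:-/tmp}/csr-demo; mkdir -p "$dir"
# The service operator generates the key pair and the CSR once:
openssl req -newkey rsa:2048 -nodes -keyout "$dir/service.key" \
  -subj "/CN=example.robur.coop" -out "$dir/service.csr" 2>/dev/null
# Only the CSR leaves the machine; its self-signature verifies
# without service.key being present:
openssl req -verify -noout -in "$dir/service.csr"
```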

I can certainly help with setting up a primary name server and keeping DNS zones in a git repository. I can also provide secondary name servers if desired.

hannesm avatar Jan 18 '23 14:01 hannesm

@avsm and thanks for your ~~support~~ interest.

hannesm avatar Jan 18 '23 14:01 hannesm

Hi @hannesm, I would be very interested in assisting with this if you could use my help.

I have a question, if that's okay: I'm wondering if there are possible vulnerabilities around spoofing of TLSA records. Reading RFC 6698:

This document defines a secure method to associate the certificate that is obtained from the TLS server with a domain name using DNS; the DNS information needs to be protected by DNSSEC.

As far as I understand, the OCaml-DNS resolver supports DNSSEC, but the authoritative server doesn't. Do you think this is an issue?

RyanGibb avatar Jan 25 '23 16:01 RyanGibb

@RyanGibb thanks for your offer.

From my observation, the current DNS deployment for ocaml.org / realworldocaml.org does not use DNSSec. Also, DNSSec integration into the authoritative servers (for OCaml-DNS) is on the agenda, and will be done this year.

As another point: yes, you can spoof TLSA records - but what is the attack vector? The service that uploads the CSR authenticates itself to the authoritative servers. The service that downloads the certificate checks that the public key in the certificate matches the private key it has. The certificate chain is checked to be valid (against the system trust anchors).
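That key-matching check can be done with openssl; a minimal sketch (a throwaway self-signed certificate stands in for the downloaded one, and all names are placeholders):

```shell
#!/bin/sh
# Sketch of the check described above: does the downloaded certificate
# belong to the private key we hold? A throwaway self-signed cert
# stands in for the downloaded one; names are placeholders.
set -e
dir=${TMPDIR:-/tmp}/keymatch-demo; mkdir -p "$dir"
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$dir/service.key" -subj "/CN=example.robur.coop" \
  -out "$dir/service.pem" 2>/dev/null
# Extract the public key from both sides and compare byte-for-byte:
openssl x509 -noout -pubkey -in "$dir/service.pem" > "$dir/cert.pub"
openssl pkey -pubout -in "$dir/service.key" > "$dir/key.pub" 2>/dev/null
if cmp -s "$dir/cert.pub" "$dir/key.pub"; then
  echo "certificate matches private key"
else
  echo "mismatch: refusing to deploy" >&2
  exit 1
fi
```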

I don't quite understand where DNSSec would be necessary, but maybe I'm failing to see the attack vector (@RyanGibb would you mind elaborating a bit more on what you mean by "Do you think this is an issue?"). Btw, the service can also ask the authoritative server directly for TLSA records (certificates).

I wonder whether @mtelvers has an opinion and/or time for diving into such a thing ("running authoritative DNS services"). There's an IETF document suggesting the use of anycast IP addresses for this; I don't do that myself, since I don't have sufficiently many machines and BGP speakers for such a setup -- but tbh it works fine with "just normal" IPv4 addresses ;)

hannesm avatar Jan 25 '23 21:01 hannesm

Hi @hannesm, thanks for your reply.

After reading https://hannes.nqsb.io/Posts/DnsServer I think I understand that the TLSA records are only used for distributing the CSRs and certificates between DNS servers and services in your solution, not for replacing a CA with the DNSSEC trust anchor as rfc6698 describes. Hence the TLSA RRs' location at _letsencrypt._tcp.robur.coop differs from https://www.rfc-editor.org/rfc/rfc6698#section-3. Is my understanding correct?

If so, I understand why DNSSEC isn't required. The CA (letsencrypt) is the root of trust. The TLSA records are just a convenient way of distributing the provisioned certificate. Apologies for my misunderstanding!

Also, DNSSec integration into the authoritative servers (for OCaml-DNS) is on the agenda, and will be done this year.

Aside from the letsencrypt DNS-01 challenge, that's great to hear :-)

RyanGibb avatar Jan 26 '23 17:01 RyanGibb

TLSA records are only used for distributing the CSRs and certificates between DNS servers and services in your solution, not for replacing a CA with the DNSSEC trust anchor as rfc6698 describes. Hence the TLSA RRs' location at _letsencrypt._tcp.robur.coop differs from https://www.rfc-editor.org/rfc/rfc6698#section-3. Is my understanding correct?

@RyanGibb yes, your understanding is correct.

Aside from the letsencrypt DNS-01 challenge, that's great to hear :-)

I'm not sure I understand what your comment means, would you mind explaining?

hannesm avatar Jan 26 '23 22:01 hannesm

@RyanGibb yes, your understanding is correct.

Great, thank you for confirming.

I'm not sure I understand what your comment means, would you mind explaining?

I just mean to say, despite DNSSEC not being required for the letsencrypt DNS challenge, it's good to know that it's on the agenda.

RyanGibb avatar Jan 27 '23 00:01 RyanGibb

Hi all, just to give a small update on this: I've created a nameserver, primarily targeting Unix, using the new effects-based IO library and the mirage OCaml-DNS library, which is able to perform dynamic UPDATEs authenticated using TSIG: https://github.com/RyanGibb/aeon/. I'm hoping to add support for the letsencrypt challenge to this nameserver directly (as opposed to running it in a separate process), simplifying the communication required.

RyanGibb avatar Feb 27 '23 12:02 RyanGibb

Dear @RyanGibb, thanks for your effort. But I'd really like to hear from @mtelvers what would be worth for OCaml infrastructure. And I opened this issue to explicitly understand whether using MirageOS unikernels would be possible/interesting for that infrastructure.

I feel rather torpedoed, and like the issue is being stolen by your "hey, look, I developed something new that supports some parts (certainly no NOTIFY, hasn't been tested for years on real domains, etc.) in this shiny new IO framework" -- especially since I've been doing the underlying DNS development since 2017. Your "add support for the letsencrypt challenge to this nameserver directly" is equally something that can be done in a MirageOS unikernel.

hannesm avatar Feb 27 '23 13:02 hannesm

My sincere apologies @hannesm. I in no way meant to torpedo this issue. You've been working on this for far longer than myself, and my contribution is a small layer using a different IO library. I just wanted to express my continued interest in this topic and share some work that I've been doing for a different project that relates to this issue. I should have been more clear on that.

RyanGibb avatar Feb 27 '23 13:02 RyanGibb

@hannesm @RyanGibb please do take a positive interpretation of each other's efforts. The world of managing OCaml infrastructure is small enough already without us driving each other away.

In my view, Ryan has been learning and reproducing Hannes' efforts, and that's appreciated. If you could perhaps split up your experiences with "reproducing the Mirage DNS stack" vs your own reimplementations on eio, that would be most useful for the knowledge sharing in this issue.

But let's wait for @mtelvers to comment on his plans first, and if he's not available, then make a wider call to the community for more assistance with reproducing the Mirage DNS stack on other domains.

avsm avatar Feb 27 '23 15:02 avsm

@hannesm I think your post may have been inspired by my convoluted implementation of Let's Encrypt certificates, which I used for https://github.com/ocaml/infrastructure/issues/19. This is not my preferred approach.

I prefer to use automatic provisioning, which is included in Caddy. In this case, that option was not available to me, as the requirement was for round-robin DNS: with round-robin, I could not guarantee the response would arrive at the requesting server. The natural resolution is to use the DNS challenge, but that was not available either, as DNS updates are administered manually by @avsm. Therefore, I switched to NGINX, which gave me the granular configuration required to redirect HTTP challenges to the originating server.

We would also need to agree on a hosting strategy for the unikernel to ensure a redundant deployment.

Reading through https://hannes.nqsb.io/Posts/DnsServer, under the Let's encrypt! section, how would we configure a reverse proxy such as Caddy or NGINX to request a certificate with an hmac-secret?

@avsm What are the success criteria? Or perhaps more importantly, what administrative controls need to be kept in place by the new solution?

If we can use DNS-01 Challenge rather than HTTP-01, then this would be worth implementing for the round-robin DNS for opam.ocaml.org. The alternative solution would be a Gandi API key which would delegate more access than just creating TXT records.

mtelvers avatar Feb 27 '23 23:02 mtelvers

how would we configure a reverse proxy such as (Caddy/NGINX) to request a certificate with an hmac-secret?

The initial CSR can be uploaded with nsupdate -y hmac-sha256:client._update:<b64-encoded-shared-secret> (where nsupdate is part of bind) -- my assumption is that "spawning new host names / services" requires human intervention anyway, and thus can be done by a human with the shared secret in their hands (along with the private/public key pair used to produce the CSR).
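To make that concrete, a hedged sketch of preparing such an upload (the server name, zone, and the TLSA parameter triple used here are illustrative placeholders; the values expected by dns-letsencrypt-secondary should be taken from its documentation):

```shell
#!/bin/sh
# Sketch: prepare an nsupdate batch that publishes a DER-encoded CSR
# as a TLSA record. Server name, zone, and the TLSA triple (255 0 0)
# are placeholders, not the values dns-letsencrypt-secondary mandates.
set -e
dir=${TMPDIR:-/tmp}/nsupdate-demo; mkdir -p "$dir"
openssl req -newkey rsa:2048 -nodes -keyout "$dir/host.key" \
  -subj "/CN=example.robur.coop" -outform der -out "$dir/host.csr" 2>/dev/null
# hex-encode the DER CSR for embedding in the record
csr_hex=$(od -An -v -tx1 "$dir/host.csr" | tr -d ' \n')
cat > "$dir/update.txt" <<EOF
server ns1.example.org
update add _letsencrypt._tcp.example.robur.coop. 3600 TLSA 255 0 0 $csr_hex
send
EOF
# The actual upload would then be (not executed in this sketch):
#   nsupdate -y hmac-sha256:client._update:<b64-shared-secret> "$dir/update.txt"
echo "prepared update.txt"
```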

The certificates can be downloaded by the service with the following shell script, e.g. via a cron job (since the certificate is guaranteed to be renewed 2 weeks before expiry):

#!/bin/sh

# Fetch the TLSA records for _letsencrypt._tcp.<hostname>, convert the
# DER certificates to PEM, verify the chain, and write out a PEM bundle.
# Usage: fetch-cert.sh <hostname> [nameserver]

set -e

hostname=$1

dig_opts=" +noquestion +nocomments +noauthority +noadditional +nostats"
if [ $# = 2 ]; then
    dig_opts="$dig_opts @$2"
fi
data=$(dig tlsa _letsencrypt._tcp.$hostname $dig_opts | awk '{if (NR>3){print}}' | cut -f 5- -d ' ' | sort | grep '^[03] 0 0' | cut -d ' ' -f 4- | sed -e 's/ //g')
file=

# convert a hex string argument to binary, appending to $file
hex_to_bin () {
    data=$(echo $@ | sed -e 's/\([0-9A-F][0-9A-F]\)/0x\1 /g')
    for hex in $data; do
        oct=$(printf "%o" $hex)
        if [ $oct = "0" ]; then
            printf "\0" >> $file
        else
            printf "%1b" $(echo '\0'$oct) >> $file
        fi
    done
}

cert_file=$(mktemp)
i=0
for cert in $data; do
    i=$(echo $i + 1 | bc)
    file=$(mktemp)
    hex_to_bin $cert
    openssl x509 -inform der -outform pem -in $file -out $cert_file.$i
    rm $file
done

# now mix and match, the final $i should be the leaf certificate
inter=$(mktemp)
last_inter=$(echo $i - 1 | bc)
for j in $(seq 1 $last_inter); do
    cat $cert_file.$j >> $inter
done

openssl verify -show_chain -verify_hostname $hostname -untrusted $inter $cert_file.$i

out=$hostname.pem

if [ -f $out ]; then
    out=$(mktemp)
fi

cat $cert_file.$i >> $out
cat $inter >> $out

rm -f $cert_file* $inter

echo "PEM bundle in $out"
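As an aside (not part of the original script): on systems where xxd is available, the hex_to_bin loop above could likely be replaced by a one-liner.

```shell
#!/bin/sh
# Sketch: xxd -r -p converts a hex string back to binary, doing in one
# step what the hex_to_bin helper above does with printf.
set -e
hex="48656c6c6f"                 # "Hello" as hex bytes
printf '%s' "$hex" | xxd -r -p   # prints: Hello
echo
```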

hannesm avatar Feb 28 '23 00:02 hannesm

@mtelvers wrote:

@avsm What are the success criteria? Or perhaps more importantly, what administrative controls need to be kept in place by the new solution?

Good question. There's one important missing piece in our current infrastructure: secrets management. We already have a bunch of keys lying around, and with the DNS infrastructure we will have even more, with the various nsupdate pieces. So I think we need to come up with some way to store and securely share the various private material (and ensure that there are robust administrative controls around it). Once we have that, I'm satisfied that we can manage the DNSKEYs and other pieces in this ticket well. Perhaps we should split this issue into a sub-ticket on secrets management, if you think that's useful? (And @hannesm, do you have anything in your box of Mirage deployment tricks that might help with that?)

avsm avatar Feb 28 '23 09:02 avsm

secrets management

Would you mind expanding on your requirements here?

As far as I can see, these are (when we consider self-hosted authoritative DNS):

  1. gandi.net passwords for access to domains (likely a private account, or credentials required by actual humans for changing things such as the authoritative NS). In a self-hosted DNS setup, maybe use a password manager that allows sharing via an encrypted file?
  2. secrets for e.g. uploading a CSR (when a new service is deployed, i.e. a new DNS record needs to be enrolled as well)
  3. secrets between machines (i.e. a secondary NS requires a shared secret for requesting a zone transfer from the primary)

Are there more types of secrets needed?

Certainly, the password manager (1) looks out of scope for this discussion. The secrets required by humans (2) to modify the zone file can be (a) access to a (private) git repository hosted on GitHub (as done by the mirage organization) or (b) a shared secret in the password manager.

For the secrets between machines (3), the current setup (for e.g. mirage.io) is: the git repository with the zone files contains the shared secrets (to communicate between primary and secondary servers). The primary DNS server has access to it (via an ssh key that is provided as a boot parameter (command line argument); the public part is registered with GitHub); the secondary DNS servers receive the shared secrets as boot parameters.
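For what it's worth, the shared secrets in (2) and (3) are just random bytes, base64-encoded; a sketch of generating one (the key name shown is a placeholder, and bind's tsig-keygen would do the same job):

```shell
#!/bin/sh
# Sketch: generate a TSIG-style shared secret. 32 random bytes suit
# hmac-sha256; the key name printed here is a placeholder only.
set -e
secret=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "personal._update.example.org. hmac-sha256 $secret"
```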

Now, on lifting the boot parameters (none of the below is implemented yet):

  • To avoid passing shared secrets around, there could be a LDAP service (+clients), but then the primary server (and secondary servers) would need to authenticate themselves to the LDAP - which again means they'd need some command line parameters.
  • Another option would be a DHCP server that provides (from a configuration file) the matching secrets to the unikernel(s). Then only the DHCP configuration file would contain the secrets, no more boot parameters (it also allows passing logging configuration etc. via DHCP).
  • A third option is to have a web service that acts as configuration manager (i.e. has the secrets and communicates with albatross). This would as well allow (in contrast to DHCP) managing unikernels with secrets on different hosts (certainly this configuration manager now should be secured, but it could very well be a unikernel that requires two-factor authentication (webauthn)).

Let me know what you think, and/or let's have a discussion (maybe a video meeting?) about other approaches (and about the concrete goals).

hannesm avatar Feb 28 '23 11:02 hannesm

I'd only add to the secrets list for ocaml.org: 4) SSH keys for the hosts themselves 5) Capnproto capability files for various services (only necessary if we hook this into the RPC infrastructure) 6) service database passwords for (e.g.) watch.ocaml.org's postgres

Ahead of any discussion, it would be good to have the current status of the secrets in the ocaml.org cluster written down @mtelvers, and we can converge on what missing gaps there are in terms of rolling out any change in DNS infrastructure.

avsm avatar Feb 28 '23 12:02 avsm

A typical infrastructure deployment uses OCaml services running internally (usually under Docker) with HTTPS offloaded to a reverse proxy. Since we need a reverse proxy, the most straightforward approach is to use Caddy, which is a reverse proxy and manages the certificates automatically. Here is the entire Caddy configuration file needed for a typical service:

www.ocaml.org {
	reverse_proxy www:8080
}

In this setup, Caddy resolves the challenges automatically via HTTP challenge using the DNS entries that @avsm creates.

For a more complex setup, such as one where, say, www.ocaml.org resolved to multiple addresses, the ideal approach would be to use the DNS-01 challenge. This is achieved like this (complete configuration file given):

{
	acme_dns gandi {env.GANDI_API_TOKEN}
}

www.ocaml.org {
	reverse_proxy www:8080
}

A typical invocation would be like this:

docker run -it --rm -e GANDI_API_TOKEN=__key__ -v ./Caddyfile:/etc/caddy/Caddyfile -v config:/config -v data:/data -p 80:80 -p 443:443 tuneitme/caddy

There is an outstanding issue to move deploy.ocamllabs.io to deploy.mirage.io. Currently this is deployed using Caddy exactly as described above. Perhaps we can use this as a test case to integrate your hmac script into Caddy? I also see that there is a Caddy module for hmac which may do what we need?

mtelvers avatar Feb 28 '23 14:02 mtelvers

I have not used caddy before, but I searched around a bit and I found this: https://github.com/caddy-dns/rfc2136

I'm not sure exactly how it works, and it may not be able to take advantage of the tricks in the letsencrypt secondary dns server, but I think it should work.

reynir avatar Feb 28 '23 15:02 reynir

@hannesm, I am working through your blog post, and some links may have been renamed/moved since it was reviewed in 2019. Can you help me locate these? Perhaps this is now released?

# git via ssh is not yet released, but this opam repository contains the branch information
$ opam repo add git-ssh git+https://github.com/roburio/git-ssh-dns-mirage3-repo.git

There is no branch future, but there is future-git? Can I substitute?

git clone -b future https://github.com/roburio/unikernels.git

mtelvers avatar Mar 01 '23 15:03 mtelvers

Dear @mtelvers, thanks a lot for your comment(s). Indeed, that changed a bit since the packages are now released. I'll work on revising that blog post.

The sources are now:

  • https://github.com/roburio/dns-primary-git -- the authoritative nameserver using a git remote
  • https://github.com/roburio/dns-secondary -- a secondary NS
  • https://github.com/roburio/dns-letsencrypt-secondary -- a secondary NS that cares about let's encrypt provisioning (TLSA records)

All these unikernels are as well available as reproducible binaries (hvt -- kvm) from our infrastructure:

  • https://builds.robur.coop/job/dns-primary-git
  • https://builds.robur.coop/job/dns-secondary
  • https://builds.robur.coop/job/dns-letsencrypt

hannesm avatar Mar 01 '23 16:03 hannesm

I have not used caddy before, but I searched around a bit and I found this: https://github.com/caddy-dns/rfc2136

I'm not sure exactly how it works, and it may not be able to take advantage of the tricks in the letsencrypt secondary dns server, but I think it should work.

Indeed, RFC2136 is the "dynamic updates for DNS" RFC, which is implemented by OCaml-DNS. And the configuration snippet from the link:

{
    "module": "acme",
    "challenges": {
        "dns": {
            "provider": {
                "name": "rfc2136",
                "key": "cWnu6Ju9zOki4f7Q+da2KKGo0KOXbCf6Pej6hW3geC4=",
                "key_name": "test",
                "key_alg": "hmac-sha256",
                "server": "1.2.3.4:53"
            }
        }
    }
}

is supposed to work directly. This would mean: (a) no need for dns-letsencrypt-secondary, and (b) enrolling your hmac secret with key and key_name available to caddy and dns-primary-git. I've not tested the interaction with caddy (but since it is RFC-specified, and works with bind, it should be fine).

hannesm avatar Mar 02 '23 10:03 hannesm

Relying on RFC2136 and using another interoperable bit of software like Caddy seems ideal here; well spotted @reynir. I'm hopeful that we'll eventually have a Caddy replacement in OCaml (I'm working on one on the side), but it'll obviously take some time to mature before being suitable for OCaml.org deployment.

avsm avatar Mar 02 '23 15:03 avsm

@hannesm I have made some progress, but it doesn't seem to work, and I am unsure where I am going wrong. I have a DNS server up and running with this command:

sudo ./_build/default/primary-git --remote=https://github.com/mtelvers/tunbury-uk-dns --ipv4=a.b.c.d/24 -l debug

However, it doesn't work when I try to test it with this (per your example).

$ host ns1.tunbury a.b.c.d
Using domain server:
Name: a.b.c.d
Address: a.b.c.d#53
Aliases: 

Host ns1.tunbury not found: 9(NOTAUTH)

The console output is

2023-03-02 15:30:01 +00:00: DBG [dns_server] from w.x.y.z received:header 8712 (query) operation Query rcode 
                                            no error flags: recursion desired
                                            question ns1.tunbury A?
                                            data query additional 
                                            EDNS no TSIG no
2023-03-02 15:30:01 +00:00: DBG [dns_mirage] udp: sending 29 bytes from 53 to w.x.y.z:16906

I get rate-limited by GitHub pretty quickly; therefore, I tried to use a local git server. I generated a key with awa_gen_key and put the public key in the git user's ~/.ssh/authorized_keys. I scanned the host with ssh-keygen, etc. On the command line, I added --authenticator=SHA256:xxx and --remote=ssh://[email protected]/tunbury-uk-dns.git. The --seed parameter now seems to be --ssh-key=rsa:seed. However, I get the error below. I tried various options, such as using the public IP rather than 127.0.0.1.

2023-03-02 15:41:12 +00:00: ERR [git-fetch] The Git peer is not reachable.
2023-03-02 15:41:12 +00:00: ERR [application] couldn't initialize git repository ssh://[email protected]/tunbury-uk-dns.git: error fetching: No connection found

My apologies; I am probably making some basic error.

mtelvers avatar Mar 02 '23 15:03 mtelvers

Thanks for your report, @mtelvers. I just pushed an update to the blog post.

To answer your trouble:

--remote=https://github.com/mtelvers/tunbury-uk-dns

should be --remote=https://github.com/mtelvers/tunbury-uk-dns.git I think

--authenticator=SHA256:xxx and --remote=ssh://[email protected]/tunbury-uk-dns.git. The --seed parameter now seems to be --ssh-key=rsa:seed. However, I get the error below. I tried various options, such as using the public IP rather than 127.0.0.1

Indeed that argument changed: awa_gen_key has a --keytype argument (with ed25519 or rsa) now; ssh-key= takes rsa:seed or ed25519:key; and --remote should be git@IPorHOST:path.git (no more ssh://, and a : instead of / after the IPorHOST)

Hope that helps.

hannesm avatar Mar 02 '23 17:03 hannesm

@hannesm Thank you. The local git repository is now working. However, it is still not resolving names. I'll have another look in the morning. Thanks again.

mtelvers avatar Mar 02 '23 18:03 mtelvers

@mtelvers if you pass -l \*:debug, you'll see quite a few log messages. But it may be worth testing the zone file first (opam install dns-cli and ozone /path/to/zonefile).

hannesm avatar Mar 02 '23 18:03 hannesm

Success! The remote must include the branch, even when there is only one branch, called master. Thus this works:

sudo ./_build/default/primary-git [email protected]:tunbury-uk-dns.git'#master'

mtelvers avatar Mar 03 '23 09:03 mtelvers

Happy to hear it works! :partying_face:

It defaults to trying branch main if not specified :-)

reynir avatar Mar 03 '23 09:03 reynir

Domain tunbury.uk is now using dns-primary-git as the name server, and the certificate for https://www.tunbury.uk was deployed directly from Caddy using RFC2136 and a DNS-01 challenge. I'll document the steps.

mtelvers avatar Mar 03 '23 16:03 mtelvers