
better TLS related documentation & client auth

Open sfuerte opened this issue 3 years ago • 7 comments

What happened?

Although it is not advised, the Transport security model and Clustering Guide suggest that a shared TLS certificate can be used across all server nodes. For example, the following certificate is supposed to be accepted:

resource "acme_certificate" "etcd" {
  common_name               = aws_route53_zone.vdc_zone.name # or "etcd.${aws_route53_zone.vdc_zone.name}"
  subject_alternative_names = concat(
    [aws_route53_zone.vdc_zone.name],
    [for _idx in range(var.etcd_count) : "etcd${var.etcd_count > 1 ? tostring(_idx + 1) : ""}.${aws_route53_zone.vdc_zone.name}"]
  )
...

In fact, it does NOT work. The older issue #8603 may contain a bit more detail. Switching to individual certificates, where the CN equals etcd_advertise_fqdn and the SAN contains it along with etcd_discovery_srv_fqdn, solves the problem. The config template file is below.

Certificate generation:

resource "acme_certificate" "etcd" {
  count = var.etcd_count

  common_name        = "etcd${var.etcd_count > 1 ? tostring(count.index + 1) : ""}.${aws_route53_zone.vdc_zone.name}"
  subject_alternative_names = [
    aws_route53_zone.vdc_zone.name,
    "etcd${var.etcd_count > 1 ? tostring(count.index + 1) : ""}.${aws_route53_zone.vdc_zone.name}"
  ]
...
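
To confirm which names an issued certificate actually carries, its SAN entries can be inspected with openssl. This is only a minimal sketch and not something taken from the etcd docs: the helper name `cert_has_san` is hypothetical, and the check is an exact match (wildcard entries are not handled).

```shell
#!/bin/sh
# Hypothetical helper: cert_has_san CERT_FILE DNS_NAME
# Exits 0 if DNS_NAME is listed as a DNS entry in the certificate's
# Subject Alternative Name extension (exact match, wildcards not handled).
cert_has_san() {
  openssl x509 -in "$1" -noout -text \
    | grep -A1 'Subject Alternative Name' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "DNS:$2"
}
```

For example, `cert_has_san /opt/etcd/etc/ssl/tls_certificate.pem etcd1.example.com && echo present` would report whether that node's FQDN is among the SANs (the hostname here is a placeholder).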

Secondly, creating both SSL and plain-text DNS SRV records, i.e. _etcd-server-ssl._tcp.example.com AND _etcd-server._tcp.example.com, causes multiple "tls: first record does not look like a TLS handshake" errors (see #9917). Creating just the SSL one solves the issue.
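
A quick way to spot this situation is to look up both record sets and see whether both return something. The sketch below assumes `dig` is installed; the helper names `srv_records` and `srv_conflict` are hypothetical, and example.com stands in for the real discovery domain.

```shell
#!/bin/sh
# Hypothetical helpers; example.com stands in for the discovery domain.

# srv_records NAME: print the SRV records for NAME (empty if none).
srv_records() {
  dig +short SRV "$1"
}

# srv_conflict DOMAIN: exit 0 if both the TLS and the plain-text
# peer SRV record sets exist for DOMAIN, i.e. the combination
# reported above as producing the handshake errors.
srv_conflict() {
  ssl=$(srv_records "_etcd-server-ssl._tcp.$1")
  plain=$(srv_records "_etcd-server._tcp.$1")
  [ -n "$ssl" ] && [ -n "$plain" ]
}
```

Running `srv_conflict example.com && echo "both SRV record sets exist"` would then flag a domain that still has the plain-text records alongside the SSL ones.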

Lastly, serving HTTPS client traffic with client-cert-auth left at its default, disabled, setting (as per https://github.com/etcd-io/etcd/blob/main/etcd.conf.yml.sample) still requires the client to be authenticated; otherwise the connection fails with the following (insecure-skip-tls-verify doesn't help either):

$ sudo /opt/etcd/bin/etcdctl --write-out=table endpoint status --cacert=/opt/etcd/etc/ssl/tls_issuer.pem --endpoints=etcd1.local:2379  --cluster --insecure-skip-tls-verify --debug
INFO: 2022/06/05 23:32:43 [core] Subchannel picks a new address "etcd1.local:2379" to connect
INFO: 2022/06/05 23:32:43 [balancer] base.baseBalancer: handle SubConn state change: 0xc0002a11a0, CONNECTING
WARNING: 2022/06/05 23:32:43 [core] grpc: addrConn.createTransport failed to connect to {etcd1.local:2379 etcd1.local <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
WARNING: 2022/06/05 23:32:43 [core] grpc: addrConn.createTransport failed to connect to {etcd1.local:2379 etcd1.local <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
INFO: 2022/06/05 23:32:43 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2022/06/05 23:32:43 [balancer] base.baseBalancer: handle SubConn state change: 0xc0002a11a0, TRANSIENT_FAILURE
{"level":"warn","ts":"2022-06-05T23:32:43.837Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002d76c0/etcd1.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: bad certificate\""}
Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded

When client certs are supplied, it works as expected:

$ sudo /opt/etcd/bin/etcdctl --write-out=table endpoint status --cacert=/opt/etcd/etc/ssl/tls_issuer.pem --endpoints=etcd1.local:2379    --cert=/opt/etcd/etc/ssl/tls_certificate.pem --key=/opt/etcd/etc/ssl/tls_private_key.pem --cluster
+--------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://etcd1.local:2379                   | xxx              |   3.5.4 |   20 kB |     false |      false |         2 |          9 |                  9 |        |
| https://etcd3.local:2379                   | xxx              |   3.5.4 |   20 kB |     false |      false |         2 |          9 |                  9 |        |
| https://etcd2.local:2379                   | xxx              |   3.5.4 |   20 kB |      true |      false |         2 |          9 |                  9 |        |
+--------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

What did you expect to happen?

working as per current documentation

How can we reproduce it (as minimally and precisely as possible)?

see above

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

discovery-srv: "{{ etcd_discovery_srv_fqdn }}"

listen-peer-urls: "https://{{ mgmt_ip }}:2380"
listen-client-urls: "https://{{ mgmt_ip }}:2379,https://127.0.0.1:2379"

advertise-client-urls: "https://{{ etcd_advertise_fqdn }}:2379"

initial-cluster-state: "new"
initial-cluster-token: "etcd-{{ datacenter }}-cluster"
initial-advertise-peer-urls: "https://{{ etcd_advertise_fqdn }}:2380"

# Client-to-server communication
client-transport-security:
  cert-file: {{ ETCD_SSL }}/tls_certificate.pem
  key-file: {{ ETCD_SSL }}/tls_private_key.pem
  trusted-ca-file: {{ ETCD_SSL }}/tls_issuer.pem
  # client-cert-auth: false
...

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

sfuerte avatar Jun 05 '22 23:06 sfuerte

@sfuerte for the last question, could you please provide your detailed setup to help me understand it? The client-cert-auth option only helps in testing; in my opinion, you don't need to care about it in your production settings.

xiaods avatar Jun 19 '22 06:06 xiaods

> @sfuerte for the last question, could you please provide your detailed setup to help me understand it? The client-cert-auth option only helps in testing; in my opinion, you don't need to care about it in your production settings.

@xiaods, sorry for the delay. The problem we encountered with client-cert-auth is that it is disabled by default, but when TLS is enabled it gets forcibly enabled as well.

sfuerte avatar Jun 28 '22 20:06 sfuerte

@sfuerte Where are you setting the below? Within the config? I'm not seeing it as a flag here: https://etcd.io/docs/v3.1/op-guide/configuration/

"Switching to individual certificate where SAN equals to etcd_advertise_fqdn solves the problem"

deeco avatar Jun 28 '22 21:06 deeco

> @sfuerte Where are you setting the below? Within the config? I'm not seeing it as a flag here: https://etcd.io/docs/v3.1/op-guide/configuration/
>
> "Switching to individual certificate where SAN equals to etcd_advertise_fqdn solves the problem"

@deeco, scratch that one, it was my derailed train of thought after a long day of troubleshooting 🤦‍♂️ I just updated it above with the correct one:

Switching to individual certificates, where the CN equals etcd_advertise_fqdn and the SAN contains it along with etcd_discovery_srv_fqdn, solves the problem.

As for the config, you need to click on "Details" under "ETCD configuration" to expand it. Note that we don't use the CLI flags directly but rather a config file. Also, the link you posted is for the outdated 3.1 docs rather than the current 3.5 (https://etcd.io/docs/v3.5/op-guide/configuration/#configuration-file).

sfuerte avatar Jun 28 '22 21:06 sfuerte

@sfuerte for the last question, I notice you are missing a param: --insecure-transport=false. Give it a try.

xiaods avatar Jul 04 '22 03:07 xiaods

> @sfuerte for the last question, I notice you are missing a param: --insecure-transport=false. Give it a try.

@xiaods, for the server or for etcdctl? If the former, I believe it's not documented in the config flags at all. If the latter, I don't have etcdctl handy (and can't find its possible CLI args, as they're not posted in the docs), but AFAIR I tried that one and it didn't help.

sfuerte avatar Jul 04 '22 06:07 sfuerte

@sfuerte I searched the issue backlog and found this:

etcdctl --insecure-transport=false --insecure-skip-tls-verify <your command> will work.

refs: https://github.com/etcd-io/etcd/issues/11693

xiaods avatar Jul 04 '22 23:07 xiaods

Personally, I am a little bit puzzled by the etcd TLS documentation. I think it lacks specific requirements for serving, client and peer certificates. I think the concept of a "peer" certificate is a little bit uncommon, at least in the areas where I come from, which is mostly just HTTP(S).

I think that I would personally find it very helpful if the client/server/peer certificates were documented such as:

## Certificates

### Peer Certificate (--peer-cert-file) 
A basic peer certificate is defined by:
- Subject - it contains this and that
- Extensions -
    - EKU - Client Authentication, Server Authentication
    - SAN DNS - for resolvable hostnames of the machine where this certificate can be used
    - SAN IPs - for IPs of the machine where this certificate can be used
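
As a rough illustration of the layout proposed above, a throwaway certificate with those extensions could be produced with openssl. This is a sketch under assumptions, not a statement of what etcd actually requires: the helper name `make_peer_cert`, the output file names, and the sample hostname and IP are all hypothetical, the certificate is self-signed purely for demonstration, and `-addext` needs OpenSSL 1.1.1 or newer.

```shell
#!/bin/sh
# Hypothetical helper: make_peer_cert OUT_DIR DNS_NAME IP
# Writes a throwaway self-signed key/certificate pair carrying the
# extensions listed above (EKU client + server auth, SAN DNS, SAN IP).
# Self-signing is for illustration only; a real cluster would have
# the certificate signed by its CA.
make_peer_cert() {
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout "$1/peer-key.pem" -out "$1/peer.pem" \
    -subj "/CN=$2" \
    -addext "subjectAltName=DNS:$2,IP:$3" \
    -addext "extendedKeyUsage=serverAuth,clientAuth" 2>/dev/null
}
```

After something like `make_peer_cert /tmp etcd1.example.com 10.0.0.11`, running `openssl x509 -in /tmp/peer.pem -noout -text` shows the resulting EKU and SAN sections for comparison against whatever the documentation ends up specifying.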

I created a repository where I am trying to guess what the requirements for each of the certificates should be. It makes it easy to set up a small 3-node etcd cluster with Docker. By observing the logs, I tried to converge on what the expectations for the certs are. It took me 4 tries to get the certificates right: https://github.com/stlaz/etcd-certs/

I'd like to think that the above repository might be useful when trying to figure out what the documentation should be.

stlaz avatar Sep 29 '22 09:09 stlaz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 09:03 stale[bot]

@serathius I feel like the documentation still isn't clear on the specific extensions needed for the different certificates. Should this be reopened or tracked elsewhere?

figaw avatar Dec 15 '23 13:12 figaw