etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Supporting setting `LocalAddr` in peer communication

Open ahrtr opened this issue 1 year ago • 13 comments

What would you like to be added?

Currently in peer communication, the local address is automatically chosen by the golang standard lib,

https://github.com/etcd-io/etcd/blob/d8f077151da01d34162c5e6bdccac315777c4369/client/pkg/transport/transport.go#L33-L43

We should support setting a LocalAddr, which defaults to the advertise Peer address if enabled. It should be disabled by default to keep the existing behavior unless users intentionally enable it.

Why is this needed?

When accepting a connection from a peer, each etcd member checks the peer's remote address against the Subject Alternative Name (SAN) field in the peer's certificate, and reject the connection if they don't match. Please read https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication.

See also https://github.com/etcd-io/etcd/blob/d8f077151da01d34162c5e6bdccac315777c4369/client/pkg/transport/listener_tls.go#L192-L221

Sometimes the peer may has multiple IP addresses, even including dynamic (i.e. floating) IP addresses. Probably only part of the IP addresses are valid against the SAN field in the certificate. But if an IP address which doesn't match the SAN in certificate is chosen by the golang standard lib, then it will fail to communicate with other etcd members.

ahrtr avatar Dec 05 '23 11:12 ahrtr

@ahrtr Would like to work on this!

highpon avatar Dec 05 '23 23:12 highpon

Thanks @highpon .

ping @serathius @jmhbnz @wenjiaswe @fuweid for feedback.

ahrtr avatar Dec 06 '23 07:12 ahrtr

cc @aojea for opinion as well

ahrtr avatar Dec 18 '23 11:12 ahrtr

cc @aojea for opinion as well

I can see that there are multiple flags that declare addresses identifying the client, should you not use one of those since those are known to be reachable?

aojea avatar Dec 18 '23 21:12 aojea

cc @aojea for opinion as well

I can see that there are multiple flags that declare addresses identifying the client, should you not use one of those since those are known to be reachable?

Thanks for the feedback. Yes, I was planning to reuse the --initial-advertise-peer-urls if users enable this feature (disabled by default to keep the existing behaviour). I am not sure whether it's a best practice to set LocalAddr, and whether there are any potential issue. Any comment?

$ etcd \
  --name m1 \
  --listen-client-urls http://host1:2379 \
  --advertise-client-urls http://host1:2379 \
  --listen-peer-urls http://host1:2380 \
  --initial-advertise-peer-urls http://host1:2380

Another problem is that it's hard to verify it in workflow CI. For example, how to create multiple IP addresses on the environment.

ahrtr avatar Dec 18 '23 21:12 ahrtr

I am not sure whether it's a best practice to set LocalAddr, and whether there are any potential issue. Any comment?

In an etcd cluster there are some strong requirements or network connectivity between the peers, my experience is that every advanced networking knob like this can have also an undesired effect of providing a footgun for users that starts to implement complex scenarios just because they can or because they came with that idea and then they hit a problem and they start to ask for new knobs. I always prefer to have the use cases and document the scenarios with how the new knobs are supposed to be used before adding them.

aojea avatar Dec 18 '23 22:12 aojea

I always prefer to have the use cases and document the scenarios with how the new knobs are supposed to be used before adding them.

Yes, agreed on this. It's a real problem discovered in our testing when etcd and kube-vip coexist. Just as I mentioned in https://github.com/etcd-io/etcd/issues/17068#issue-2025936995, When an etcd member accepts a connection from a peer, it checks the peer's remote address against the Subject Alternative Name (SAN) field in the peer's certificate, and reject the connection if they don't match. It's exactly the reason why the LocalAddr matters here.

But when kube-vip coexist with etcd on the same host, kube-vip may dynamically add a floating IP to the host. Especially for the case of IPv6, it adds the IP to the beginning of the interface's IP list. Since etcd doesn't specify the LocalAddr when communicating with a peer, the floating IP is automatically chosen as the LocalAddr by the golang standard lib, eventually the connection is rejected by the peer.

I believe this is also a common issue. It applies the cases where the hosts on which the etcd runs have multiple IP addresses, especially dynamical floating IP.

Explicitly, it would be great if we can get feedback from network experts on:

  • In the beginning, we will disable the new feature by default, and etcd will not set a LocalAddr when communicating with a peer unless users intentionally enable it. But If there is no any concern or side effect, eventually we may want to enable it by default.

  • Another problem is how to automatically verify it in workflow CI using an e2e test case to prevent any regression. I don't see an easy way to mimic the case of the host having multiple IP addresses in workflow CI environment. Probably we can try unit test?

ahrtr avatar Dec 19 '23 10:12 ahrtr

that makes sense, then add just an option to define the localAddress for connecting to the apiserver, if is empty just don't set the localAddr field, this will make the change completely backwards compatible and you don't have to worry about the future.

Another problem is how to automatically verify it in workflow CI using an e2e test case to prevent any regressio

some ideas, if you are in the same node you can play with the localhost address 127.0.0.1 and the one in the node

  1. create a TCP listener that receives the connection and assert on the address you want
  2. test the negative case, verify that fails to connect if you set a wrong address

aojea avatar Dec 19 '23 14:12 aojea

you can play with the localhost address 127.0.0.1 and the one in the node

Thanks for the suggestion.

ahrtr avatar Dec 19 '23 15:12 ahrtr

@highpon Please follow comments below to update https://github.com/etcd-io/etcd/pull/17085. Thanks

  • Let's add an option --set-member-localaddr (bool), which defaults to false. If users enable it, then let's use the host in --initial-advertise-peer-urls as the LocalAddr when etcd communicates with a peer.

  • Follow comment above https://github.com/etcd-io/etcd/issues/17068#issuecomment-1862885759 to add an unit test. **play with the localhost address 127.0.0.1 and the one in the node**. We need to test both positive and negative cases.

    • Please also add an e2e test. We just need to verify positive case in e2e case to ensure it doesn't break anything. Specifically, we just need to set --set-member-localaddr to true in both 1-member and 3-members clusters, and ensure nothing is broken.

ahrtr avatar Dec 19 '23 16:12 ahrtr

@ahrtr Sure! Thanks for your help! I would like to update #17085 for your comment.

highpon avatar Dec 19 '23 20:12 highpon

I've tried to pick up the ball to implement these changes here: https://github.com/flawedmatrix/etcd/tree/fix/17068

One question I had was whether we only wanted to set LocalAddr to an IP address, or should we try to resolve the DNS name to an address from --initial-advertise-peer-urls?

~~The major problem I ran into was that I've been unable to get the e2e test to fail Mutual TLS when playing around with using the host IP address or 127.0.0.1. It seems the etcd nodes would only start up properly if all the IP addresses provided in --listen-peer-urls, --initial-advertise-peer-urls, and --initial-cluster matched. I've been unable to get the Golang library to infer LocalAddr to be 127.0.0.1 if I used the host IP for the three flags I mentioned.~~

~~@ahrtr Any thoughts as to how one could properly test this feature in an e2e test?~~

After revisiting this issue, I've resolved the issue of how to e2e test this feature.

flawedmatrix avatar Mar 09 '24 02:03 flawedmatrix

One question I had was whether we only wanted to set LocalAddr to an IP address, or should we try to resolve the DNS name to an address from --initial-advertise-peer-urls?

The answer should be the latter based on https://github.com/etcd-io/etcd/issues/17068#issuecomment-1863121505

ahrtr avatar Mar 29 '24 10:03 ahrtr