multicluster connectivity issue
What is the issue?
Here are the setup details for the 2 clusters - Master and Agent:
- Both clusters are on the default k3s setup, i.e., they come with the default Flannel, Traefik, etc.
- Tried in two setups - one on 2 RPIs and another on 2 VMs in Google Cloud - both report exactly the same issue
- MariaDB database as a StatefulSet on the Agent cluster
- Adminer UI on the Master cluster
- Linkerd with the multicluster extension has been installed in both clusters. The trust anchor is set up correctly as well
- Multicluster link with cluster name "agent" is created from the Agent cluster and applied to Master. All the linkerd checks pass, with linkerd mc check correctly showing the status
- MariaDB database (on the Agent cluster) has been annotated with the linkerd inject annotation. A label for the mirror is also added (see the sketch after this list)
- MariaDB service is correctly started and mariadb-svc-agent is visible in the Master cluster
- The Adminer UI does not connect to the mariadb-svc-agent service. It reports: unauthorized connection on server/linkerd-gateway. There should not be any unauthorized connection reported, since I can see that both apps, MariaDB and Adminer, are meshed (in the viz extension)
- Alternatively, if I install Adminer in the Agent cluster (the same one where MariaDB is installed), the connection to the direct MariaDB service mariadb-svc goes through fine. This proves that the connectivity between MariaDB and Adminer works fine.
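For reference, this is the export step that makes mariadb-svc-agent appear on Master (a minimal sketch; mirror.linkerd.io/exported is the standard Linkerd multicluster export label, and the service name and namespace are the ones from my setup):
# On the Agent cluster, label the service so the service-mirror on Master
# picks it up and creates the mirrored mariadb-svc-agent service there
kubectl --context=agent -n default label svc/mariadb-svc mirror.linkerd.io/exported=true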
How can it be reproduced?
Please check the setup details in the "What is the issue?" section above. Images used:
- mariadb
- adminer
Logs, error output, etc
[ 7192.934712s] INFO ThreadId(01) inbound:server{port=4143}:gateway{dst=mariadb-svc.default.svc.cluster.local:3306}: linkerd_app_inbound::policy::tcp: Connection denied server.group=policy.linkerd.io server.kind=server server.name=linkerd-gateway tls=Some(Established { client_id: Some(ClientId(Name("default.default.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: Some("transport.l5d.io/v1") }) client=10.42.0.1:14266
[ 7192.935000s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=unauthorized connection on server/linkerd-gateway client.addr=10.42.0.1:14266
Output of linkerd check -o short, both for l --context=master -o short and l --context=agent -o short:
Status check results are √
Also the agent connectivity is fine. Output of l --context=master mc gateways
CLUSTER ALIVE NUM_SVC LATENCY
agent True 2 3ms
Environment
Kubernetes Client Version: v1.24.4+k3s1
Kustomize Version: v4.5.4
Server Version: v1.24.4+k3s1
Cluster Environment: K3s running on 2 RPIs, each as a 1-node cluster. Also tested on K3s running on 2 VMs on Google Cloud, each as a 1-node cluster
Host OS: RPI - bullseye; Google Cloud: Ubuntu
Linkerd Version: stable-2.12.0
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
Hi @manju-rn! It looks like your linkerd-gateway is rejecting connections from the remote cluster as unauthorized. To debug this, the linkerd authz tool is useful:
> linkerd authz -n linkerd-multicluster deploy/linkerd-gateway
ROUTE  SERVER               AUTHORIZATION_POLICY  SERVER_AUTHORIZATION
*      linkerd-gateway                            linkerd-gateway
*      gateway-proxy-admin                        linkerd-gateway-probe
*      gateway-proxy-admin                        proxy-admin
This shows that the linkerd-gateway Server has a ServerAuthorization called linkerd-gateway. You should ensure that this resource exists and contains the right information:
> kubectl -n linkerd-multicluster get serverauthorization/linkerd-gateway -o yaml
[...]
spec:
client:
meshTLS:
identities:
- '*'
networks:
- cidr: 0.0.0.0/0
- cidr: ::/0
server:
name: linkerd-gateway
This says that all meshed traffic from any source should be authorized.
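Since a ServerAuthorization only authorizes traffic for the Server named in its spec.server.name, the referenced Server resource is worth inspecting in the same way (same namespace and resource names as above):
# Inspect the Server that the ServerAuthorization above refers to
kubectl -n linkerd-multicluster get server/linkerd-gateway -o yaml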
Thanks for the details. I shall check out the details later today. However, as this is a default install, I would have expected it to work out of the box. Are there some changes to be made to the ServerAuthorization objects during / after installation that I might have missed?
As I did multiple installations and confirmed that the linkerd checks were successful, I was under the impression that all the CRDs and core resources would be correct.
Yes, this should work out of the box and does in our testing. If you can provide a specific set of commands to reproduce the failure, we can investigate.
Okay. However, the out-of-the-box behaviour does not seem to be working in my setup. Here is the output of the server auth resource:
manju@rpi400:~/mariadb/k3s $ k -n linkerd-multicluster get serverauthorization/linkerd-gateway -o yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"policy.linkerd.io/v1beta1","kind":"ServerAuthorization","metadata":{"annotations":{"linkerd.io/created-by":"linkerd/cli stable-2.12.0"},"labels":{"app":"linkerd-gateway","linkerd.io/extension":"multicluster"},"name":"linkerd-gateway","namespace":"linkerd-multicluster"},"spec":{"client":{"meshTLS":{"identities":["*"]},"networks":[{"cidr":"0.0.0.0/0"},{"cidr":"::/0"}]},"server":{"name":"linkerd-gateway"}}}
linkerd.io/created-by: linkerd/cli stable-2.12.0
creationTimestamp: "2022-09-16T01:44:37Z"
generation: 1
labels:
app: linkerd-gateway
linkerd.io/extension: multicluster
name: linkerd-gateway
namespace: linkerd-multicluster
resourceVersion: "27024"
uid: 48e2cd1f-8896-4407-a60f-ba97356f5123
spec:
client:
meshTLS:
identities:
- '*'
networks:
- cidr: 0.0.0.0/0
- cidr: ::/0
server:
name: linkerd-gateway
I still get the same error:
[ 2490.197192s] INFO ThreadId(01) inbound:server{port=4143}:gateway{dst=mariadb-svc.default.svc.cluster.local:3306}: linkerd_app_inbound::policy::tcp: Connection denied server.group=policy.linkerd.io server.kind=server server.name=linkerd-gateway tls=Some(Established { client_id: Some(ClientId(Name("default.default.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: Some("transport.l5d.io/v1") }) client=10.42.0.1:14516
[ 2490.197465s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=unauthorized connection on server/linkerd-gateway client.addr=10.42.0.1:14516
I have attached the manifest files (attached with a log extension, please change it to yaml) for MariaDB and Adminer (UI for the DB). Here are the steps. The certificates for the root CA and the intermediate CA are generated via openssl, and the exact same ones are used while installing Linkerd in both clusters:
linkerd --context=master install --crds | kubectl --context=master apply -f -
linkerd --context=master install --identity-trust-anchors-file manjuca.crt --identity-issuer-certificate-file manjuissuer.crt --identity-issuer-key-file manjuissuer.key | kubectl --context=master apply -f -
linkerd --context=agent install --crds | kubectl --context=agent apply -f -
# Use the common trust anchor certificates
linkerd --context=agent install --identity-trust-anchors-file manjuca.crt --identity-issuer-certificate-file manjuissuer.crt --identity-issuer-key-file manjuissuer.key | kubectl --context=agent apply -f -
linkerd --context=master mc install | kubectl --context=master apply -f -
linkerd --context=agent mc install | kubectl --context=agent apply -f -
linkerd --context=agent mc link --cluster-name agent | kubectl --context=master apply -f -
# Deployed the mariadb in agent
# Deployed the adminer in master
# mariadb-svc-agent was properly created in master
# For Adminer UI - go to http://<ip:add>:9090
# provide the mariadb-svc-agent.default.svc.cluster.local in adminer with testadmin/testadmin as credentials
# Check logs of the linkerd-gateway pod (in agent)
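For the last step, a sketch of the log check, assuming the default linkerd-multicluster namespace and that the proxy container in the gateway pod is named linkerd-proxy:
# Tail the gateway proxy logs on the Agent cluster to see the denials
kubectl --context=agent -n linkerd-multicluster logs -f deploy/linkerd-gateway -c linkerd-proxy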
Just wondering if it is related to the CNI implementation. I will remove the default Flannel of K3s and try with Calico.
@adleong Any results from your findings on the setup I have?
I tried setting up Calico, but it looks like I have trouble starting the LoadBalancer and hence am unable to test the linkerd multicluster setup yet.
Thanks for the detailed reproduction instructions. We haven't had a chance to look into this yet.
Thanks. Let me know if any more details are required. I have also reproduced the same error with Calico as the CNI on the k3s clusters. This is just to remove the suspicion that the CNI (the default Flannel in k3s) may be at fault.
So I finally found the problem and fixed it. However, this is still a bug, since I think the default behaviour of Server is not honored.
The problem is that the Server component has a default value of proxyProtocol set as HTTP/1. Hence, it was not allowing the MQTT traffic. So changing proxyProtocol to unknown solved the issue for now. I will be finding out what other options work for MQTT to narrow it down.
> k --context agent -n linkerd-multicluster get server/linkerd-gateway -o yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
[...]
spec:
podSelector:
matchLabels:
app: linkerd-gateway
port: linkerd-proxy
proxyProtocol: HTTP/1
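For reference, the manual edit described above can also be done as a one-line patch (a sketch against the resource shown here; proxyProtocol: unknown makes the proxy detect the protocol instead of assuming HTTP/1):
# Switch the gateway Server on the Agent cluster to protocol detection
kubectl --context=agent -n linkerd-multicluster patch server linkerd-gateway \
  --type=merge -p '{"spec":{"proxyProtocol":"unknown"}}'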
However, as per the documentation (https://linkerd.io/2.12/reference/authorization-policy/), the proxyProtocol should have defaulted to unknown. @adleong Please check and confirm whether this is the case. Also, is there a way to set this up during the mc link creation?
Hi @manju-rn!
https://github.com/linkerd/linkerd2/pull/9575 has been merged. However, as we discussed, it's unlikely to be related to your issue. But it sounds like you're not experiencing the problem anymore? Is there any action to take here or should we close this issue?
As of linkerd version 2.12, the issue was there. As I explained in earlier posts, the Server component does not take the default value of unknown for proxyProtocol, as should happen per the docs. It was taking HTTP/1. I resolved it by manually editing the manifest. So if you are saying that this default behaviour was corrected in the new version, then yes, I will test it out.
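Once I reinstall, I will verify the default with something like the following (a sketch using kubectl's jsonpath output; same resource names as above):
# Print just the proxyProtocol of the gateway Server
kubectl --context=agent -n linkerd-multicluster get server/linkerd-gateway -o jsonpath='{.spec.proxyProtocol}'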