
Envoy Gateway cannot route to a Service mirrored by Linkerd

Open ferdinandosimonetti opened this issue 1 year ago • 5 comments

[issue.zip](https://github.com/linkerd/linkerd2/files/13821092/issue.zip)

What is the issue?

I have two clusters on AKS: both have Envoy Gateway (with two Gateways each, one public and one private) and Linkerd (with multicluster) installed.

I've meshed the Envoy Gateway installation and the Gateways (by adding the appropriate annotation in the EG Helm values file and in the Gateway definition YAML).
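
Roughly, the meshing looks like this (illustrative excerpt only: the exact values path depends on the gateway-helm chart version, and how the annotation reaches the generated Envoy proxy pods depends on the Envoy Gateway release; the real envoy-values.yml is in the attached ZIP):

# envoy-values.yml -- illustrative excerpt; the values path is an assumption
deployment:
  pod:
    annotations:
      linkerd.io/inject: enabled   # Linkerd injects its proxy sidecar into these pods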

On cluster1 I've defined an HTTPRoute with a backendRef; it works perfectly when it points to a local Service (running on cluster1), and it fails when the destination is an exported Service running on cluster2 and reachable via Linkerd's multicluster feature.

The same exported Service is reachable without any problem from a test Deployment (a basic Debian container with curl installed) running on cluster1 and meshed with Linkerd via the same annotation.

How can it be reproduced?

Every referenced YAML file can be found inside the attached ZIP.

Create two clusters on AKS (version 1.27.7).

Install Reflector on both

helm repo add emberstack https://emberstack.github.io/helm-charts
helm repo update
helm upgrade --install reflector emberstack/reflector --namespace kube-system

Install Cert-Manager on both

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.13.3 --set installCRDs=true --create-namespace

Create a wildcard certificate via Cert-Manager + LetsEncrypt on both clusters (mine is backed by Cloudflare DNS)

kubectl apply -f cm-secret.yml
kubectl apply -f cm-clusterissuer.yml
kubectl apply -f cm-wildcard.yml
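
Roughly, the manifests look like this (illustrative sketch only: issuer and secret names, the e-mail, and the Reflector annotations are assumptions; the real files are in the attached ZIP):

# cm-clusterissuer.yml (illustrative): ACME issuer solving DNS-01 via Cloudflare
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token   # created by cm-secret.yml (assumed name)
            key: api-token
---
# cm-wildcard.yml (illustrative): the wildcard certificate, with Reflector
# annotations so the resulting Secret can be mirrored into other namespaces
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-fsimonetti-info
  namespace: default
spec:
  secretName: wildcard-fsimonetti-info-tls
  dnsNames:
  - "*.fsimonetti.info"
  issuerRef:
    name: letsencrypt-dns
    kind: ClusterIssuer
  secretTemplate:
    annotations:
      reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
      reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"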

Install Envoy Gateway on both (via Helm Chart, latest version)

helm upgrade --install eg oci://docker.io/envoyproxy/gateway-helm --version v0.0.0-latest -n envoy-gateway-system --create-namespace --values envoy-values.yml

Define GatewayClass and Gateway(s).

[aks1-admin|default] ferdi@u820:~ kubectl apply -f 0-gatewayclass.yml
[aks1-admin|default] ferdi@u820:~ kubectl apply -f aks1-1-gateway.yml

[aks2-admin|default] ferdi@u820:~ kubectl apply -f 0-gatewayclass.yml
[aks2-admin|default] ferdi@u820:~ kubectl apply -f aks2-1-gateway.yml
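
The definitions look roughly like this (illustrative sketch only: the GatewayClass name, listener details, and TLS secret name are assumptions; the real 0-gatewayclass.yml and aks1-1-gateway.yml, including the private Gateway, are in the attached ZIP):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: webext1
  namespace: default
  annotations:
    linkerd.io/inject: enabled   # see the meshing note above; exact placement is an assumption
spec:
  gatewayClassName: eg
  listeners:
  - name: test1                  # referenced by sectionName in the HTTPRoute below
    protocol: HTTPS
    port: 443
    hostname: "*.fsimonetti.info"
    tls:
      mode: Terminate
      certificateRefs:
      - name: wildcard-fsimonetti-info-tls   # assumed: the cert-manager wildcard Secret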

Install Linkerd CRDs via Helm Chart (except HTTPRoute) on both

helm repo add linkerd https://helm.linkerd.io/stable
helm repo update

helm install linkerd-crds linkerd/linkerd-crds \
 -n linkerd --create-namespace \
 --set enableHttpRoutes=false

Install Linkerd on both clusters, using the same trust anchor and issuer certificates/keys previously created, as described here

helm install linkerd-control-plane \
 -n linkerd \
 --set-file identityTrustAnchorsPEM=linkerd/ca.crt \
 --set-file identity.issuer.tls.crtPEM=linkerd/issuer.crt \
 --set-file identity.issuer.tls.keyPEM=linkerd/issuer.key \
 linkerd/linkerd-control-plane

Install Linkerd multicluster on both clusters

helm install linkerd-multicluster -n linkerd-multicluster --create-namespace linkerd/linkerd-multicluster --values values-multicluster.yml

Link together the clusters

[aks1-admin|default] ferdi@u820:~$ linkerd multicluster link --cluster-name=aks1 > link-aks1.yml

[aks2-admin|default] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ linkerd multicluster link --cluster-name=aks2 > link-aks2.yml

[aks1-admin|default] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ kubectl apply -f link-aks2.yml

[aks2-admin|default] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ kubectl apply -f link-aks1.yml  

Create a meshed microservice on cluster2 and export it

[aks2-admin|default] ferdi@u820:~$ kubectl apply -f aks2-test2-deploy-meshed.yml
[aks2-admin|default] ferdi@u820:~$ kubectl label svc/test2 mirror.linkerd.io/exported=true 
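
aks2-test2-deploy-meshed.yml is, roughly, a meshed Deployment plus a Service named test2 on port 3000; the sketch below is illustrative (the image is a placeholder), the real manifest is in the attached ZIP:

# Illustrative sketch only -- not the actual aks2-test2-deploy-meshed.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test2
  template:
    metadata:
      labels:
        app: test2
      annotations:
        linkerd.io/inject: enabled   # this is what "meshed" means here
    spec:
      containers:
      - name: test2
        image: my-echo-server:latest   # placeholder: any HTTP echo service listening on 3000
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: test2
  namespace: default
spec:
  selector:
    app: test2
  ports:
  - port: 3000
    targetPort: 3000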

Reachability test from cluster1, using an ad-hoc (meshed) Deployment

[aks1-admin|default] ferdi@u820:~$ kubectl apply -f prova-deploy.yml

[aks1-admin|default] ferdi@u820:~$ kubectl exec -ti deploy/prova -c prova -- curl -v http://test2-aks2:3000/
*   Trying 10.0.94.88:3000...
* Connected to test2-aks2 (10.0.94.88) port 3000 (#0)
...
{
 "path": "/",
 "host": "test2.default.svc.cluster.local:3000",
...
  "L5d-Dst-Canonical": [
   "test2.default.svc.cluster.local:3000"
  ],
  "User-Agent": [
   "curl/7.88.1"
  ]
 },
 "namespace": "default",
 "ingress": "",
 "service": "",
 "pod": "test2-7f54df8775-gnh8t"
...

It works.

When I try with an HTTPRoute

---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: test1
  namespace: default
spec:
  hostnames:
  - test1.fsimonetti.info
  parentRefs:
  - name: webext1
    sectionName: test1
    namespace: default
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: test2-aks2
      port: 3000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /

Here's the problem:

[aks1-admin|linkerd-multicluster] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ curl -v https://test1.fsimonetti.info/
*   Trying 4.232.12.166:443...
* Connected to test1.fsimonetti.info (4.232.12.166) port 443 (#0)
...
< HTTP/2 500 
< l5d-proxy-error: unexpected error
< l5d-proxy-connection: close
...

Thanks in advance

Logs, error output, etc

Logs of the linkerd-proxy container for the ext1 Envoy Gateway deployment

[ 15541.427775s]  INFO ThreadId(01) outbound:proxy{addr=10.1.2.8:4143}:forward{addr=10.1.2.8:4143}:rescue{client.addr=10.244.1.18:35686}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=endpoint 10.1.2.8:4143: connection closed before message completed error.sources=[connection closed before message completed]
[ 15541.427810s]  INFO ThreadId(01) outbound:proxy{addr=10.1.2.8:4143}:rescue{client.addr=10.244.1.18:35686}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.1.2.8:4143: route default.endpoint: backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed error.sources=[route default.endpoint: backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed, backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed, endpoint 10.1.2.8:4143: connection closed before message completed, connection closed before message completed]
[ 15541.427819s]  WARN ThreadId(01) outbound:proxy{addr=10.1.2.8:4143}:rescue{client.addr=10.244.1.18:35686}: linkerd_app_outbound::http::server: Unexpected error error=logical service 10.1.2.8:4143: route default.endpoint: backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed error.sources=[route default.endpoint: backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed, backend default.unknown: endpoint 10.1.2.8:4143: connection closed before message completed, endpoint 10.1.2.8:4143: connection closed before message completed, connection closed before message completed]

Logs of the envoy container for the ext1 Envoy Gateway deployment

{"start_time":"2024-01-03T16:02:11.868Z","method":"GET","x-envoy-origin-path":"/","protocol":"HTTP/2","response_code":"500","response_flags":"-","response_code_details":"via_upstream","connection_termination_details":"-","upstream_transport_failure_reason":"-","bytes_received":"0","bytes_sent":"0","duration":"6","x-envoy-upstream-service-time":"5","x-forwarded-for":"10.244.1.18","user-agent":"curl/7.88.1","x-request-id":"c93cad25-1cb7-49bd-bfa3-c96c8b3acd7d",":authority":"test1.fsimonetti.info","upstream_host":"10.1.2.8:4143","upstream_cluster":"httproute/default/test1/rule/0","upstream_local_address":"10.244.1.18:35686","downstream_local_address":"10.244.1.18:10443","downstream_remote_address":"10.244.1.18:46194","requested_server_name":"test1.fsimonetti.info","route_name":"httproute/default/test1/rule/0/match/0/test1_fsimonetti_info"}
{"start_time":"2024-01-03T16:03:07.779Z","method":"GET","x-envoy-origin-path":"/","protocol":"HTTP/1.1","response_code":"404","response_flags":"NR","response_code_details":"route_not_found","connection_termination_details":"-","upstream_transport_failure_reason":"-","bytes_received":"0","bytes_sent":"0","duration":"0","x-envoy-upstream-service-time":"-","x-forwarded-for":"10.244.1.18","user-agent":"-","x-request-id":"f9b2b015-b7c7-4eb5-a063-36893625f643",":authority":"4.232.12.166:80","upstream_host":"-","upstream_cluster":"-","upstream_local_address":"-","downstream_local_address":"10.244.1.18:10080","downstream_remote_address":"10.244.1.18:60978","requested_server_name":"-","route_name":"-"}

Output of linkerd check -o short

[aks1-admin|default] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ linkerd check -o short
Status check results are √
[aks2-admin|default] ferdi@u820:~/falck/activities/documento-envoy-linkerd$ linkerd check -o short
Status check results are √

Environment

  • Kubernetes version: AKS 1.27.7 with kubenet
  • Envoy Gateway version: v0.0.0-latest (as of yesterday)
  • Linkerd version: 2.14.7

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

yes

ferdinandosimonetti · Jan 03 '24 16:01

I've tried reinstalling Envoy Gateway and its Gateways using linkerd.io/inject: ingress instead of enabled, as suggested here for meshing various ingress controllers, but the problem remains.
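
For reference, that is just a swap of the annotation value wherever it was previously set (illustrative):

annotations:
  linkerd.io/inject: ingress   # instead of "enabled"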

ferdinandosimonetti · Jan 03 '24 16:01

Hey @ferdinandosimonetti! Can we see the full linkerd check output in this case? (or, if you don't want to do that, linkerd mc check will be enough to get started with...)

kflynn · Jan 04 '24 15:01

Hi Flynn! Can we suspend the issue until Jan 7th? I'm currently on vacation, with very limited access to the affected clusters... Thanks in advance

ferdinandosimonetti · Jan 04 '24 21:01

@ferdinandosimonetti Not a problem! We'll come back to it next week. 🙂

kflynn · Jan 05 '24 01:01

Here we are, Flynn!!

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.14.7 but the latest stable version is 2.14.8
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 2.14.7 but the latest stable version is 2.14.8
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-7db7ffc698-m8gw5 (stable-2.14.7)
    * linkerd-identity-9fb8bf9d-qnldq (stable-2.14.7)
    * linkerd-proxy-injector-74fc587d88-6jwgb (stable-2.14.7)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
√ control plane proxies and cli versions match

linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
    * aks2
√ remote cluster access credentials are valid
    * aks2
√ clusters share trust anchors
    * aks2
√ service mirror controller has required permissions
    * aks2
√ service mirror controllers are running
    * aks2
√ probe services able to communicate with all gateway mirrors
    * aks2
√ all mirror services have endpoints
√ all mirror services are part of a Link
√ multicluster extension proxies are healthy
‼ multicluster extension proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-gateway-577cd64d56-2p7zj (stable-2.14.7)
    * linkerd-service-mirror-aks2-54f88cff9f-qtbx8 (stable-2.14.7)
    see https://linkerd.io/2.14/checks/#l5d-multicluster-proxy-cp-version for hints
√ multicluster extension proxies and cli versions match

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-5b744f4799-89czv (stable-2.14.7)
    * prometheus-64c858cf74-fmq9n (stable-2.14.7)
    * tap-86b8dbc46d-vsqqk (stable-2.14.7)
    * tap-injector-6bb795c55-jbkbt (stable-2.14.7)
    * web-7cfd66547-48ntw (stable-2.14.7)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ viz extension self-check

Status check results are √

Client version: stable-2.14.7
Server version: stable-2.14.7

ferdinandosimonetti · Jan 08 '24 10:01

So after far longer than I want to admit, the ultimate problem here is that Envoy Gateway is routing directly to endpoint IPs rather than Service IPs (this is tracked by https://github.com/envoyproxy/gateway/issues/1900, although the description isn't great).

A possible workaround is to mesh Envoy Gateway in ingress mode (see https://linkerd.io/2.15/tasks/using-ingress/#ingress-mode) and then have Envoy Gateway add the l5d-dst-override request header. This is a manual thing right now using a RequestHeaderModifier filter, unfortunately, so it might not help you.
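
A minimal sketch of what that could look like on the HTTPRoute above (illustrative and untested here; the header value follows Linkerd's ingress-mode convention and targets the mirrored test2-aks2 Service in the default namespace):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: test1
  namespace: default
spec:
  hostnames:
  - test1.fsimonetti.info
  parentRefs:
  - name: webext1
    sectionName: test1
    namespace: default
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    filters:
    - type: RequestHeaderModifier
      requestHeaderModifier:
        set:
        - name: l5d-dst-override   # read by the Linkerd proxy in ingress mode
          value: test2-aks2.default.svc.cluster.local:3000
    backendRefs:
    - group: ""
      kind: Service
      name: test2-aks2
      port: 3000
      weight: 1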

I'm going to close this one, though, since it's an Envoy Gateway limitation and there's already an Envoy Gateway bug tracking it.

kflynn · Mar 14 '24 22:03