There doesn't appear to be a way to create an API Gateway, or Gateway per cluster in a federated WAN

Open codex70 opened this issue 3 years ago • 35 comments

Overview of the Issue

I don't seem to be able to set up API Gateway in such a way that I can either access all mesh services from a single API Gateway or use an API Gateway per cluster.

Reproduction Steps

  1. Set up an initial cluster using Helm charts and create an API Gateway (this all works as expected).
  2. Set up a second federated cluster following the instructions here: https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes
  3. Services in the second datacenter are not accessible to the API Gateway created in the first datacenter cluster.
  4. Using the federated setup, creating a new API Gateway to access services in the second datacenter fails with SSL connection issues.

Logs

Error when trying to add mesh service from second cluster to API Gateway in first cluster

k get httproute/test-service-route -n test -o jsonpath='{.status}' | jq
{
  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "k8s: service test/test-service not found",
          "observedGeneration": 2,
          "reason": "ServiceNotFound",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }
  ]
}

Error when trying to connect to a second API Gateway in the second datacenter cluster.

curl -vvi -k --header "Host: test-service.api.gateway" "https://${API}:8443/"
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 8443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445

Expected behavior

There is a documented solution for setting up API Gateways across federated clusters.

Environment details

Additional Context

I suspect this is a simple case of me not seeing the specific documentation required to set this up correctly, but I'm having a lot of problems getting the API Gateway up and running across multiple clusters.

codex70 avatar Aug 08 '22 07:08 codex70

  1. First, are you using the kind: MeshService backend? The ResolvedRefs status condition you're seeing seems to indicate a failure to resolve a Kubernetes Service named test-service in the test namespace, rather than a Consul service. The standard kind: Service Route backend will only find Kubernetes Services in the same Kubernetes cluster, not Consul services outside the Kubernetes cluster to which Consul API Gateway is deployed.

  2. While this doesn't seem to be documented, I believe the functionality of forwarding traffic to Consul services in other datacenters is not yet supported. Consul service resolution from MeshService uses findCatalogService and doesn't specify a Datacenter parameter for api.QueryOptions, which I believe would limit results to Consul services registered in the same datacenter as the Consul agent serving the API request. If you're trying to reach a service from a different Kubernetes cluster registered in the same Consul datacenter though, this may work, but I haven't tested to confirm. https://github.com/hashicorp/consul-api-gateway/blob/145bcc9bf009a21b2170f7c27928bcbdca856c9a/internal/k8s/service/resolver.go#L382-L384

    If using Consul Enterprise, the Consul namespace will be inferred from the connectInject.consulNamespaces configuration, for Consul OSS deployments it will be the default namespace.

  3. I'm not quite sure what would be causing the TLS error when attempting to deploy an API Gateway in a secondary datacenter, but I believe that functionality is likewise not yet supported.
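
To illustrate point 2: Consul's HTTP catalog endpoint accepts a `dc` query parameter, and when it is omitted the request is answered for the datacenter of the agent that receives it. The sketch below only builds the two request URLs so the difference is visible. The agent address, the service name, and the datacenter name are placeholders, and the helper `catalog_service_url` is hypothetical; this is not the resolver code from consul-api-gateway.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def catalog_service_url(agent_addr: str, service: str, dc: Optional[str] = None) -> str:
    """Build the URL for Consul's GET /v1/catalog/service/:service endpoint.

    When `dc` is None no datacenter is named in the query, so the agent
    answers for its own datacenter only.
    """
    url = f"{agent_addr.rstrip('/')}/v1/catalog/service/{quote(service, safe='')}"
    if dc:
        url += "?" + urlencode({"dc": dc})
    return url


# Placeholder values, not taken from the issue.
local_query = catalog_service_url("http://127.0.0.1:8500", "test-service")
remote_query = catalog_service_url("http://127.0.0.1:8500", "test-service", dc="dc2")
print(local_query)
print(remote_query)
```

A resolver that never sets the datacenter behaves like `local_query`, which would be consistent with services in another datacenter not being found.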

mikemorris avatar Aug 08 '22 18:08 mikemorris

Thanks for getting back to me about this, it definitely helps explain what's going on. I did try MeshService, but it complained about the type (will check the error message, but I suspect I need to apply the following: https://github.com/hashicorp/consul-api-gateway/blob/main/config/crd/bases/api-gateway.consul.hashicorp.com_meshservices.yaml)

I will investigate this in more detail tomorrow and let you know how I get on. I have two options one is the Single Consul Datacenter in Multiple Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/deployment-configurations/single-dc-multi-k8s) and the other Federation Between Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes). I have managed to get either option working with varying degrees of success for cross cluster and service mesh communication.

Anyway, I will do more testing and update the thread tomorrow.

codex70 avatar Aug 08 '22 18:08 codex70

Missing CRD would definitely explain not being able to use MeshService, make sure you're installing the CRDs as described at https://www.consul.io/docs/api-gateway/consul-api-gateway-install#installation to get Consul API Gateway's custom CRDs (such as MeshService) in addition to the upstream Gateway API CRDs.

Definitely let us know about anything you manage to get working, and we'll consider proper support for federated services as a feature for our roadmap.

mikemorris avatar Aug 08 '22 18:08 mikemorris

@mikemorris, I was hoping to have a look at this, but realised that, whatever configuration changes I have made, the cross-cluster service mesh connection through the mesh gateway is now broken for Kafka. I was running Kafka inside the service mesh and it was working. I've tried to roll back my changes but can't get it working again, and I'm finding the issue difficult to debug. Is it worth mentioning here, should I open another ticket, or is there a better place to seek support for the mesh gateway?

codex70 avatar Aug 10 '22 15:08 codex70

By the way, I checked the CRDs. I had installed them, but for a previous version, so perhaps updating them will fix some of the issues. As for the Kafka problem, I've opened a separate issue as it's something very different: https://github.com/hashicorp/consul/issues/14125. I will get back to you about this as soon as the Kafka issue is fixed.

codex70 avatar Aug 10 '22 17:08 codex70

Looks like https://github.com/hashicorp/consul-k8s/issues/1344 is tracking the issue currently preventing creation of a Gateway in secondary datacenters in a WAN-federated Consul deployment.

mikemorris avatar Aug 16 '22 17:08 mikemorris

Thanks @mikemorris, as you can see I've added my comment there as well. I've also fixed the issue I had with implementing Kafka, which now frees me up to do some more testing on the API gateway.

codex70 avatar Aug 16 '22 17:08 codex70

@mikemorris I've now been able to do some more testing. If I add in kind: MeshService, I get the following error when looking at the route's status:

  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "unsupported reference type",
          "observedGeneration": 2,
          "reason": "Errors",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }

codex70 avatar Aug 17 '22 10:08 codex70

More importantly though, is there a way of debugging an HttpRoute? I've currently only got one route that's working, the second route looks like everything is correct, but when I try to curl the endpoint, it returns a 404 error. I can't see anything in any of the logs to tell me where the error is.

codex70 avatar Aug 17 '22 10:08 codex70

> More importantly though, is there a way of debugging an HttpRoute?

How you've been doing it so far is correct - first checking the route status field, then controller logs - if something isn't implemented correctly it may be helpful to dump the actual applied Envoy config, but this should be enough to debug most cases (and when it's not, we could likely benefit from contributions improving status messages, logs, or docs).

A route is only "applied/in effect" when its type: Accepted condition has status: True (hence the 404 for no match), and would only successfully route to a backend when type: ResolvedRefs also has status: True.
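
As a rough illustration of that rule, the following sketch reads the JSON printed by `kubectl get httproute ... -o jsonpath='{.status}'` and reports, per parent Gateway, whether the route is in effect and whether it can reach a backend. The function name `route_parent_summary` is made up for this example, and the sample status is trimmed from the one earlier in this thread.

```python
import json
from typing import Dict, List


def route_parent_summary(status: dict) -> List[Dict[str, object]]:
    """Summarise each parent entry of an HTTPRoute .status object."""
    summary = []
    for parent in status.get("parents", []):
        conditions = parent.get("conditions", [])
        # Map condition type -> True/False based on its "status" string.
        by_type = {c["type"]: c["status"] == "True" for c in conditions}
        accepted = by_type.get("Accepted", False)
        resolved = by_type.get("ResolvedRefs", False)
        summary.append({
            "gateway": parent.get("parentRef", {}).get("name"),
            "in_effect": accepted,                    # False -> expect a 404
            "routes_to_backend": accepted and resolved,
            "failing_messages": [
                c.get("message", "").strip() for c in conditions if c["status"] != "True"
            ],
        })
    return summary


status_json = """
{
  "parents": [
    {
      "conditions": [
        {"type": "Accepted", "status": "False", "reason": "BindError",
         "message": "route is in an invalid state and cannot bind"},
        {"type": "ResolvedRefs", "status": "False", "reason": "ServiceNotFound",
         "message": "k8s: service test/test-service not found"}
      ],
      "parentRef": {"kind": "Gateway", "name": "api-gateway", "namespace": "consul"}
    }
  ]
}
"""

result = route_parent_summary(json.loads(status_json))
for entry in result:
    print(entry)
```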

> if I add in kind: MeshService I get the following error when looking at the route's status:
>
> "message": "unsupported reference type",
> "status": "False",
> "type": "ResolvedRefs"

In addition to specifying kind: MeshService, you also need to set group: api-gateway.consul.hashicorp.com in that BackendRef, because group defaults to the core API group (the group of kind: Service) if unspecified. That mismatch is what causes the unsupported reference type error message: the controller looks for a MeshService kind in the core API group, where it doesn't exist. If the CRD is installed, the kind exists in our implementation-specific group.
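
The defaulting can be modelled in a few lines. This is only a sketch of the Gateway API defaults for a backend reference (group defaults to the core group, written as an empty string, and kind defaults to Service); the set of supported pairs and the function names are assumptions made for illustration, not the controller's actual lookup.

```python
from typing import Tuple

CORE_GROUP = ""  # the Kubernetes core API group is the empty string
MESH_SERVICE_GROUP = "api-gateway.consul.hashicorp.com"

# Backend kinds this sketch treats as resolvable, keyed by (group, kind).
SUPPORTED_BACKENDS = {
    (CORE_GROUP, "Service"),
    (MESH_SERVICE_GROUP, "MeshService"),
}


def effective_group_kind(backend_ref: dict) -> Tuple[str, str]:
    """Apply the defaults: a missing group means core, a missing kind means Service."""
    return backend_ref.get("group", CORE_GROUP), backend_ref.get("kind", "Service")


def is_supported(backend_ref: dict) -> bool:
    """Return True if the defaulted (group, kind) pair is one the sketch knows."""
    return effective_group_kind(backend_ref) in SUPPORTED_BACKENDS


# kind set but group omitted: resolves to ("", "MeshService"), which does not exist.
without_group = {"kind": "MeshService", "name": "test-service"}
# both set: resolves to the implementation-specific group.
with_group = {"group": MESH_SERVICE_GROUP, "kind": "MeshService", "name": "test-service"}

print(effective_group_kind(without_group), is_supported(without_group))
print(effective_group_kind(with_group), is_supported(with_group))
```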

This is documented in the Routes configuration docs, but should probably be mentioned in MeshService too.

mikemorris avatar Aug 23 '22 19:08 mikemorris

@codex70 @manobi I recorded a demo yesterday pulling together the 3 related PRs that will be included across the upcoming consul-k8s v0.49.0 and consul-api-gateway v0.5.0 releases to support Gateway per cluster in a federated setup:

  • https://github.com/hashicorp/consul-k8s/pull/1481
  • https://github.com/hashicorp/consul-api-gateway/pull/368
  • https://github.com/hashicorp/consul-k8s/pull/1511

Note: This adds support for a Gateway in the secondary datacenter routing to services within the same datacenter. This does not add support for routing from a Gateway in one datacenter to services in another datacenter. This is now reflected in our docs, which will be updated again when the releases referenced above are completed.

https://user-images.githubusercontent.com/3476400/193070791-541d526e-2606-4560-84a4-1136f12c56f4.mp4

nathancoleman avatar Sep 29 '22 15:09 nathancoleman

@nathancoleman I'll try this soon, thank you for sharing.

manobi avatar Sep 29 '22 15:09 manobi

@nathancoleman I've tried with consul-k8s (0.49.0) and hashicorppreview/consul-api-gateway:0.5-dev but still:

2022-10-02T00:09:03.658Z [ERROR] consul/certmanager.go:257: consul-api-gateway-server.cert-manager: error grabbing leaf certificate: error="Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID 'REDACTED' lacks permission 'service:write' on \"consul-api-gateway-controller\")"

This is what it looks like in the Consul UI on "DC2" (AccessorIDs and datacenter name have been redacted): [Screen Shot 2022-10-01 at 22 43 44] [Screen Shot 2022-10-01 at 22 40 16] [Screen Shot 2022-10-01 at 22 40 24]

PS: my DC1 is still running consul-k8s v0.48.0 and has many federated datacenters connected (31), each on a different version.

manobi avatar Oct 02 '22 00:10 manobi

Hi @manobi :wave: I was able to get everything working w/ fresh clusters/datacenters using 0.48.0 for the primary dc and 0.49.0 for the secondary dc. I do notice though that the role for the controller in my case has a policy attached where yours does not. I'm looking into how this could have come to be in your case. Does an analogous policy (api-gateway-controller-policy-<dc_name>) exist in your UI and just isn't attached to the role, or does the policy not exist at all?

PS: any chance you could share your values.yaml files? Also curious if you did an upgrade with the Gateway already existing in your K8s cluster from when you had consul-k8s 0.48.0 installed, or did you recreate it after installing 0.49.0?

image

nathancoleman avatar Oct 03 '22 19:10 nathancoleman

Hi @nathancoleman, the policy does exist, and when the secondary datacenter was created there was already a registered Gateway in the primary dc (v0.48.0).

Screen Shot 2022-10-03 at 16 34 57
apiGateway:
  enabled: true
  image: hashicorppreview/consul-api-gateway:0.5-dev
  managedGatewayClass:
    copyAnnotations:
      service:
        annotations: |
          - service.beta.kubernetes.io/aws-load-balancer-backend-protocol
          - service.beta.kubernetes.io/aws-load-balancer-name
          - service.beta.kubernetes.io/aws-load-balancer-nlb-target-type
          - service.beta.kubernetes.io/aws-load-balancer-scheme
          - service.beta.kubernetes.io/aws-load-balancer-type
          - service.beta.kubernetes.io/aws-load-balancer-ssl-cert
client:
  extraConfig: |
    {
      "leave_on_terminate": true,
      "advertise_reconnect_timeout": "60s",
      "limits": {
        "http_max_conns_per_client": 65535
      }
    }
  priorityClassName: heaviest
  resources:
    limits:
      cpu: 100m
      memory: 350Mi
    requests:
      cpu: 20m
      memory: 200Mi
connectInject:
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: false
    defaultEnabled: false
  resources:
    limits:
      cpu: 50m
      memory: 180Mi
    requests:
      cpu: 50m
      memory: 180Mi
  sidecarProxy:
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 13m
        memory: 81Mi
controller:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi
global:
  acls:
    createReplicationToken: false
    manageSystemACLs: true
    replicationToken:
      secretKey: replicationToken
      secretName: consul-consul-federation
  consulAPITimeout: 5m
  datacenter: qa-ecommerce
  enableGatewayMetrics: true
  federation:
    enabled: true
    k8sAuthMethodHost: <REDACTED>
    primaryDatacenter: dc1
  metrics:
    agentMetricsRetentionTime: 1m
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enableGatewayMetrics: true
    enabled: true
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-consul-federation
    enabled: true
ingressGateways:
  defaults:
    service:
      annotations: |
        "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-ingress-gate"
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
        "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
        "service.beta.kubernetes.io/aws-load-balancer-ssl-cert": ""
        "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
      ports:
      - nodePort: null
        port: 443
      type: LoadBalancer
  enabled: false
  gateways:
  - name: ingress-gateway
  resources:
    limits:
      cpu: 400m
      memory: 150Mi
    requests:
      cpu: 160m
      memory: 100Mi
meshGateway:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: 300m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
  service:
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "ssl"
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-mesh-gateway"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
      "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
server:
  extraConfig: |
    {
      "ui_config": {
        "enabled": true,
        "metrics_provider": "prometheus",
        "metrics_proxy": {
          "base_url": "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
        },
        "dashboard_url_templates": {
          "service": "<redacted>"
        }
      }
    }
  extraVolumes:
  - items:
    - key: serverConfigJSON
      path: config.json
    load: true
    name: consul-consul-federation
    type: secret
  nodeSelector: ""
  priorityClassName: heavy
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 250m
      memory: 400Mi
ui:
  metrics:
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enabled: true
    provider: prometheus

manobi avatar Oct 03 '22 19:10 manobi

@manobi if you apply that policy to the role analogous to the one I screenshotted, does everything work for you standing up a Gateway in the secondary dc?

nathancoleman avatar Oct 03 '22 19:10 nathancoleman

@nathancoleman From the UI it's not working; the browser crashes while loading the policy options. Maybe there are too many roles/policies, and the same error happens during token bootstrap?

consul acl policy list -token=<redacted> | grep ID | wc -l
252
consul acl role update -id=16382188-2b3f-a628-a434-af342bf2f97e -policy-id=d1acd2a4-bffc-7ddf-63b5-14af3f338417 -token=<redacted>

After that the consul-api-gateway-controller seems to be running, but how can I make sure it will work the next time I upgrade?

manobi avatar Oct 03 '22 20:10 manobi

@manobi I'm hoping to understand why it failed in this case. Any chance you have the logs from the consul-api-gateway-controller pod's api-gateway-controller-acl-init container when this failed? It seems like the logic to bind the policy to the role here failed

nathancoleman avatar Oct 03 '22 20:10 nathancoleman

Even after the manual attachment, the api-gateway-controller-acl-init container failed twice before it started running, with the following logs:

2022-10-03T20:14:33.393Z [INFO] Consul login complete
2022-10-03T20:14:33.393Z [INFO] Checking that the ACL token exists when reading it in the stale consistency mode
2022-10-03T20:14:33.394Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.497Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.598Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.701Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.803Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.905Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.008Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.110Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.214Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.316Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.418Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.520Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.623Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.725Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.827Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"

I've noticed a similar behaviour with mesh-gateway and controller components as well. After your direction and the UI crashing I'm starting to believe it is skipping the binding rules list somehow, when there are many items to process.

This might not be related to api-gateway but rather be a consul-k8s bug.

manobi avatar Oct 03 '22 20:10 manobi

@manobi that would make sense as the possible cause. That scale is the main difference between my temporary setups and your own. I'll be traveling most of this week but will see if I can find out anything once I'm back.

nathancoleman avatar Oct 03 '22 21:10 nathancoleman

The 403 (ACL not found) errors look like they could be a manifestation of https://github.com/hashicorp/consul-k8s/pull/887

@nathancoleman could we maybe implement the same workaround as consul-ecs did in https://github.com/hashicorp/consul-ecs/pull/79 until Consul adds "read your writes" support for an improved consul login UX (without the performance overhead of switching to consistent reads)?

mikemorris avatar Oct 04 '22 18:10 mikemorris

@mikemorris Given that my api-gateway-controller is running and I have deployed the Gateway resource, when I apply the ReferenceGrant and HTTPRoute in my secondary dc the routing does not seem to be working.

Is there a way to check whether the routing has actually been registered? Unlike for Gateways in the primary dc, the Consul UI does not show connections between the gateway and the target service.

With log-level=trace enabled I saw the following status:

"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "Ready",
    "message": "Ready"
  },
  {
    "type": "Scheduled",
    "status": "True",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "Scheduled",
    "message": "Scheduled"
  },
  {
    "type": "InSync",
    "status": "False",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "SyncError",
    "message": "error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n"
  }
],
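
When several of these 403 messages pile up, it can help to pull out just the token and the permission it is missing. The sketch below does that with a regular expression written against the two message shapes that appear in this thread; the pattern and the helper name `parse_denied` are my own, and other Consul versions may word the error differently.

```python
import re
from typing import Dict, Optional

# Matches "token with AccessorID '<id>' lacks permission '<perm>'", optionally
# followed by ' on "<resource>"' (the quotes may be backslash-escaped in logs).
DENIED = re.compile(
    r"token with AccessorID '(?P<accessor_id>[^']+)' "
    r"lacks permission '(?P<permission>[^']+)'"
    r'(?: on \\?"(?P<resource>[^"\\]+)\\?")?'
)


def parse_denied(message: str) -> Optional[Dict[str, Optional[str]]]:
    """Return accessor_id, permission and resource (or None) from a 403 message."""
    match = DENIED.search(message)
    return match.groupdict() if match else None


# The InSync condition message from the Gateway status above.
gateway_msg = (
    "error adding ingress config entry: 1 error occurred:\n\t* Unexpected response "
    "code: 403 (rpc error making call: rpc error making call: Permission denied: "
    "token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission "
    "'mesh:write')\n\n"
)
# The tail of the controller log line quoted earlier in the thread.
controller_log = r'''Permission denied: token with AccessorID 'REDACTED' lacks permission 'service:write' on \"consul-api-gateway-controller\")'''

print(parse_denied(gateway_msg))
print(parse_denied(controller_log))
```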

The HTTPRoute resource status seems to be OK, but routing is not working:

status:
  parents:
    - conditions:
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: Route accepted.
          observedGeneration: 1
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: ResolvedRefs
          observedGeneration: 1
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs

Upstreams in secondary DC (0): Screen Shot 2022-10-04 at 20 15 12

Upstreams in primary DC (1): Screen Shot 2022-10-04 at 20 15 42


consul-k8s proxy read <gateway-pod-name> -context=dc2:

==> Clusters (3)
==> Endpoints (3)	
==> Listeners (1)
==> Routes (1)
==> Secrets (2)

consul-k8s proxy read <gateway-pod-name> -context=dc1:

==> Clusters (6)
==> Endpoints (6)
==> Listeners (2)
==> Routes (1)
==> Secrets (2)

manobi avatar Oct 04 '22 23:10 manobi

Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

nathancoleman avatar Oct 11 '22 21:10 nathancoleman

> Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

Yes, they are all running in the secondary datacenter, but I have not been able to get this working. I'm still seeing the following in api-gateway-controller:

error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n

How can I grant this "mesh:write" permission?

manobi avatar Oct 12 '22 00:10 manobi

https://github.com/hashicorp/consul-api-gateway/blob/8f9040100434a648713a55f30950c182e29f5c22/internal/adapters/consul/sync.go#L354

The gateway deployment is running in the secondary datacenter, but there is no service-defaults or ingress-gateway config entry registered. What policy should api-gateway-controller use to be able to register those configs?

manobi avatar Oct 13 '22 15:10 manobi

@manobi I'd expect it to be using api-gateway-controller-policy-<datacenter> which has the higher-level operator = "write" permission. You can see what I'm expecting in the screenshot a ways up https://github.com/hashicorp/consul-api-gateway/issues/300#issuecomment-1265925913.

It makes sense that the config entries aren't registered because the controller isn't able to create them in your setup. I'm not yet sure why this is, and I haven't been able to reproduce it myself.

Just to be certain, to replicate your setup, I need consul-k8s v0.48.0 in my primary datacenter and consul-k8s v0.49.0 in my secondary datacenter. Is that accurate? Are you using consul-api-gateway v0.5-dev in both datacenters?

nathancoleman avatar Oct 13 '22 17:10 nathancoleman

@nathancoleman The only way I've managed to make it work was by attaching the controller-policy to the api-gateway-controller token.

My current setup is the following one:

Primary datacenter:

  • consul-k8s: v0.48.0
  • hashicorp/consul-api-gateway:0.4.0
  • hashicorp/consul:1.13.2

Secondary datacenter:

  • consul-k8s:v0.49.0
  • hashicorppreview/consul-api-gateway:0.5-dev-b98d845e31176332d7c65884f08d1e95ff2897c6
  • hashicorp/consul:1.13.2

manobi avatar Oct 13 '22 21:10 manobi

@manobi here's a writeup of the whole process I went through to replicate the issue, but I'm still seeing everything work. I figure at least this will show what the Kubernetes Deployment and Consul roles+policies for the consul-api-gateway-controller should look like. Can you take a look and let me know if anything I'm doing doesn't match your setup or if you can identify the diff between my resulting config and yours? Feel free to comment right on the gist if you like.

https://gist.github.com/nathancoleman/076343780c3e0b4c03fb91f9d4f84616

nathancoleman avatar Oct 13 '22 23:10 nathancoleman

@nathancoleman thank you, I'll try to reproduce your steps. The manual changes I made allowed me to test other things. Do you think something changed in 0.5 that would break URLRewrite?

The service router is not reading the filters with URLRewrite:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: my-service
  namespace: consul
spec:
  parentRefs:
  - name: digital-api-qa
  rules:
    - matches:
      - path:
          type: PathPrefix
          value: "/my-service/v1"
      backendRefs:
        - kind: Service
          name: my-service
          namespace: my-service
          port: 80
          weight: 100
      filters:
      - type: URLRewrite
        urlRewrite:
          path:
            type: ReplacePrefixMatch
            replacePrefixMatch: "/api/v1"

Becomes:

{
    "Kind": "service-router",
    "Name": "digital-api-qa-735653bb",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/my-service/v1"
                }
            },
            "Destination": {
                "Service": "my-service",
                "RequestHeaders": {}
            }
        }
    ],
    "Meta": {
        "consul-api-gateway/k8s/Gateway.Name": "digital-api-qa",
        "consul-api-gateway/k8s/Gateway.Namespace": "consul",
        "external-source": "consul-api-gateway"
    },
    "CreateIndex": 242705,
    "ModifyIndex": 242705
}
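
For reference, this is the rewrite I would expect the filter above to perform, sketched from how the Gateway API describes ReplacePrefixMatch (only the matched prefix is replaced, and prefixes match whole path elements). The function is illustrative and is not how the controller implements it. I would also expect the rewrite to surface as a PrefixRewrite on the route's Destination in the service-router, but that mapping is an assumption on my part.

```python
def replace_prefix_match(path: str, matched_prefix: str, replacement: str) -> str:
    """Rewrite `path` the way a ReplacePrefixMatch filter is specified to.

    Only the matched prefix is replaced; the rest of the path is kept.
    Prefix matching is per path element, so "/my-service/v10" does not
    match the prefix "/my-service/v1".
    """
    prefix = matched_prefix.rstrip("/")
    if path != prefix and not path.startswith(prefix + "/"):
        return path  # the rule does not match, so nothing is rewritten
    rest = path[len(prefix):]
    rewritten = replacement.rstrip("/") + rest
    return rewritten or "/"


# Values taken from the HTTPRoute above; the request paths are examples.
for request_path in ("/my-service/v1/users", "/my-service/v1", "/my-service/v10/users"):
    print(request_path, "->", replace_prefix_match(request_path, "/my-service/v1", "/api/v1"))
```

With the service-router shown above, which has no rewrite on its Destination, the backend would still receive the original /my-service/v1 prefix.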

manobi avatar Oct 14 '22 00:10 manobi

@manobi thanks for calling that out. Fixed in https://github.com/hashicorp/consul-api-gateway/pull/414

nathancoleman avatar Oct 14 '22 22:10 nathancoleman