rollouts-plugin-trafficrouter-gatewayapi
Experience running this plugin with gRPCRoutes in a linkerd-meshed cluster
It appears we're among the first to test this plugin with Linkerd and GRPCRoutes, so I thought I'd share what we've learned in case it helps others.
We're running a custom build (built straight from trunk), hosted in S3, and injecting it into Argo Rollouts using the Helm chart:
controller:
  trafficRouterPlugins:
    trafficRouterPlugins: |-
      - name: "argoproj-labs/gatewayAPI"
        location: "https://********/gatewayapi-plugin-linux-amd64"
Our GRPCRoutes look something like this:
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GRPCRoute
metadata:
  annotations:
    retry.linkerd.io/grpc: cancelled,deadline-exceeded,resource-exhausted,unavailable,internal
    retry.linkerd.io/limit: "3"
    retry.linkerd.io/timeout: 300ms
  name: abc-grpc-route-query
  namespace: abc
spec:
  parentRefs:
    - group: ""
      kind: Service
      name: init
      port: 80
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: xyz
          port: 80
          weight: 100
        - group: ""
          kind: Service
          name: xyz-canary
          port: 80
          weight: 0
      matches:
        - method:
            method: xyz
            service: abc.xyz
            type: Exact
Our rollouts have the following canary strategy configuration:
trafficRouting:
  plugins:
    argoproj-labs/gatewayAPI:
      grpcRoutes:
        - name: xyz-read-grpc-route-query
        - name: xyz-read-grpc-route-command
      namespace: init
We just rolled this out to our staging cluster, and the GRPCRoutes seem to update in real time, as they're supposed to. I'm going to try to get some metrics from Linkerd to see how it all works and post them here within a couple of days.
We're running linkerd-enterprise-control-plane helm chart version 2.16 which introduced support for retries in grpcroutes, which was our motivation for migrating everything over to the new gateway api.
One thing we did run into while setting this up was a CRD incompatibility between linkerd-crds and this plugin: linkerd-crds installs HTTPRoute v1alpha2, whereas this plugin expects v1. This was relatively easy to work around, as Traefik, which we also use, ships with the v1 CRDs.
So we've deployed this all to our cluster, but I'm having a hard time verifying if it's working as we're only using the HTTP/GRPC routes for traffic routing during canary rollouts. Our linkerd-proxy metrics show no data for routes. I'm going to try to create a minimal repro locally with kind to see if it actually uses the HTTP/GRPCroutes.
Well, I just got to try HTTPRoute using the podinfo Docker image to verify, and it seems to work very nicely with Linkerd. I've created a simple repro here, which is a bit messy, but it works:
https://github.com/kvist-no/linkerd-gateway-api-repro
The way it works:
- Traefik with an HTTPRoute pointing to the stable svc
- create a second HTTPRoute with a parentRef pointing to the stable svc and backendRefs for the stable and canary svc
- use this plugin to point to the latter HTTPRoute
With this, it seems to work flawlessly! I'm going to test GRPCRoutes next and ensure they work as well.
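For anyone trying to reproduce this, the second HTTPRoute from the list above might look roughly like the sketch below. It mirrors the shape of the GRPCRoute earlier in the thread; the route name, namespace and service names are hypothetical placeholders, and the `httpRoutes` key is assumed to follow the same form as the `grpcRoutes` key shown earlier.

```yaml
# Sketch only: names and namespace are hypothetical placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: podinfo-canary-route
  namespace: demo
spec:
  parentRefs:
    - group: ""
      kind: Service
      name: podinfo-stable      # parentRef points at the stable svc
      port: 80
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: podinfo-stable
          port: 80
          weight: 100           # the plugin rewrites these weights during a rollout
        - group: ""
          kind: Service
          name: podinfo-canary
          port: 80
          weight: 0
---
# Fragment of the Rollout's canary strategy that points the plugin at the route above.
trafficRouting:
  plugins:
    argoproj-labs/gatewayAPI:
      httpRoutes:               # assumed to mirror the grpcRoutes form shown earlier
        - name: podinfo-canary-route
      namespace: demo
```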
Thank you @FredrikAugust for the feedback!
No worries. The status now is that I've confirmed it works fine with Traefik -> Linkerd + this plugin with Argo Rollouts. What's missing is testing that gRPC works as it should, which is a little trickier since Traefik doesn't support GRPCRoutes as of now.
I'll try to test this tomorrow by running a simple application which connects to stable and just calls the Info service of podinfo (which returns the hostname) over and over and logs the result. That way it should be easy to see that the split is ~50/50 and that canary deploys are working in terms of the traffic splits, and that the retries are working as they should for gRPC (I've confirmed they're good for HTTP).
Okay, so I got around to creating the helper tool: https://github.com/kvist-no/grpc-lb-tester.
It does two things: every n seconds it sends two gRPC queries to the podinfo backends:
- Info, to get the hostname (useful for testing that canary routing works as expected)
- Status with code: Unavailable (useful for testing retry.linkerd.io functionality)
And the verdict is that it all seems to work.
I first set up two backends for stable and ensured that they each got ~50% of the traffic (this is controlled by the load-balancing algorithm of l5d). Then I triggered a rollout upgrade and set the steps to
- weight = 50%
- pause
When it paused at 50%, I checked that, again, the one canary pod got ~50% of the traffic and the two stable pods shared the other 50%.
Then I promoted the rollout and the weight flipped to 100% stable and 0% canary, and the traffic routed accordingly. I also tried undo-ing a rollout and that seemed to work fine.
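For reference, the two steps described above correspond roughly to the following canary strategy fragment in the Rollout spec. This is a sketch, not the exact manifest used in the test; the service names are assumptions borrowed from the GRPCRoute example earlier in the thread.

```yaml
# Sketch only: service names are assumed, taken from the GRPCRoute example above.
strategy:
  canary:
    stableService: xyz
    canaryService: xyz-canary
    steps:
      - setWeight: 50   # shift 50% of traffic to the canary backendRef
      - pause: {}       # wait indefinitely until the rollout is promoted manually
```

With an indefinite pause, promoting the rollout is what moves it past the 50% step.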
Secondly, I tested the retry.linkerd.io functionality. I wasn't sure it was going to work, as Linkerd and Traefik use different CRDs for HTTPRoutes, but it also worked fine.
For testing HTTP, I simply ran time curl localhost:8000/status/500 and saw that the response times increased as I upped the retry count for the HTTPRoute, and for gRPC I did the same, only looking at the time it took for my helper tool to get a response (see image).
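As an illustration of the HTTP side of that test, the retry annotations on the HTTPRoute could look something like this. The `limit` and `timeout` annotations are the same ones used on the GRPCRoute above; `retry.linkerd.io/http` is assumed here to be the HTTP counterpart of `retry.linkerd.io/grpc`, and the values are illustrative only.

```yaml
# Fragment of an HTTPRoute's metadata; values are illustrative, not the ones from the test.
metadata:
  annotations:
    retry.linkerd.io/http: 5xx        # assumed HTTP counterpart of retry.linkerd.io/grpc
    retry.linkerd.io/limit: "3"       # raising this should lengthen the timed curl against /status/500
    retry.linkerd.io/timeout: 300ms
```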
I don't think there is anything left to test from the subset of functionality that we will use, but I can loop back if we encounter any problems in production. Thank you for the great plugin, it works very well! I hope we can see a release of the gRPC functionality soon.
I'll update this once I get a response from Linkerd regarding support for GRPCRoute v1.
So I've gotten a response from Linkerd: they want to support v1, but don't have a timeline for it for now. https://github.com/linkerd/linkerd2/issues/13032
@FredrikAugust 0.4.0 was just released and it includes grpc support https://github.com/argoproj-labs/rollouts-plugin-trafficrouter-gatewayapi/releases/tag/v0.4.0
Awesome, @kostis-codefresh! I don't think we'll be able to test it before Linkerd upgrades to stable though, unless there is a way to configure the version used in this plugin, which I don't think there is.
Would it be a bad idea to allow the API version to be controlled through an environment variable?