GatewayAPI Session Affinity not honored
Describe the bug
When configuring a Canary object to use session affinity with a Kubernetes Gateway API provider, as described in the Session Affinity docs, I ran a K6 test to verify that users stayed on the version they were first assigned and weren't shifted back after a successful deploy.
I noticed that within 1 second, all the users were assigned to the next version.
I believe this is happening because the HTTPRoute being created doesn't pin the user to the primary version (see the sketch after the route below).
HTTPRoute
spec:
  hostnames:
  - charmander.example.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: default-gateway
    namespace: istio-ingress
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 0
    - group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 100
    matches:
    - headers:
      - name: Cookie
        type: RegularExpression
        value: .*flagger-cookie.*nROEvCteRd.*
      path:
        type: PathPrefix
        value: /
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 95
    - filters:
      - responseHeaderModifier:
          add:
          - name: Set-Cookie
            value: flagger-cookie=nROEvCteRd; Max-Age=3600
        type: ResponseHeaderModifier
      group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 5
    matches:
    - path:
        type: PathPrefix
        value: /
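For illustration, a route that also pinned existing users to the primary would need a second cookie-matched rule, something like the sketch below. This is hypothetical: Flagger 1.36.1 does not generate it, and the primary cookie name and value are made up.

  # Hypothetical rule, mirroring the canary cookie rule above: pin users
  # holding a (made-up) primary cookie to the primary backend.
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 100
    - group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 0
    matches:
    - headers:
      - name: Cookie
        type: RegularExpression
        value: .*flagger-primary-cookie.*aBcDeFgHiJ.*
      path:
        type: PathPrefix
        value: /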
Note: charmander is a deployment of ghcr.io/stefanprodan/podinfo. Its unique-title annotation (see the Deployment below) is surfaced as PODINFO_UI_MESSAGE, which is what the k6 script reads back to tell which revision served a request.
To Reproduce
K8s YAML and K6 script:
---
# Source: charmander/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: charmander
  namespace: charmander
  labels:
    app.kubernetes.io/name: charmander
    app.kubernetes.io/component: "web"
spec:
  minReadySeconds: 5
  replicas: 3
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 60
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: charmander
      app.kubernetes.io/component: "web"
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
        unique-title: 'greetings from deploy v1'
      labels:
        app.kubernetes.io/name: charmander
        app.kubernetes.io/component: "web"
    spec:
      containers:
      - name: podinfod
        image: ghcr.io/stefanprodan/podinfo:6.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9898
          protocol: TCP
        - name: http-metrics
          containerPort: 9797
          protocol: TCP
        - name: grpc
          containerPort: 9999
          protocol: TCP
        command:
        - ./podinfo
        - --port=9898
        - --port-metrics=9797
        - --grpc-port=9999
        - --grpc-service-name=podinfo
        - --level=info
        - --random-delay=false
        - --random-error=true
        env:
        - name: PODINFO_UI_COLOR
          value: "#34577c"
        - name: PODINFO_UI_MESSAGE
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['unique-title']
        startupProbe:
          exec:
            command:
            - podcli
            - check
            - http
            - localhost:9898/healthz
          initialDelaySeconds: 30
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 2000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 64Mi
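For reference, podinfo's root endpoint returns a JSON body whose message field carries PODINFO_UI_MESSAGE (here, the unique-title annotation). The k6 script below parses that field to detect the serving revision; abridged, with the other fields omitted, a response looks roughly like:

{
  "message": "greetings from deploy v1"
}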
---
# Source: charmander/templates/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: charmander-canary
  namespace: charmander
spec:
  # when set to true, the deploy will automatically succeed; only use during an emergency.
  skipAnalysis: false
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: charmander
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 120
  service:
    gatewayRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: default-gateway
      namespace: istio-ingress
    hosts:
    - 'charmander.example.com'
    port: 9898
    targetPort: 9898
  analysis:
    interval: 1m
    maxWeight: 50
    metrics: []
    sessionAffinity:
      cookieName: flagger-cookie
      maxAge: 3600
    stepWeight: 10
    threshold: 5
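For context, assuming Flagger's documented stepWeight/maxWeight semantics, this analysis block should ramp the canary weight for users not yet pinned by the cookie roughly like this:

# interval: 1m, stepWeight: 10, maxWeight: 50
# t=0m -> 10% canary / 90% primary
# t=1m -> 20% / 80%
# t=2m -> 30% / 70%
# t=3m -> 40% / 60%
# t=4m -> 50% / 50%, then promotion if all checks pass

So an unpinned user's odds of hitting the canary should grow over about five minutes, rather than everyone flipping within a second.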
And running the following k6 script:
import http from 'k6/http';
import { check, sleep } from 'k6';

export const URL = "https://charmander.example.com/"

export const options = {
  // A number specifying the number of VUs to run concurrently.
  vus: 6,
  // A string specifying the total duration of the test run.
  duration: '600s',
  // Disable clearing cookies
  noCookiesReset: true
};

function parseRevision(resp) {
  try {
    return resp.json().message;
  } catch (e) {
    return null;
  }
}

// Note: k6 hands each VU its own copy of the object returned by setup(),
// so mutations made in the default function persist across iterations
// within a VU but are not shared between VUs, and teardown() receives the
// original (unmutated) copy -- hence the {"changeCount":0,"revision":null}
// line in the output below.
export function setup() {
  return { revision: null, changeCount: 0 };
}

export default function (data) {
  var resp = http.get(URL);
  var revision = parseRevision(resp);
  if (data.revision == null) {
    console.log(`VU initial version ${revision}`);
    data.revision = revision;
  }
  if (revision && revision !== data.revision) {
    data.changeCount++;
    console.log(data.revision + " : " + revision);
    data.revision = revision;
  }
  // Allow at most one version change per VU: users pinned to primary
  // legitimately move to the new version once, at promotion.
  check(resp, { 'changeCount < 2': () => data.changeCount < 2 });
}

export function teardown(data) {
  console.log(data);
}
The output looks like this (the high http_req_failed rate is expected, since podinfo runs with --random-error=true):
scenarios: (100.00%) 1 scenario, 6 max VUs, 10m30s max duration (incl. graceful stop):
* default: 6 looping VUs for 10m0s (gracefulStop: 30s)
INFO[0000] VU initial version greetings from deploy v2 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v2 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0001] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0600] {"changeCount":0,"revision":null} source=console
✓ changeCount < 2
█ setup
█ teardown
checks.........................: 100.00% ✓ 63985 ✗ 0
data_received..................: 27 MB 46 kB/s
data_sent......................: 3.0 MB 4.9 kB/s
http_req_blocked...............: avg=50.85µs min=0s med=1µs max=695.65ms p(90)=1µs p(95)=1µs
http_req_connecting............: avg=11.94µs min=0s med=0s max=86.31ms p(90)=0s p(95)=0s
http_req_duration..............: avg=55.93ms min=33.96ms med=53.5ms max=461.31ms p(90)=64.63ms p(95)=78.13ms
{ expected_response:true }...: avg=56.53ms min=33.96ms med=53.33ms max=461.31ms p(90)=66.94ms p(95)=87.43ms
http_req_failed................: 35.18% ✓ 22515 ✗ 41470
http_req_receiving.............: avg=1.57ms min=6µs med=46µs max=308.44ms p(90)=122µs p(95)=413.79µs
http_req_sending...............: avg=80.69µs min=8µs med=43µs max=26.45ms p(90)=85µs p(95)=130µs
http_req_tls_handshaking.......: avg=32.48µs min=0s med=0s max=301.47ms p(90)=0s p(95)=0s
http_req_waiting...............: avg=54.28ms min=33.81ms med=53.06ms max=461.21ms p(90)=61.73ms p(95)=65.97ms
http_reqs......................: 63985 106.637746/s
iteration_duration.............: avg=56.24ms min=1.79µs med=53.74ms max=772.84ms p(90)=64.99ms p(95)=78.55ms
iterations.....................: 63985 106.637746/s
vus............................: 6 min=6 max=6
vus_max........................: 6 min=6 max=6
running (10m00.0s), 0/6 VUs, 63985 complete and 0 interrupted iterations
default ✓ [======================================] 6 VUs 10m0s
Expected behavior
While the test is running, users stay pinned to the version they were first assigned, and the split between versions stays roughly at the configured weights.
Additional context
- Flagger version: 1.36.1
- Kubernetes version: 1.25
- Service Mesh provider: GatewayAPI + Istio 1.20.3
- Ingress provider: GatewayAPI
Maybe related to https://github.com/fluxcd/flagger/issues/1532