GatewayAPI Session Affinity not honored
Describe the bug
When configuring a Canary object to use session affinity with a Kubernetes Gateway API provider, as described in the Session Affinity docs, I ran a K6 test to verify that users stayed on the version they were first assigned and weren't shifted back after a successful deploy.
I noticed that within 1 second, all the users were assigned to the next version.
I believe this is happening because the HTTPRoute being created doesn't pin the user to the primary version (see the sketch after the route below).
HTTPRoute
spec:
  hostnames:
  - charmander.example.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: default-gateway
    namespace: istio-ingress
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 0
    - group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 100
    matches:
    - headers:
      - name: Cookie
        type: RegularExpression
        value: .*flagger-cookie.*nROEvCteRd.*
      path:
        type: PathPrefix
        value: /
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 95
    - filters:
      - responseHeaderModifier:
          add:
          - name: Set-Cookie
            value: flagger-cookie=nROEvCteRd; Max-Age=3600
        type: ResponseHeaderModifier
      group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 5
    matches:
    - path:
        type: PathPrefix
        value: /
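For illustration, a route that also pinned existing users to the primary would need a second cookie-matched rule, something like the sketch below. This is hypothetical: Flagger 1.36.1 does not generate it, and the primary cookie name and value are made up.

  # Hypothetical rule, mirroring the canary cookie rule above: pin users
  # holding a (made-up) primary cookie to the primary backend.
  - backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 100
    - group: ""
      kind: Service
      name: charmander-canary
      port: 9898
      weight: 0
    matches:
    - headers:
      - name: Cookie
        type: RegularExpression
        value: .*flagger-primary-cookie.*aBcDeFgHiJ.*
      path:
        type: PathPrefix
        value: /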
Note: charmander is a deployment of ghcr.io/stefanprodan/podinfo. Its unique-title annotation (see the Deployment below) is surfaced as PODINFO_UI_MESSAGE, which is what the k6 script reads back to tell which revision served a request.
To Reproduce
K8s YAML and K6 script:
---
# Source: charmander/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: charmander
  namespace: charmander
  labels:
    app.kubernetes.io/name: charmander
    app.kubernetes.io/component: "web"
spec:
  minReadySeconds: 5
  replicas: 3
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 60
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: charmander
      app.kubernetes.io/component: "web"
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
        unique-title: 'greetings from deploy v1'
      labels:
        app.kubernetes.io/name: charmander
        app.kubernetes.io/component: "web"
    spec:
      containers:
      - name: podinfod
        image: ghcr.io/stefanprodan/podinfo:6.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9898
          protocol: TCP
        - name: http-metrics
          containerPort: 9797
          protocol: TCP
        - name: grpc
          containerPort: 9999
          protocol: TCP
        command:
        - ./podinfo
        - --port=9898
        - --port-metrics=9797
        - --grpc-port=9999
        - --grpc-service-name=podinfo
        - --level=info
        - --random-delay=false
        - --random-error=true
        env:
        - name: PODINFO_UI_COLOR
          value: "#34577c"
        - name: PODINFO_UI_MESSAGE
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['unique-title']
        startupProbe:
          exec:
            command:
            - podcli
            - check
            - http
            - localhost:9898/healthz
          initialDelaySeconds: 30
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 2000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 64Mi
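For reference, podinfo's root endpoint returns a JSON body whose message field carries PODINFO_UI_MESSAGE (here, the unique-title annotation). The k6 script below parses that field to detect the serving revision; abridged, with the other fields omitted, a response looks roughly like:

{
  "message": "greetings from deploy v1"
}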
---
# Source: charmander/templates/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: charmander-canary
  namespace: charmander
spec:
  # when set to true, the deploy will automatically succeed; only use during an emergency.
  skipAnalysis: false
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: charmander
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 120
  service:
    gatewayRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: default-gateway
      namespace: istio-ingress
    hosts:
    - 'charmander.example.com'
    port: 9898
    targetPort: 9898
  analysis:
    interval: 1m
    maxWeight: 50
    metrics: []
    sessionAffinity:
      cookieName: flagger-cookie
      maxAge: 3600
    stepWeight: 10
    threshold: 5
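For context, assuming Flagger's documented stepWeight/maxWeight semantics, this analysis block should ramp the canary weight for users not yet pinned by the cookie roughly like this:

# interval: 1m, stepWeight: 10, maxWeight: 50
# t=0m -> 10% canary / 90% primary
# t=1m -> 20% / 80%
# t=2m -> 30% / 70%
# t=3m -> 40% / 60%
# t=4m -> 50% / 50%, then promotion if all checks pass

So an unpinned user's odds of hitting the canary should grow over about five minutes, rather than everyone flipping within a second.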
And running the following k6 script:
import http from 'k6/http';
import { check, sleep } from 'k6';

export const URL = "https://charmander.example.com/"

export const options = {
  // A number specifying the number of VUs to run concurrently.
  vus: 6,
  // A string specifying the total duration of the test run.
  duration: '600s',
  // Disable clearing cookies
  noCookiesReset: true
};

function parseRevision(resp) {
  try {
    return resp.json().message;
  } catch (e) {
    return null;
  }
}

// Note: k6 hands each VU its own copy of the object returned by setup(),
// so mutations made in the default function persist across iterations
// within a VU but are not shared between VUs, and teardown() receives the
// original (unmutated) copy -- hence the {"changeCount":0,"revision":null}
// line in the output below.
export function setup() {
  return { revision: null, changeCount: 0 };
}

export default function (data) {
  var resp = http.get(URL);
  var revision = parseRevision(resp);
  if (data.revision == null) {
    console.log(`VU initial version ${revision}`);
    data.revision = revision;
  }
  if (revision && revision !== data.revision) {
    data.changeCount++;
    console.log(data.revision + " : " + revision);
    data.revision = revision;
  }
  // Allow at most one version change per VU: users pinned to primary
  // legitimately move to the new version once, at promotion.
  check(resp, { 'changeCount < 2': () => data.changeCount < 2 });
}

export function teardown(data) {
  console.log(data);
}
The output looks like this (the high http_req_failed rate is expected, since podinfo runs with --random-error=true):
scenarios: (100.00%) 1 scenario, 6 max VUs, 10m30s max duration (incl. graceful stop):
* default: 6 looping VUs for 10m0s (gracefulStop: 30s)
INFO[0000] VU initial version greetings from deploy v2 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v2 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] VU initial version greetings from deploy v1 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0001] greetings from deploy v1 : greetings from deploy v2 source=console
INFO[0600] {"changeCount":0,"revision":null} source=console
✓ changeCount < 2
█ setup
█ teardown
checks.........................: 100.00% ✓ 63985 ✗ 0
data_received..................: 27 MB 46 kB/s
data_sent......................: 3.0 MB 4.9 kB/s
http_req_blocked...............: avg=50.85µs min=0s med=1µs max=695.65ms p(90)=1µs p(95)=1µs
http_req_connecting............: avg=11.94µs min=0s med=0s max=86.31ms p(90)=0s p(95)=0s
http_req_duration..............: avg=55.93ms min=33.96ms med=53.5ms max=461.31ms p(90)=64.63ms p(95)=78.13ms
{ expected_response:true }...: avg=56.53ms min=33.96ms med=53.33ms max=461.31ms p(90)=66.94ms p(95)=87.43ms
http_req_failed................: 35.18% ✓ 22515 ✗ 41470
http_req_receiving.............: avg=1.57ms min=6µs med=46µs max=308.44ms p(90)=122µs p(95)=413.79µs
http_req_sending...............: avg=80.69µs min=8µs med=43µs max=26.45ms p(90)=85µs p(95)=130µs
http_req_tls_handshaking.......: avg=32.48µs min=0s med=0s max=301.47ms p(90)=0s p(95)=0s
http_req_waiting...............: avg=54.28ms min=33.81ms med=53.06ms max=461.21ms p(90)=61.73ms p(95)=65.97ms
http_reqs......................: 63985 106.637746/s
iteration_duration.............: avg=56.24ms min=1.79µs med=53.74ms max=772.84ms p(90)=64.99ms p(95)=78.55ms
iterations.....................: 63985 106.637746/s
vus............................: 6 min=6 max=6
vus_max........................: 6 min=6 max=6
running (10m00.0s), 0/6 VUs, 63985 complete and 0 interrupted iterations
default ✓ [======================================] 6 VUs 10m0s
Expected behavior
While the test is running, users stay pinned to the version they were first assigned, and the split between versions stays roughly at the configured weights.
Additional context
- Flagger version: 1.36.1
- Kubernetes version: 1.25
- Service Mesh provider: GatewayAPI + Istio 1.20.3
- Ingress provider: GatewayAPI
Maybe related to https://github.com/fluxcd/flagger/issues/1532