agones
agones copied to clipboard
bad flakiness in e2e tests on Autopilot 1.29
Seems to be pretty consistent on these tests:
time="2024-04-05 17:26:01.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785730 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:03.546" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00047cfb0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:05.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00059f890 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:07.544" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785bd0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="error waiting for fleet condition, dumping Fleet and Gameserver data" error="context deadline exceeded" fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil> Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="Dumping Events:" fleet=simple-fleet-1.0l4pgd kind= test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.292" level=info msg="Event!" fleet=simple-fleet-1.0l4pgd lastTimestamp="2024-04-05 17:16:09 +0000 UTC" message="Created GameServerSet simple-fleet-1.0l4pgd-qzg65" reason=CreatingGameServerSet test=TestAllocatorWithSelectors type=Normal
framework.go:346:
Error Trace: /go/src/agones.dev/agones/test/e2e/framework/framework.go:346
/go/src/agones.dev/agones/test/e2e/allocator_test.go:172
Error: Received unexpected error:
context deadline exceeded
Test: TestAllocatorWithSelectors
Messages: error waiting for fleet condition on fleet: simple-fleet-1.0l4pgd
--- FAIL: TestAllocatorWithSelectors (602.32s)
Also
time="2024-04-05 17:06:04.671" level=info msg="ServiceAccount 1712336764/agones-sdk is created"
time="2024-04-05 17:06:04.846" level=info msg="RoleBinding 1712336764/agones-sdk-access is created"
panic: test timed out after 30m0s
running tests:
TestRestAllocatorWithDeprecatedRequired (9m55s)
goroutine 964 [running]:
testing.(*M).startAlarm.func1()
/usr/local/go/src/testing/testing.go:2259 +0x259
created by time.goFunc
/usr/local/go/src/time/sleep.go:176 +0x45
goroutine 1 [chan receive, 11 minutes]:
testing.(*T).Run(0xc000105860, {0x2a4a031, 0x27}, 0x2b1a110)
/usr/local/go/src/testing/testing.go:1649 +0x871
testing.runTests.func1(0x0?)
/usr/local/go/src/testing/testing.go:2054 +0x85
testing.tRunner(0xc000105860, 0xc000821868)
/usr/local/go/src/testing/testing.go:1595 +0x262
testing.runTests(0xc0001aae60?, {0x3c47920, 0x61, 0x61}, {0xc0008218e0?, 0x441c74?, 0x3c59f20?})
/usr/local/go/src/testing/testing.go:2052 +0x8ae
testing.(*M).Run(0xc0001aae60)
/usr/local/go/src/testing/testing.go:1925 +0xcd8
agones.dev/agones/test/e2e.TestMain(0x10c20f37eb6e55c0?)
/go/src/agones.dev/agones/test/e2e/main_test.go:94 +0xaa5
main.main()
_testmain.go:241 +0x308
goroutine 656 [select]:
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext({0x2db0398, 0xc00028c380}, {0x2da2678?, 0xc00012e240}, 0x1, 0x0, 0x0?)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:91 +0x3d3
k8s.io/apimachinery/pkg/util/wait.PollUntilContextTimeout({0x2db0248, 0x3c8a900}, 0x77359400, 0x1?, 0x1?, 0x5?)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:48 +0xd9
agones.dev/agones/test/e2e/framework.(*Framework).WaitForFleetCondition(0xc0003f4ab0, 0xc000524b60, 0xc0001df200, 0xc000aa1d40)
/go/src/agones.dev/agones/test/e2e/framework/framework.go:353 +0x365
agones.dev/agones/test/e2e/framework.(*Framework).AssertFleetCondition(0x2d95540?, 0xc000524b60?, 0xc0001df200, 0x0?)
/go/src/agones.dev/agones/test/e2e/framework/framework.go:345 +0x5a
agones.dev/agones/test/e2e.TestRestAllocatorWithDeprecatedRequired(0xc000524b60)
/go/src/agones.dev/agones/test/e2e/allocator_test.go:252 +0x325
testing.tRunner(0xc000524b60, 0x2b1a110)
/usr/local/go/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1648 +0x846
Retry might be working in some cases, but this panic is definitely stopping retries -- and 30m is a long time for a 5 GameServer fleet not to spin up. It's very odd.
Some links to failed builds:
- https://console.cloud.google.com/cloud-build/builds/165f77fd-b77a-45c0-a59e-f43c4b745f94;step=1?e=13803378&mods=logs_tg_prod&project=agones-images
- https://console.cloud.google.com/cloud-build/builds/270b2b68-5d66-4845-9db2-1258664bff28?project=258182270954
- https://console.cloud.google.com/cloud-build/builds/9afe197d-5c53-4078-8e83-18058b1c7fd9?project=258182270954
Just looking in the cluster, I do see:
markmandel@cloudshell:~ (agones-images)$ kubectl get ns
NAME STATUS AGE
1711579203 Terminating 8d
1711696994 Terminating 7d10h
1711795766 Terminating 6d7h
1712076562 Terminating 3d1h
1712340564 Active 6m34s
agones-system Active 17d
default Active 43d
gke-gmp-system Active 43d
gke-managed-cim Active 43d
gke-managed-filestorecsi Active 43d
gke-managed-system Active 43d
gmp-public Active 43d
kube-node-lease Active 43d
kube-public Active 43d
kube-system Active 43d
...which is weird.
markmandel@cloudshell:~ (agones-images)$ kubectl describe ns 1711579203
Name: 1711579203
Labels: kubernetes.io/metadata.name=1711579203
owner=e2e-test
Annotations: <none>
Status: Terminating
Conditions:
Type Status LastTransitionTime Reason Message
---- ------ ------------------ ------ -------
NamespaceDeletionDiscoveryFailure False Fri, 05 Apr 2024 16:43:44 +0000 ResourcesDiscovered All resources successfully discovered
NamespaceDeletionGroupVersionParsingFailure False Wed, 27 Mar 2024 23:07:22 +0000 ParsedGroupVersions All legacy kube types successfully parsed
NamespaceDeletionContentFailure True Wed, 27 Mar 2024 23:07:22 +0000 ContentDeletionFailed Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 1711579203 for gvr: /v1, Resource=pods
NamespaceContentRemaining True Wed, 27 Mar 2024 23:07:22 +0000 SomeResourcesRemain Some resources are remaining: gameservers.agones.dev has 12 resource instances, pods. has 12 resource instances
NamespaceFinalizersRemaining True Wed, 27 Mar 2024 23:07:22 +0000 SomeFinalizersRemain Some content in the namespace has finalizers remaining: agones.dev in 12 resource instances
markmandel@cloudshell:~ (agones-images)$ kubectl get pods -n 1711579203
NAME READY STATUS RESTARTS AGE
preferred-w4gnw-dkjjx-5s6bs 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-8h7wx 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-bkd7m 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-c4bdk 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-dj84f 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-gjhw7 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-npfvd 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-qnp8n 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-sl5qm 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-tmdqt 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-tx5wf 0/2 Terminating 0 8d
preferred-w4gnw-dkjjx-wcs2m 0/2 Terminating 0 8d
markmandel@cloudshell:~ (agones-images)$ kubectl get pod -n 1711579203 preferred-w4gnw-dkjjx-5s6bs -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
agones.dev/container: game-server
agones.dev/ready-container-id: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
agones.dev/sdk-version: 1.40.0-dev-5c2e964
autopilot.gke.io/host-port-assignment: '{"min":7000,"max":8000,"portsAssigned":{"1":7240}}'
autopilot.gke.io/primary-container: game-server
autopilot.gke.io/resource-adjustment: '{"input":{"containers":[{"requests":{"cpu":"30m"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"30m","memory":"32Mi"},"requests":{"cpu":"30m","memory":"32Mi"},"name":"game-server"}]},"output":{"containers":[{"limits":{"ephemeral-storage":"1Gi"},"requests":{"cpu":"30m","ephemeral-storage":"1Gi","memory":"2Gi"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"requests":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"name":"game-server"}]},"modified":true}'
autopilot.gke.io/warden-version: 2.9.20
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
creationTimestamp: "2024-03-27T22:54:37Z"
deletionGracePeriodSeconds: 30
deletionTimestamp: "2024-03-27T23:06:09Z"
labels:
agones.dev/gameserver: preferred-w4gnw-dkjjx-5s6bs
agones.dev/role: gameserver
agones.dev/safe-to-evict: "false"
name: preferred-w4gnw-dkjjx-5s6bs
namespace: "1711579203"
ownerReferences:
- apiVersion: agones.dev/v1
blockOwnerDeletion: true
controller: true
kind: GameServer
name: preferred-w4gnw-dkjjx-5s6bs
uid: ec61fed6-63c7-4c80-9a85-9e7dc70d5015
resourceVersion: "47363486"
uid: d608d658-758d-4a07-960b-7ada2526a6ad
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
agones.dev/role: gameserver
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- args:
- --grpc-port=9357
- --http-port=9358
env:
- name: GAMESERVER_NAME
value: preferred-w4gnw-dkjjx-5s6bs
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: FEATURE_GATES
value: CountsAndLists=true&DisableResyncOnSDKServer=false&Example=true&FleetAllocationOverflow=false&GKEAutopilotExtendedDurationPods=true&PlayerAllocationFilter=true&PlayerTracking=true
- name: LOG_LEVEL
value: Info
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 3
periodSeconds: 3
successThreshold: 1
timeoutSeconds: 1
name: agones-gameserver-sidecar
resources:
limits:
ephemeral-storage: 1Gi
requests:
cpu: 30m
ephemeral-storage: 1Gi
memory: 2Gi
securityContext:
capabilities:
drop:
- NET_RAW
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-8jf5m
readOnly: true
- env:
- name: AGONES_SDK_GRPC_PORT
value: "9357"
- name: AGONES_SDK_HTTP_PORT
value: "9358"
image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /gshealthz
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: game-server
ports:
- containerPort: 7654
hostPort: 7240
protocol: UDP
resources:
limits:
cpu: 470m
ephemeral-storage: 1Gi
memory: 62Mi
requests:
cpu: 470m
ephemeral-storage: 1Gi
memory: 62Mi
securityContext:
capabilities:
drop:
- NET_RAW
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: empty
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: preferred-w4gnw-dkjjx-5s6bs
nodeName: gk3-gke-autopilot-e2e-te-nap-qpzyqh5z-032f3139-lcj6
nodeSelector:
cloud.google.com/extended-duration-pods: "0"
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: gke.io/optimize-utilization-scheduler
securityContext:
seccompProfile:
type: Unconfined
serviceAccount: agones-sdk
serviceAccountName: agones-sdk
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: kubernetes.io/arch
operator: Equal
value: amd64
volumes:
- emptyDir: {}
name: empty
- name: kube-api-access-8jf5m
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-03-27T23:10:10Z"
status: "False"
type: PodReadyToStartContainers
- lastProbeTime: null
lastTransitionTime: "2024-03-27T22:57:02Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-03-27T23:10:10Z"
message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-03-27T23:10:10Z"
message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-03-27T22:57:02Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
imageID: us-docker.pkg.dev/agones-images/ci/agones-sdk@sha256:9aa78a749ccdf2b9b6c05484066e2d6f58e40843e87b218ec17cade8c04e267d
lastState: {}
name: agones-gameserver-sidecar
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
exitCode: 137
finishedAt: "2024-03-27T23:06:09Z"
reason: Error
startedAt: "2024-03-27T22:57:12Z"
- containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
imageID: us-docker.pkg.dev/agones-images/examples/simple-game-server@sha256:81d04419aa309fdb553a13be0614540da64a7df2ff9701f38a35c68398fe92e8
lastState: {}
name: game-server
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
exitCode: 0
finishedAt: "2024-03-27T23:05:39Z"
reason: Completed
startedAt: "2024-03-27T22:57:17Z"
hostIP: 10.132.15.220
hostIPs:
- ip: 10.132.15.220
phase: Running
podIP: 10.64.134.69
podIPs:
- ip: 10.64.134.69
qosClass: Burstable
startTime: "2024-03-27T22:57:02Z"
So the pods are stuck for some reason?
kubectl delete pod --all --force --grace-period=0 -n <namespace>
did work to cleanup the Pods though.
I did recently notice:
markmandel@cloudshell:~ (agones-images)$ kubectl get crds | grep agones
fleetautoscalers.autoscaling.agones.dev 2024-04-03T07:47:43Z
fleets.agones.dev 2024-04-03T07:47:42Z
gameserverallocationpolicies.multicluster.agones.dev 2024-04-03T07:47:43Z
gameserversets.agones.dev 2024-04-03T07:47:44Z
But that seems to be after cleanup happened? GameServers
where definitely there before.
Can CRDs go into a Terminating state?
Tracking this in an internal bug, will report back.
Resolved. Will close.