kruise-game
Occasional "NotReady" Network Status on Pod Upon Rebuilding a GameServerSet
We are experiencing an issue when updating our GameServerSet (GSS). The update causes all managed Pods to be rebuilt, and occasionally a Pod (out of the 6 running GameServers) fails to retrieve its network information and ends up with a "NotReady" network status. The details and the steps that lead to the issue are below:
Environment:
- Network Plugin: HostPort
- Number of GameServer replicas in the GSS: 6
Steps to Reproduce:
- Update the GSS by changing the container image and environment variables. This action triggers a rebuild of all Pods managed by the GSS.
- After the old Pods are deleted and the new ones are created, one of the six Pods fails to obtain its network information.
Expected Behavior:
After the update and subsequent Pod recreation, all Pods should successfully retrieve their network information and display a "Ready" network status.
Log information
The following log entries from kruise-game-manager look relevant:
2024-01-26T14:59:46+08:00 I0126 06:59:46.237778 1 hostPort.go:73] Receiving pod dev/gs-dev-a4-3 ADD Operation
2024-01-26T14:59:46+08:00 I0126 06:59:46.237840 1 hostPort.go:80] There is a pod with same ns/name(dev/gs-dev-a4-3) exists in cluster, do not allocate
When a Pod is recreated, the network plugin (Webhook) receives both a DELETE and an ADD operation. However, the old Pod is still in the cluster when the ADD arrives, so the ADD operation fails and the Webhook does not patch the ports onto the new Pod.
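The ordering problem can be illustrated with a minimal sketch. This is not the kruise-game code: `portAllocator`, `onAdd`, and `onDelete` are hypothetical names, and the only behaviour borrowed from the logs above is "a pod with the same ns/name still exists, do not allocate".

```go
package main

import "fmt"

// portAllocator is a hypothetical stand-in for the HostPort plugin state.
// It is NOT the real kruise-game type; it only models the check seen in the logs.
type portAllocator struct {
	existing map[string]bool // pods currently in the cluster, keyed by "namespace/name"
	ports    map[string]int  // host ports handed out, keyed the same way
	nextPort int
}

func newPortAllocator(start int) *portAllocator {
	return &portAllocator{
		existing: map[string]bool{},
		ports:    map[string]int{},
		nextPort: start,
	}
}

// onAdd refuses to allocate while a pod with the same key is still present,
// which is what happens when the ADD arrives before the old pod is gone.
func (a *portAllocator) onAdd(key string) (int, error) {
	if a.existing[key] {
		return 0, fmt.Errorf("pod %s still exists in cluster, do not allocate", key)
	}
	port := a.nextPort
	a.nextPort++
	a.existing[key] = true
	a.ports[key] = port
	return port, nil
}

// onDelete records that the old pod has actually left the cluster.
func (a *portAllocator) onDelete(key string) {
	delete(a.existing, key)
	delete(a.ports, key)
}
```

In this model, an ADD for `dev/gs-dev-a4-3` that arrives before the matching `onDelete` returns an error and no port is assigned, which corresponds to the Pod coming up without patched ports.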
We plan to refactor the mutating Webhook mechanism for the network plugin. The plan is as follows: when the Webhook gets a plugin error, it will deny the request, and the creation request will be retried until no error is generated.
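A rough sketch of that deny-and-retry idea, under stated assumptions: `admissionResponse`, `mutateFunc`, `admit`, and `createWithRetry` are hypothetical names for illustration only, and the retry loop stands in for whatever component re-issues the Pod creation request. The real Webhook would use the Kubernetes admission types instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errPodStillExists is a hypothetical plugin error for the case in the logs.
var errPodStillExists = errors.New("pod with same ns/name still exists")

// admissionResponse is a simplified, hypothetical stand-in for a webhook response.
type admissionResponse struct {
	Allowed bool
	Reason  string
}

// mutateFunc represents the network plugin hook that patches ports onto a pod.
type mutateFunc func(podKey string) error

// admit denies the request when the plugin reports an error,
// instead of letting an unpatched pod through.
func admit(podKey string, mutate mutateFunc) admissionResponse {
	if err := mutate(podKey); err != nil {
		return admissionResponse{
			Allowed: false,
			Reason:  fmt.Sprintf("network plugin error: %v", err),
		}
	}
	return admissionResponse{Allowed: true}
}

// createWithRetry models the creation request being re-sent until it is admitted
// or maxAttempts is reached. It returns the number of attempts made.
func createWithRetry(podKey string, mutate mutateFunc, maxAttempts int) (int, admissionResponse) {
	var resp admissionResponse
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp = admit(podKey, mutate)
		if resp.Allowed {
			return attempt, resp
		}
	}
	return maxAttempts, resp
}
```

The point of denying rather than silently skipping the patch is that a denied creation is retried, so the Pod is only admitted once the plugin has successfully assigned its ports.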
The latest version fixes this.