Contention errors from allocator gRPC service under high load
What happened:
Performance testing revealed that the agones-allocator service returns contention errors under heavy load when there are multiple replicas of the service.
What you expected to happen: The service should allocate game servers under heavy load without contention errors.
How to reproduce it (as minimally and precisely as possible): Run a performance test with 50 parallel clients and 4000 gameservers.
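The issue does not include the test harness, but a minimal sketch of such a load-test client (assuming the public allocation gRPC API in agones.dev/agones/pkg/allocation/go; the endpoint address, certificate paths, namespace and iteration counts are placeholders) could look like:

// Minimal load-test sketch: 50 concurrent clients calling the agones-allocator
// gRPC endpoint over mTLS. Endpoint, cert paths, namespace and loop counts are
// placeholders, not the original test setup.
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"sync"

	pb "agones.dev/agones/pkg/allocation/go"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// mTLS setup: client cert/key issued for the allocator, plus its CA.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		panic(err)
	}
	caCert, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		panic(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caCert)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      caPool,
	})

	conn, err := grpc.Dial("<allocator-endpoint>:443", grpc.WithTransportCredentials(creds))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	client := pb.NewAllocationServiceClient(conn)

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // 50 parallel clients
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 80; j++ { // ~4000 allocations in total across clients
				if _, err := client.Allocate(context.Background(), &pb.AllocationRequest{Namespace: "default"}); err != nil {
					fmt.Printf("client %d: allocation error: %v\n", id, err)
				}
			}
		}(i)
	}
	wg.Wait()
}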
Anything else we need to know?:
The agones-allocator pods cache Ready game servers in memory. Because the state of game servers is changed by different pods, each cache can quickly go out of sync with the actual state of the game servers in the cluster. Either the cache should be replaced with a key-value store shared between the pods, or the allocators should watch game server changes and use the Kubernetes API to fetch a Ready game server by its labels without caching game servers locally.
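For illustration only, a rough sketch of the second option (no per-pod cache; query the API server or a shared lister at allocation time), assuming the Agones clientset at agones.dev/agones/pkg/client/clientset/versioned, a client version whose List takes a context, and selection via the standard agones.dev/fleet label:

// Sketch: pick a Ready GameServer by querying the API instead of a local cache.
// The fleet label and the ctx-taking List signature are assumptions about the
// client version in use; in production this would go through a lister, not a
// direct List on every allocation.
package main

import (
	"context"
	"fmt"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func pickReadyGameServer(ctx context.Context, namespace, fleet string) (*agonesv1.GameServer, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	client, err := versioned.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	// Select by the fleet label from the allocation's required selector.
	list, err := client.AgonesV1().GameServers(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "agones.dev/fleet=" + fleet,
	})
	if err != nil {
		return nil, err
	}
	for i := range list.Items {
		gs := &list.Items[i]
		if gs.Status.State == agonesv1.GameServerStateReady {
			return gs, nil
		}
	}
	return nil, fmt.Errorf("no Ready GameServer found for fleet %s", fleet)
}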
The documentation should also provide recommendations on the number of replicas for packed and distributed allocation.
Environment:
- Agones version: 1.9.0
- Kubernetes version (use kubectl version): 1.16
- Cloud provider or hardware configuration: GKE
- Install method (yaml/helm): Helm
- Troubleshooting guide log(s):
- Others:
I'm assuming this is primarily a problem when we are using the Packed algorithm for allocation (it would be good to have performance metrics on Distributed vs Packed error rate and throughput): since we are trying to bin pack the Allocated game servers, the sorted cache on each allocator binary will generally target the same GameServers.
Some thoughts on this:
- Doing some local performance testing, we can get ~100+ allocations a second (with a quick fix to #1852, PR coming soon). Is this acceptable performance? If so, we could leave this issue alone and say that if you want more, you need to run more clusters.
- I also expect that putting some light randomisation in the allocator (probably here), e.g. taking the first n/% of the sorted list but randomising them before attempting the allocation, would reduce the amount of contention considerably (maybe n defaults to the allocator replica count + 1?). The downside is that we lose some tightness on the packing. This could be a configuration knob where the user chooses the tradeoff of throughput vs bin packing; a rough sketch is below.
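A minimal, standalone sketch of that randomisation idea (not wired into the real allocator data structures; shuffleHead and the choice of n are placeholders):

// Sketch: keep the packed-first ordering overall, but shuffle the first n
// candidates so concurrent allocators don't all race for the same GameServer.
package main

import (
	"math/rand"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
)

// shuffleHead randomises the first n entries of an already-sorted candidate list
// in place and returns the list. n would be a tuning knob (e.g. replicas + 1).
func shuffleHead(sorted []*agonesv1.GameServer, n int) []*agonesv1.GameServer {
	if n > len(sorted) {
		n = len(sorted)
	}
	head := sorted[:n]
	rand.Shuffle(len(head), func(i, j int) {
		head[i], head[j] = head[j], head[i]
	})
	return sorted
}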
I must say, this has taken me on quite a trip down memory lane to remember how allocation works!
One area of research we should also confirm: how much is the gRPC endpoint actually load balanced? It's quite possible that only one pod is being used at any given point in time with the current load balancing setup.
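For reference, one way to check or force client-side spreading is gRPC's round_robin policy over a DNS target; this only helps if the target resolves to multiple pod IPs (e.g. a headless Service), and the target name below is a placeholder:

// Sketch: dial with client-side round-robin across all resolved backend
// addresses. With a normal ClusterIP/LoadBalancer there is a single address,
// so all RPCs ride one connection to one pod regardless of this setting.
func dialRoundRobin(target string, creds credentials.TransportCredentials) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///"+target, // e.g. agones-allocator.agones-system.svc.cluster.local:443
		grpc.WithTransportCredentials(creds),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}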
Another thought, at this point: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L547
Instead have it be:
gs, err := c.readyGameServerCache.PatchGameServerMetadata(res.request.gsa.Spec.MetaPatch, res.gs)
if err != nil {
	// since we could not allocate, we should put it back
	if !k8serrors.IsConflict(err) { // this is the new bit
		c.readyGameServerCache.AddToReadyGameServer(gs)
	}
	res.err = errors.Wrap(err, "error updating allocated gameserver")
} else {
	res.gs = gs
	c.recorder.Event(res.gs, corev1.EventTypeNormal, string(res.gs.Status.State), "Allocated")
}
Basically, if there is a conflict, let the actual version re-populate the cache from the K8s watch operation, since we know this version of the GameServer is stale, given that it conflicted on the update.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue getting closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions.