
Contention errors from allocator gRPC service under high load

Open pooneh-m opened this issue 4 years ago • 8 comments

What happened: Performance testing revealed that the agones-allocator service returns contention errors under heavy load when there are multiple replicas of the service.

What you expected to happen: The service should allocate game servers under heavy load without contention errors.

How to reproduce it (as minimally and precisely as possible): Run a performance test with 50 parallel clients and 4000 gameservers.

Anything else we need to know?: The agones-allocator pods cache game servers in memory. Because the state of a game server can be changed by a different pod, the cache can quickly go out of sync with the actual state of game servers in the cluster. Either the cache should be replaced with a key-value store shared between the pods, or the allocators should watch game server changes and use the Kubernetes API to fetch a Ready game server by its labels without caching it. A rough sketch of the latter option is shown below.
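
For illustration only, a minimal sketch of the "query the API by label instead of caching" option might look like the following. This is not how the allocator is implemented today; the fleet label, namespace, and the context-aware List signature (newer client-gen; older Agones releases take only ListOptions) are assumptions.

    // Sketch: fetch a Ready GameServer directly from the Kubernetes API
    // instead of relying on a per-pod in-memory cache.
    package main

    import (
        "context"
        "fmt"

        agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
        "agones.dev/agones/pkg/client/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/rest"
    )

    // findReadyGameServer lists GameServers matching a label selector and
    // returns the first one currently in the Ready state.
    func findReadyGameServer(ctx context.Context, client versioned.Interface, namespace, selector string) (*agonesv1.GameServer, error) {
        list, err := client.AgonesV1().GameServers(namespace).List(ctx, metav1.ListOptions{
            LabelSelector: selector, // e.g. "agones.dev/fleet=my-fleet" (placeholder)
        })
        if err != nil {
            return nil, err
        }
        for i := range list.Items {
            if list.Items[i].Status.State == agonesv1.GameServerStateReady {
                return &list.Items[i], nil
            }
        }
        return nil, fmt.Errorf("no Ready GameServer found for selector %q", selector)
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client := versioned.NewForConfigOrDie(cfg)

        gs, err := findReadyGameServer(context.Background(), client, "default", "agones.dev/fleet=my-fleet")
        if err != nil {
            panic(err)
        }
        fmt.Println("candidate:", gs.Name)
    }

Listing on every allocation trades memory-staleness problems for more API server load, which is part of why a shared store or a watch-based approach is worth comparing.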

The documentation should also provide recommendations on the number of replicas for packed and distributed allocation.

Environment:

  • Agones version: 1.9.0
  • Kubernetes version (use kubectl version): 1.16
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): Helm
  • Troubleshooting guide log(s):
  • Others:

pooneh-m avatar Oct 20 '20 19:10 pooneh-m

I'm assuming this is primarily a problem when we are using the Packed scheduling strategy for allocation (it would be good to have performance metrics on Distributed vs Packed error rate and throughput): since we are trying to bin-pack the Allocated game servers, the sorted cache on each allocator binary will generally target the same GameServers.

Some thoughts on this:

  • Doing some local performance testing, we can get ~100+ allocations a second (with a quick fix to #1852, PR coming soon). Is this acceptable performance? If so, we could leave this issue alone and say that if you want more, you need to run more clusters.
  • I also expect that putting some light randomisation in the allocator (probably here), e.g. taking the first n (or some percentage) of the sorted list but randomising them before attempting the allocation, would reduce the amount of contention considerably (maybe n defaults to the allocator replica count + 1 for randomisation?). The downside is that we lose some tightness on the packing. This might be a configuration knob we can provide, where the user can choose the trade-off of throughput vs bin packing. A rough sketch of the randomisation is shown after this list.
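
To make the randomisation idea concrete, a minimal sketch (illustrative only; the helper name and parameters are made up, not the actual allocator code) could look like this:

    import (
        "math/rand"

        agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
    )

    // shuffleTopN randomises the order of the first n game servers in a list
    // that has already been sorted for packing, trading a little packing
    // tightness for less contention between allocator replicas.
    func shuffleTopN(sorted []*agonesv1.GameServer, n int) []*agonesv1.GameServer {
        if n > len(sorted) {
            n = len(sorted)
        }
        out := make([]*agonesv1.GameServer, len(sorted))
        copy(out, sorted)
        rand.Shuffle(n, func(i, j int) {
            out[i], out[j] = out[j], out[i]
        })
        return out
    }

With n tied to the allocator replica count, each replica is more likely to try a different candidate first, while the overall result stays close to the packed ordering.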

I must say - this has taken me on quite a trip down memory lane to remember how allocation works!

markmandel avatar Oct 20 '20 19:10 markmandel

One area of research we should also confirm: how much is the gRPC endpoint actually load balanced?

It's quite possible that only one pod is being used at any given point in time with the current load balancing setup.
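
For context, gRPC clients hold long-lived HTTP/2 connections, so a plain ClusterIP Service tends to pin a client to a single allocator pod. One way to spread requests is client-side round-robin over the pod IPs from a headless Service; a hedged sketch is below. The DNS name is a placeholder, and the insecure dial option is only there to keep the example self-contained - the real allocator endpoint is served over (m)TLS.

    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        // Round-robin across the allocator pod IPs resolved via DNS
        // (requires a headless Service so DNS returns all pod addresses).
        conn, err := grpc.Dial(
            "dns:///agones-allocator.agones-system.svc.cluster.local:443",
            grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
            grpc.WithInsecure(), // placeholder only; use transport credentials in practice
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }

Comparing error rates with and without this kind of client-side balancing would tell us how much the contention is really spread across replicas.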

markmandel avatar Oct 27 '20 18:10 markmandel

Another thought, at this point: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L547

Instead have it be:

    gs, err := c.readyGameServerCache.PatchGameServerMetadata(res.request.gsa.Spec.MetaPatch, res.gs)
    if err != nil {
        // since we could not allocate, we should put it back
        if !k8serrors.IsConflict(err) { // this is the new bit
            c.readyGameServerCache.AddToReadyGameServer(gs)
        }
        res.err = errors.Wrap(err, "error updating allocated gameserver")
    } else {
        res.gs = gs
        c.recorder.Event(res.gs, corev1.EventTypeNormal, string(res.gs.Status.State), "Allocated")
    }

Basically, if there is a conflict, let the K8s watch operation re-populate the cache with the actual version, since we know this copy of the GameServer is stale: it conflicted on the update.

markmandel avatar Nov 10 '20 19:11 markmandel

'This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions '

github-actions[bot] avatar Sep 15 '23 10:09 github-actions[bot]

This issue is marked as obsolete due to inactivity for last 60 days. To avoid issue getting closed in next 30 days, please add a comment or add 'awaiting-maintainer' label. Thank you for your contributions

github-actions[bot] avatar Feb 15 '24 02:02 github-actions[bot]