[Feature Proposal] - Improve allocator HA to favor packing sessions onto multi-session servers
Is your feature request related to a problem? Please describe. The current HA solution for the allocator uses a service in front of at least 2 allocator replicas that process the received GameServerAllocations (GSAs) at fixed intervals.
This approach is not well suited to high allocation rates and high-capacity servers: the allocators compete to update the same GameServer CRD, generating update conflicts and many retries, which eventually spreads sessions across multiple servers and increases allocation time. Even with one session per server, the allocators can compete for the same server.
Using a single allocator (especially with batching enabled) with a fine-tuned batch wait time gives better results, but the HA policy is then reduced to simply restarting the pod when needed.
The scenarios we need to support would be:
- one node goes down with the allocator on it
- planned maintenance when the allocator is restarted to be updated
- allocator crashes
Describe the solution you'd like The perfect solution would be to find a design where multiple allocators are running, sharing the load when needed.
Describe alternatives you've considered Possible solutions:
- master/slave allocators implemented using the readiness check of each pod and a leader election scheme
- pub/sub instead of a channel for consuming the allocations
- a shared cache and list of server state between the allocators
Additional context None
Link to the Agones Feature Proposal (if any) None
Discussion Link (if any) There have been discussions here: https://github.com/googleforgames/agones/pull/4176#issuecomment-2892822957
I currently have a leader election in place: it sets a label on the leader pod (I can't really use the readiness check, because that would fail the helm install), and the service only selects pods with the leader label. It's working well for our use case, but it's not ideal; it feels pretty hacky.
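For illustration, here is a minimal sketch of that label-based approach (the label key, lease name, and env var names are hypothetical, not what is actually shipped): leader election via a Lease, the winner labels its own pod, and the Service selector matches that label.

```go
// Sketch only: leader election that labels the winning pod so a Service
// selector (e.g. agones.dev/allocator-leader=true) routes traffic to it.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig() // error handling elided in this sketch
	client := kubernetes.NewForConfigOrDie(cfg)
	podName, ns := os.Getenv("POD_NAME"), os.Getenv("POD_NAMESPACE")

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "agones-allocator-leader", Namespace: ns},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: podName},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Label this pod; the allocator Service selector matches the label.
				patch := []byte(`{"metadata":{"labels":{"agones.dev/allocator-leader":"true"}}}`)
				_, _ = client.CoreV1().Pods(ns).Patch(ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
				<-ctx.Done()
			},
			// Exit so the pod restarts without the label and rejoins the election.
			OnStoppedLeading: func() { os.Exit(1) },
		},
	})
}
```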
I will try to find some time to implement another solution:
K8s leader election and olric pub/sub + DMap:
- Non-leader: publishes to the pub/sub and subscribes to a DMap for the response
- Leader: subscribes to the pub/sub and publishes the response to the DMap
This way we can have as many allocators as we want: they would all sit behind the service and behind leader election, only one pod would handle the batch allocation, and the others would work as "proxies".
What do you think about it ? @markmandel @miai10
From my understanding and our discussions, the leader election + annotation approach will have a few seconds of downtime during maintenance (while the service detects the new pods) but is easier to implement, whereas leader election + pub/sub would deliver no downtime but is more intrusive in the code, right?
Please add pros/cons, if you have more.
I would vote for the no downtime solution as an investment for the future.
Had an offline DM with @lacroixthomas yesterday chatting about a few options - I think this is probably a good summary of where we landed. @lacroixthomas please correct me if I got anything wrong:
```
┌──────────────────────────────────────────────────┐
│ │
│ Load Balancer │
│ │
└───────────┬───────────────┬───────────┬──────────┘
│ │ │
┌───────────┴──┐ ┌─────────┴────┐ ┌───┴──────────┐
│ │ │ │ │ │
│ Allocator │ │ Allocator │ │ Allocator │
│ │ │ │ │ │
└────────┬─────┘ └───────────┬──┘ └──┬───────────┘
│ │ │
└────────────────────┤ │ ◄──────────Processor pulls from allocator
Hot spare │ │
│ │ │
│ ┌───────────────┐ ┌───┴────────┴──┐
│ │ │ │ │ Leader elected
└──────────► │ Processor │ │ Processor │ ◄──────────────────┘
│ │ │ │
└───────────────┘ └───────────────┘
```
Basically like we've been talking about here and a few other places.
- Requests come in through the Allocator (or GameServerAllocation, but I don't think we need to draw that out).
- We essentially pub/sub from the Allocators to the Processor (or whatever name we want to give it. Sacrificial draft for now).
- This does mean that the Processor is working on a pull model. I expect we'll pull on what we now have as a batch interval, which by default is 500ms but is configurable.
- The singular leader elected Processor is the one that takes that batch from all Allocators, makes the list from the cache, runs the allocations across the whole list and then returns the results back up to the allocator to return to the clients.
- This means we'll no longer conflict with other allocators. Yes, it means the processing can't horizontally scale, but a processor is pretty lightweight CPU-wise, and I expect we'll hit throttling on the K8s control plane well before we hit CPU limits.
Some extra notes from the conversation we had:
- The pull model is a concern for processing time, but it gives us HA, and I think it will let us batch pretty effectively (a rough sketch of the loop follows these notes).
- olric looks like potential overkill. To keep latency down, a gRPC approach with a bidirectional stream would probably be way faster. It does mean we're building out some pub/sub mechanics ourselves though (maybe spin it out into its own lib one day? or one already exists?).
- HA would work such that, if a leader goes down, the Allocator publish queue can have a TTL on processing time and can retry within the 30s window the API extension gives us to process requests, at which point the leader should have switched over and picked up where the previous leader left off. There's definitely some trickiness there, so we might want to start with something that fails early and recovers quickly as a first option, then slowly build out retry mechanisms once we have the base solution.
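To make the pull model concrete, here is a minimal sketch of what the leader-side loop could look like; `pullBatches`, `allocateAll`, and `sendResults` are hypothetical placeholders, not Agones code.

```go
package main

import (
	"context"
	"time"
)

// Placeholder types and helpers — illustrative only, not Agones APIs.
type request struct{}
type result struct{}

func pullBatches(ctx context.Context) []request     { return nil } // ask each Allocator stream for its pending requests
func allocateAll(reqs []request) []result           { return nil } // one pass over the cached GameServer list
func sendResults(ctx context.Context, res []result) {}             // stream responses back to the originating Allocators

// runProcessorLoop is the leader-side pull loop: every batch interval it pulls
// pending allocation requests from the connected Allocators, allocates them in
// a single pass (so GameServer updates never race each other), and returns the results.
func runProcessorLoop(ctx context.Context, batchInterval time.Duration) {
	ticker := time.NewTicker(batchInterval) // e.g. the 500ms default mentioned above
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if batch := pullBatches(ctx); len(batch) > 0 {
				sendResults(ctx, allocateAll(batch))
			}
		}
	}
}

func main() {
	runProcessorLoop(context.Background(), 500*time.Millisecond)
}
```

The point is that all allocation decisions for a tick happen in one place, so the processor never competes with another writer for the same GameServer.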
I think this is a good architecture, but thoughts and feedback are definitely appreciated (and again, correct me if I misremembered anything, got something wrong, or missed something).
Everything looks good! Thank you for taking the time to bring everything together here with a diagram 😄
I'm playing with the pull model with gRPC streaming to check how things would work with the leader / pulling etc. It's just a dummy test outside of Agones so far, but once it's all good I'll try to integrate it into Agones.
Great job, guys! I'll keep this in the back of my mind and let you know if I see something worth discussing. Nice programmer art!
I think the proxy allocators can still do the validation work, for example up to here: https://github.com/googleforgames/agones/blob/main/pkg/gameserverallocations/allocator.go#L224.
Sketch of the PRs to make to split up the code and ease review. This list is not set in stone and could change; it's just to give an idea of what's left:
- [x] PR 1: Add `ProcessorAllocator` feature gate to runtime configuration.
- [x] PR 2: Create new `processor` deployment with leader election (behind feature gate).
- [x] PR 3: Add gRPC proto definition (behind feature gate).
- [ ] PR 4: Integrate gRPC server in processor.
- [x] PR 5: Set up gRPC client to be used later on in allocator and processor.
- [x] PR 6: Integrate gRPC client in allocator to call processor (behind feature gate).
- [x] PR 7: Integrate gRPC client in extension controller to call processor (behind feature gate).
- [ ] PR 8: Add metrics.
- [ ] PR 9: Write documentation (behind feature gate).
I'm adding some diagrams following feedback on a PR. What do you think about them @markmandel @miai10?
How the client connects to the processor is not set in stone, but this is a possible solution.
How the pull-style streaming over gRPC works:
```mermaid
sequenceDiagram
participant Client as Client (Forwarder)
participant Server as Server (Processor)
Note over Client, Server: Bidirectional gRPC Stream (StreamBatches)
Client->>Server: ProcessorMessage { client_id: "fwd-123" }
Note right of Client: Client registers with unique ID
loop Pull-based processing cycle
Server->>Client: ProcessorMessage { client_id: "fwd-123", pull: PullRequest{} }
Note left of Server: Server pulls for work batch
alt Has pending requests
Client->>Server: ProcessorMessage { client_id: "fwd-123", batch: Batch{requests: [req1, req2, ...]} }
Note right of Client: Send batch of allocation requests
Note over Server: Process each allocation request
Server->>Client: ProcessorMessage { client_id: "fwd-123", batch_response: BatchResponse{responses: [...], errors: []} }
Note left of Server: Return batch responses
end
Note over Server: Wait before next pull request
end
Note over Client, Server: Stream remains open for continuous pulling
Client->>Server: Close stream
Server-->>Client: Stream closed
```
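As a rough illustration of the client side of that flow, here is a sketch of a forwarder loop; the `pb.*` types are hypothetical stand-ins for whatever the generated code from the proto ends up being, and the field layout is assumed from the diagram above (imports and generated code omitted).

```go
// Sketch of the Allocator (forwarder) side of the StreamBatches stream.
func runForwarder(ctx context.Context, client pb.ProcessorClient, pending <-chan *pb.AllocationRequest) error {
	stream, err := client.StreamBatches(ctx)
	if err != nil {
		return err
	}
	// Register with a unique ID so the processor can address this forwarder.
	if err := stream.Send(&pb.ProcessorMessage{ClientId: "fwd-123"}); err != nil {
		return err
	}
	for {
		msg, err := stream.Recv()
		if err != nil {
			return err // stream closed or broken; the caller reconnects and retries
		}
		switch {
		case msg.GetPull() != nil:
			// The processor asked for work: drain whatever is queued right now.
			batch := &pb.Batch{}
			for len(pending) > 0 {
				batch.Requests = append(batch.Requests, <-pending)
			}
			if err := stream.Send(&pb.ProcessorMessage{ClientId: "fwd-123", Batch: batch}); err != nil {
				return err
			}
		case msg.GetBatchResponse() != nil:
			// Hand each response back to the waiting caller (omitted here).
		}
	}
}
```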
Possible directions to connect from a "client" to the processor (not set in stone; the implementation has not started):
```mermaid
sequenceDiagram
participant Client as Client
participant Config as Config
participant K8sAPI as Kubernetes API
participant Lease as Leader Lease
participant Service as Service Discovery
participant Server as Server
Client->>Config: Check leader election enabled?
alt Leader election enabled (>1 replica)
Note over Client, Lease: Watch-based leader discovery
Client->>K8sAPI: Watch lease object
K8sAPI->>Lease: Get current leader info
Lease-->>K8sAPI: Leader IP/endpoint
K8sAPI-->>Client: Leader server IP
Client->>Server: Connect to leader IP
Note right of Client: Establish gRPC connection
loop Continuous watching
K8sAPI->>Client: Lease change event
Note over Client: Leader changed
Client->>Client: Close existing connection
Client->>Server: Connect to new leader IP
end
else Leader election disabled (1 replica)
Note over Client, Service: Direct service discovery
Client->>Service: Resolve service name
Service-->>Client: Service IP/endpoint
Client->>Server: Connect to service IP
Note right of Client: Establish gRPC connection
loop Health monitoring
Client->>Service: Check service health
alt Service changed/unhealthy
Client->>Client: Close existing connection
Client->>Service: Resolve service name
Service-->>Client: New service IP
Client->>Server: Connect to new service IP
end
end
end
Note over Client, Server: Connection established and monitored
```
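For that lease-watch variant, the discovery part could look roughly like this sketch (the lease name and namespace are assumptions; a real client would Watch the Lease and reconnect when the holder changes rather than calling Get once):

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// currentLeader resolves the processor leader by reading the coordination
// Lease's holderIdentity. The identity would need to encode (or map to) the
// leader pod's gRPC endpoint.
func currentLeader(ctx context.Context, client kubernetes.Interface, ns string) (string, error) {
	lease, err := client.CoordinationV1().Leases(ns).Get(ctx, "agones-processor-leader", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	if lease.Spec.HolderIdentity == nil {
		return "", fmt.Errorf("no leader elected yet")
	}
	return *lease.Spec.HolderIdentity, nil
}
```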
Actually, I got another idea for how to get the leader IP from the client, something way simpler and cleaner; I'll draw a little something soon.
@markmandel @miai10
Alright! New design to track the leader IP (with 1 or more replicas), which should be simpler and use the service itself 😄
The processor will always start a gRPC server, and all processor pods would be ready (whether 1 replica or x replicas); we want the pods to always be ready so the helm install doesn't fail on health check errors.
The processors run a leader election; only the leader reports its gRPC health service as SERVING, the others report NOT_SERVING: https://grpc.io/docs/guides/health-checking/#the-server-side-health-service
```go
healthServer := health.NewServer()
healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_SERVING)     // Leader
healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_NOT_SERVING) // Non-leader
healthpb.RegisterHealthServer(grpcServer, healthServer)
```
This should not put the pod health check into an error state; the pod would still be ready.
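Roughly, the leader election callbacks would be what flips that status; a sketch of the wiring (names are assumptions, not the actual implementation):

```go
import (
	"context"

	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"k8s.io/client-go/tools/leaderelection"
)

// leaderCallbacks ties the election to the gRPC health service: only the
// current leader reports the "processor" service as SERVING.
func leaderCallbacks(healthServer *health.Server) leaderelection.LeaderCallbacks {
	return leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) {
			healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_SERVING)
			<-ctx.Done() // keep serving until leadership is lost or the context is cancelled
		},
		OnStoppedLeading: func() {
			healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_NOT_SERVING)
		},
	}
}
```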
The allocator and the extensions will connect to the leader through its service, using a gRPC health check so that only the SERVING server gets traffic, e.g.:
```go
conn, err := grpc.Dial("processor-service:9090",
	grpc.WithDefaultServiceConfig(`{
		"loadBalancingConfig": [{"round_robin": {}}],
		"healthCheckConfig": {"serviceName": "processor"}
	}`),
)
```
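One detail I'd expect here (my assumption, not something decided above): grpc-go's client-side health checking only activates if the health package is imported for its side effects, and round_robin over the individual processor pods implies the Service resolves to every pod IP (i.e. a headless Service with the dns resolver). A sketch:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/health" // registers grpc-go's client-side health check function
)

// dialProcessor mirrors the snippet above, with the extras client-side health
// checking needs. "processor-service" is assumed to be a headless Service so
// the dns resolver returns every processor pod IP and round_robin can skip
// the NOT_SERVING (non-leader) backends.
func dialProcessor() (*grpc.ClientConn, error) {
	return grpc.Dial("dns:///processor-service:9090",
		grpc.WithTransportCredentials(insecure.NewCredentials()), // swap for real TLS creds
		grpc.WithDefaultServiceConfig(`{
			"loadBalancingConfig": [{"round_robin": {}}],
			"healthCheckConfig": {"serviceName": "processor"}
		}`),
	)
}
```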
```mermaid
sequenceDiagram
participant Allocator
participant ProcessorService
participant Processor1
participant Processor2
Allocator->>ProcessorService: gRPC.Dial("processor-service") + HealthCheck
Note over Processor1, Processor2: Leader Election
Processor1->>Processor1: Becomes Leader, Set gRPC to SERVING
Processor2->>Processor2: Remains Non-Leader, Set gRPC to NOT_SERVING
ProcessorService->>Processor1: Health Check (SERVING)
ProcessorService->>Processor2: Health Check (NOT_SERVING)
ProcessorService->>Allocator: Route to Healthy Leader
Note over Processor1, Processor2: Leader Failover
Processor1->>Processor1: Leader Fails
Processor2->>Processor2: Becomes New Leader, Set gRPC to SERVING
ProcessorService->>Allocator: Auto-failover to New Leader
```
I've got a prototype of the processor implementation working with a dummy client using this design. The only thing is that the client needs to retry until it finds the SERVING health check; I don't think that's an issue, it's still really fast (around 3ms).
I'll continue the implementation of the processor with gRPC and the allocator + extension; we might soon have it all working together!
Can't wait for us to start some load tests (with and without the batch allocator from the other PR) 😄 With some luck, it will be ready to go Alpha for the September release 🤞🏼
Hi! I just had the time to check the diagrams (nice work, they helped me a lot). The latest design seems indeed more elegant.
Is the processor-service a k8s service that discovers all the processor pods? When we do maintenance to upgrade the processors, would we be able to shut down all non-leader processors, spawn new processors, and then force a leader election?
For running load tests, we might need this too: https://github.com/googleforgames/agones/issues/4192. I'll find some time to work on it in the coming days if you guys consider it useful for the tests.
> Is the processor-service a k8s service that discovers all the processor pods? When we do maintenance to upgrade the processors, would we be able to shut down all non-leader processors, spawn new processors, and then force a leader election?
If the gRPC retries are set up correctly, it shouldn't depend too much on leader election - the client will retry until the request is processed or times out (at least, that would be my theory).
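For reference, such retries can be expressed in the same service config used above; a sketch with a placeholder service name and numbers:

```go
// Illustrative retry policy (service name and values are placeholders): the
// client keeps retrying UNAVAILABLE while the new leader flips to SERVING.
const processorServiceConfig = `{
	"loadBalancingConfig": [{"round_robin": {}}],
	"healthCheckConfig": {"serviceName": "processor"},
	"methodConfig": [{
		"name": [{"service": "agones.allocation.Processor"}],
		"retryPolicy": {
			"maxAttempts": 5,
			"initialBackoff": "0.1s",
			"maxBackoff": "2s",
			"backoffMultiplier": 2,
			"retryableStatusCodes": ["UNAVAILABLE"]
		}
	}]
}`
```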
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
Added awaiting-maintainer so this doesn't stale out. Work is ongoing.