
[Feature Proposal] - Improve allocator HA to favor packing sessions into multi-session servers

Open miai10 opened this issue 7 months ago • 15 comments

Is your feature request related to a problem? Please describe. The current HA solution for the allocator is implemented with a service and at least 2 replicas of the allocator that process the received GameServerAllocations (GSAs) at fixed intervals.

This approach is not well suited to high allocation rates and high-capacity servers, because the allocators compete to update the same GameServer CRD, generating update failures and many retries, which eventually leads to sessions being spread across multiple servers and high allocation times. Even with one session per server, the allocators can compete over the same server.

Using one allocator (especially with batching enabled) with a fine-tuned batch wait time gives better results, but the HA policy is downgraded to just restarting the pod when needed.

The scenarios we need to support would be:

  • one node goes down with the allocator on it
  • planned maintenance when the allocator is restarted to be updated
  • allocator crashes

Describe the solution you'd like The ideal solution would be a design where multiple allocators are running, sharing the load when needed.

Describe alternatives you've considered Possible solutions:

  1. master/slave allocators implemented using the readiness check of each pod and a leader election scheme
  2. pub/sub instead of channel for consuming the allocations
  3. shared cache and list for servers state between allocators

Additional context None

Link to the Agones Feature Proposal (if any) None

Discussion Link (if any) There have been discussions here: https://github.com/googleforgames/agones/pull/4176#issuecomment-2892822957

miai10 avatar May 27 '25 08:05 miai10

I currently have a leader election that sets a label on the leader pod (I can't really use the readiness check because it would fail the helm install), and the service only selects pods with the leader label. It's working well for our use case, but it's not ideal; it feels pretty hacky.
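
Roughly, the labelling step could look like this (a minimal sketch in Go using client-go; the label key, names, and error handling are hypothetical, not the actual implementation):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markSelfAsLeader patches the current pod with a label that the allocator
// Service selector matches, so only the leader pod receives traffic.
func markSelfAsLeader(ctx context.Context, client kubernetes.Interface, namespace, podName string) error {
	patch := []byte(`{"metadata":{"labels":{"allocator-leader":"true"}}}`)
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

A pod that loses (or gives up) leadership would need the inverse patch to remove the label, which is part of what makes the approach feel fragile.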

I will try to find some time to implement another solution:

K8s leader election and olric pub/sub + DMap:

  • Non-leader: would publish to the pub/sub and subscribe to a DMap for the response
  • Leader: would subscribe to the pub/sub and publish the response to the DMap

This way, we can have as many allocators as we want. They would all be behind the service, with leader election: only one pod would handle the batch allocation and the others would work as "proxies".

What do you think about it ? @markmandel @miai10

lacroixthomas avatar Jun 22 '25 22:06 lacroixthomas

From my understanding and our discussions, the leader election + annotation approach will have a few seconds of downtime during maintenance (while the service detects the new pods) but is easier to implement, whereas leader election + pub/sub would have no downtime but is more intrusive in the code, right?

Please add pros/cons, if you have more.

I would vote for the no downtime solution as an investment for the future.

miai10 avatar Jun 25 '25 09:06 miai10

Had an offline DM with @lacroixthomas yesterday chatting about a few options - I think this is probably a good summary of where we landed. @lacroixthomas please correct me if I got anything wrong:

               ┌──────────────────────────────────────────────────┐                              
               │                                                  │                              
               │                Load Balancer                     │                              
               │                                                  │                              
               └───────────┬───────────────┬───────────┬──────────┘                              
                           │               │           │                                         
               ┌───────────┴──┐  ┌─────────┴────┐  ┌───┴──────────┐                              
               │              │  │              │  │              │                              
               │  Allocator   │  │  Allocator   │  │  Allocator   │                              
               │              │  │              │  │              │                              
               └────────┬─────┘  └───────────┬──┘  └──┬───────────┘                              
                        │                    │        │                                          
                        └────────────────────┤        │ ◄──────────Processor pulls from allocator
Hot spare                                    │        │                                          
        │                                    │        │                                          
        │            ┌───────────────┐   ┌───┴────────┴──┐                                       
        │            │               │   │               │              Leader elected           
        └──────────► │   Processor   │   │   Processor   │ ◄──────────────────┘                  
                     │               │   │               │                                       
                     └───────────────┘   └───────────────┘                                       

Basically like what we've been talking about here and a few other places.

  • Requests come in through the Allocator (or GameServerAllocation, but I don't think we need to draw that out).
  • We essentially pub sub from the Allocators to the Processor (or whatever name we want to give it. Sacrificial draft for now).
    • This does mean that the Processor is working on a pull model. I expect we'll pull on what we now have as a batch interval, which by default is 500ms but is configurable.
  • The singular leader elected Processor is the one that takes that batch from all Allocators, makes the list from the cache, runs the allocations across the whole list and then returns the results back up to the allocator to return to the clients.
    • This means we'll no longer conflict with other allocators. Yes, we can't horizontally scale the Processor, but a processor is pretty lightweight CPU-wise, and I expect we'll hit throttling on the K8s control plane well before we hit CPU limits.

Some extra notes from the conversation we had:

  1. The pull model is a concern for processing time, but it gives us HA, and I think it will enable us to batch pretty effectively.
  2. olric looks like potential overkill. To keep latency down, a gRPC approach with a bidirectional stream would probably be way faster. It does mean we're building out some pub/sub mechanics ourselves though (maybe spin that out into its own lib one day? or one already exists?).
  3. HA would happen such that, if a leader goes down, the Allocator publish queue can have a TTL on processing time and can retry within the 30s window the API extension gives us to process requests, at which point the leader should end up switching over and picking up where the previous leader left off. There's definitely some trickiness there - so we might want to start with something that fails early and recovers quickly as a first option, then slowly build out retry mechanisms once we have the base solution.

I think this is a good architecture, but thoughts and feedback are definitely appreciated (and again - correct me if I misremembered anything, got something wrong, or missed something).

markmandel avatar Jun 28 '25 22:06 markmandel

Everything looks good! Thank you for taking the time to bring everything together here with a diagram 😄

I'm playing with the pull model over gRPC streaming to check how things would work with the leader / pulling etc. It's just a dummy test outside of Agones so far, but once it's all good I'll try to integrate it into Agones.

lacroixthomas avatar Jun 29 '25 21:06 lacroixthomas

Great job, guys! I'll keep this in the back of my mind and let you know if I see something worth discussing. Nice programmer art!

I think the proxy allocators can still do the validation work, for example up to here: https://github.com/googleforgames/agones/blob/main/pkg/gameserverallocations/allocator.go#L224.

miai10 avatar Jun 30 '25 08:06 miai10

Sketch of the PRs to make to split the code and ease the review. This list is not set in stone and could change; it's just to give an idea of what's left:

  • [x] PR 1: Add ProcessorAllocator feature gate to runtime configuration.
  • [x] PR 2: Create new processor deployment with leader election (behind feature gate).
  • [x] PR 3: Add gRPC proto definition (behind feature gate).
  • [ ] PR 4: Integrate gRPC server in processor.
  • [x] PR 5: Set up gRPC client to be used later in the allocator and processor.
  • [x] PR 6: Integrate gRPC client in allocator to call processor (behind feature gate).
  • [x] PR 7: Integrate gRPC client in extension controller to call processor (behind feature gate).
  • [ ] PR 8: Add metrics.
  • [ ] PR 9: Write documentation (behind feature gate).

lacroixthomas avatar Jul 18 '25 21:07 lacroixthomas

I'm adding some diagrams following feedback on a PR. What do you think about them, @markmandel @miai10?

How the client connects to the processor is not set in stone, but this is one possible approach.

How the pull-based streaming over gRPC works:

sequenceDiagram
    participant Client as Client (Forwarder)
    participant Server as Server (Processor)
    
    Note over Client, Server: Bidirectional gRPC Stream (StreamBatches)
    
    Client->>Server: ProcessorMessage { client_id: "fwd-123" }
    Note right of Client: Client registers with unique ID
    
    loop Pull-based processing cycle
        Server->>Client: ProcessorMessage { client_id: "fwd-123", pull: PullRequest{} }
        Note left of Server: Server pulls for work batch
        
        alt Has pending requests
            Client->>Server: ProcessorMessage { client_id: "fwd-123", batch: Batch{requests: [req1, req2, ...]} }
            Note right of Client: Send batch of allocation requests
            
            Note over Server: Process each allocation request
            
            Server->>Client: ProcessorMessage { client_id: "fwd-123", batch_response: BatchResponse{responses: [...], errors: []} }
            Note left of Server: Return batch responses
        end
        
        Note over Server: Wait before next pull request
    end
    
    Note over Client, Server: Stream remains open for continuous pulling
    Client->>Server: Close stream
    Server-->>Client: Stream closed
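
A rough Go sketch of the processor side of this loop follows; the pb package and its types are hypothetical stand-ins for whatever proto ends up being defined, and only the shape (register, pull, batch, batch response) follows the diagram above:

import (
	"time"

	pb "example.com/agones/processor/proto" // hypothetical generated code for the stream above
)

// Processor is a stand-in for the leader-elected processor; allocate would run
// the batch against the shared game server cache.
type Processor struct {
	batchInterval time.Duration // e.g. the existing 500ms batch interval
	allocate      func([]*pb.AllocationRequest) *pb.BatchResponse
}

func (p *Processor) StreamBatches(stream pb.Processor_StreamBatchesServer) error {
	// The first message from the allocator/forwarder registers its client ID.
	reg, err := stream.Recv()
	if err != nil {
		return err
	}
	clientID := reg.GetClientId()

	ticker := time.NewTicker(p.batchInterval)
	defer ticker.Stop()

	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case <-ticker.C:
			// Pull: ask this allocator for its pending allocation requests.
			if err := stream.Send(&pb.ProcessorMessage{ClientId: clientID, Pull: &pb.PullRequest{}}); err != nil {
				return err
			}
			// This sketch assumes the client answers every pull, possibly with an empty batch.
			msg, err := stream.Recv()
			if err != nil {
				return err
			}
			batch := msg.GetBatch()
			if batch == nil || len(batch.Requests) == 0 {
				continue
			}
			// Run the allocations and return the results on the same stream.
			resp := &pb.ProcessorMessage{ClientId: clientID, BatchResponse: p.allocate(batch.Requests)}
			if err := stream.Send(resp); err != nil {
				return err
			}
		}
	}
}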

Possible directions for connecting from a "client" to the processor (not set in stone; the implementation has not started):

sequenceDiagram
    participant Client as Client
    participant Config as Config
    participant K8sAPI as Kubernetes API
    participant Lease as Leader Lease
    participant Service as Service Discovery
    participant Server as Server

    Client->>Config: Check leader election enabled?
    
    alt Leader election enabled (>1 replica)
        Note over Client, Lease: Watch-based leader discovery
        Client->>K8sAPI: Watch lease object
        K8sAPI->>Lease: Get current leader info
        Lease-->>K8sAPI: Leader IP/endpoint
        K8sAPI-->>Client: Leader server IP
        
        Client->>Server: Connect to leader IP
        Note right of Client: Establish gRPC connection
        
        loop Continuous watching
            K8sAPI->>Client: Lease change event
            Note over Client: Leader changed
            Client->>Client: Close existing connection
            Client->>Server: Connect to new leader IP
        end
        
    else Leader election disabled (1 replica)
        Note over Client, Service: Direct service discovery
        Client->>Service: Resolve service name
        Service-->>Client: Service IP/endpoint
        
        Client->>Server: Connect to service IP
        Note right of Client: Establish gRPC connection
        
        loop Health monitoring
            Client->>Service: Check service health
            alt Service changed/unhealthy
                Client->>Client: Close existing connection
                Client->>Service: Resolve service name
                Service-->>Client: New service IP
                Client->>Server: Connect to new service IP
            end
        end
    end
    
    Note over Client, Server: Connection established and monitored
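
For the watch-based branch, the leader discovery could look roughly like this (a sketch using client-go; the lease name, namespace, and callback are assumptions, not actual implementation details):

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchLeader watches the leader election Lease and invokes the callback
// whenever the holder identity (the leader pod) changes, so the caller can
// close its existing gRPC connection and dial the new leader.
func watchLeader(ctx context.Context, client kubernetes.Interface, onLeaderChange func(identity string)) error {
	w, err := client.CoordinationV1().Leases("agones-system").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=agones-processor-leader",
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	var current string
	for event := range w.ResultChan() {
		lease, ok := event.Object.(*coordinationv1.Lease)
		if !ok || lease.Spec.HolderIdentity == nil {
			continue
		}
		if id := *lease.Spec.HolderIdentity; id != current {
			current = id
			onLeaderChange(id)
		}
	}
	return nil
}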

lacroixthomas avatar Jul 29 '25 13:07 lacroixthomas

Actually got another idea about how to get the leader IP from the client, something way simpler and cleaner, will draw a little something soon

lacroixthomas avatar Aug 01 '25 17:08 lacroixthomas

@markmandel @miai10

Alright! New design to track the leader IP (with 1 or any number of replicas), which should be simpler and uses the service itself 😄

The processor will always start a gRPC server, and all the processor pods would be ready (whether 1 replica or x replicas); we want the pods to always be ready to avoid failing the helm install with health check errors. The processor has a leader election: only the leader would report its gRPC health service as SERVING, the others would report NOT_SERVING: https://grpc.io/docs/guides/health-checking/#the-server-side-health-service

// health is "google.golang.org/grpc/health", healthpb is "google.golang.org/grpc/health/grpc_health_v1"
healthServer := health.NewServer()
healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_SERVING)     // leader
healthServer.SetServingStatus("processor", healthpb.HealthCheckResponse_NOT_SERVING) // non-leader

healthpb.RegisterHealthServer(grpcServer, healthServer)

This should not put the pod's Kubernetes health check into an error state; the pod would still be ready.
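
The wiring between the leader election and the gRPC health status could look something like this (a sketch using client-go's leaderelection package; the timings and lock setup are illustrative, not the actual processor code):

import (
	"context"
	"time"

	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runProcessorElection(ctx context.Context, lock resourcelock.Interface, hs *health.Server) {
	// Every pod starts as NOT_SERVING; the Kubernetes readiness check stays
	// green regardless, only the gRPC health status flips with leadership.
	hs.SetServingStatus("processor", healthpb.HealthCheckResponse_NOT_SERVING)

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Leader: advertise SERVING so the service routes traffic here.
				hs.SetServingStatus("processor", healthpb.HealthCheckResponse_SERVING)
			},
			OnStoppedLeading: func() {
				// Lost leadership: flip back so clients fail over to the new leader.
				hs.SetServingStatus("processor", healthpb.HealthCheckResponse_NOT_SERVING)
			},
		},
	})
}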

The allocator and the extensions will connect to the leader through its service, using a gRPC health check so that only the SERVING server receives traffic, e.g.:

// Client-side health checking also requires a blank import of "google.golang.org/grpc/health"
// so that the health checking function is registered with the gRPC client.
conn, err := grpc.Dial("processor-service:9090",
    grpc.WithDefaultServiceConfig(`{
        "loadBalancingConfig": [{"round_robin": {}}],
        "healthCheckConfig": {"serviceName": "processor"}
    }`),
)

sequenceDiagram
    participant Allocator
    participant ProcessorService
    participant Processor1
    participant Processor2
    
    Allocator->>ProcessorService: gRPC.Dial("processor-service") + HealthCheck
    
    Note over Processor1, Processor2: Leader Election
    Processor1->>Processor1: Becomes Leader, Set gRPC to SERVING
    Processor2->>Processor2: Remains Non-Leader, Set gRPC to NOT_SERVING
    
    ProcessorService->>Processor1: Health Check (SERVING)
    ProcessorService->>Processor2: Health Check (NOT_SERVING)
    ProcessorService->>Allocator: Route to Healthy Leader
    
    Note over Processor1, Processor2: Leader Failover
    Processor1->>Processor1: Leader Fails
    Processor2->>Processor2: Becomes New Leader, Set gRPC to SERVING
    ProcessorService->>Allocator: Auto-failover to New Leader

lacroixthomas avatar Aug 01 '25 18:08 lacroixthomas

Got a prototype of the processor implementation working with a dummy client using this design. The only thing is that the client will need to retry until it finds the SERVING health check; I don't think that's an issue, it's still really fast (around 3ms).

I will continue the implementation of the processor with gRPC and the allocator + extension; we might soon have it all working together!

Can't wait for us to start some load tests (with and without the batch allocator from the other PR) 😄 With some luck, it will be ready to be in Alpha for the September release 🤞🏼

lacroixthomas avatar Aug 03 '25 21:08 lacroixthomas

Hi! I just had the time to check the diagrams (nice work, they helped me a lot). The latest design seems indeed more elegant.

Is the processor-service a k8s Service that discovers all the processor pods? When we do maintenance to upgrade the processors, would we be able to shut down all non-leader processors, spawn new processors, and then force a leader election?

miai10 avatar Aug 05 '25 22:08 miai10

For running load tests, we might need this too: https://github.com/googleforgames/agones/issues/4192. I'll find some time to work on it in the coming days if you guys consider it useful for the tests.

miai10 avatar Aug 05 '25 22:08 miai10

Is the processor-service a k8s Service that discovers all the processor pods? When we do maintenance to upgrade the processors, would we be able to shut down all non-leader processors, spawn new processors, and then force a leader election?

If the gRPC retries are set up correctly, it shouldn't depend on leader election too much - it'll retry client-side until the request is processed (at least, that would be my theory).

markmandel avatar Aug 12 '25 00:08 markmandel

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

github-actions[bot] avatar Oct 15 '25 10:10 github-actions[bot]

Added awaiting-maintainer so this doesn't stale out. Work is ongoing.

markmandel avatar Oct 15 '25 21:10 markmandel