Allow fleet autoscaler buffersize to be 0

Open roberthbailey opened this issue 3 years ago • 27 comments

Is your feature request related to a problem? Please describe.

Allow the fleet autoscaler to have a buffersize of 0. During development, it is a cost savings to not be running idle game servers and the startup latency of spinning up a new one is not an issue.

Describe the solution you'd like

Change the validation here to allow the buffer size to be 0.
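
Purely as an illustration (the package and function names below are hypothetical, not the actual Agones source), the kind of check the request asks to relax is a minimum-value validation along these lines:

package fleetautoscalers // illustrative name only, not the real package layout

import "errors"

// validateBufferSize sketches the requested change: a buffer size of 0 becomes
// legal, where previously the minimum allowed value was 1.
func validateBufferSize(bufferSize int32) error {
	if bufferSize < 0 {
		return errors.New("bufferSize must be 0 or greater")
	}
	return nil
}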

Describe alternatives you've considered

Leave it the way it is now.

Additional context

n/a

roberthbailey avatar Sep 02 '20 16:09 roberthbailey

Question on this:

Do you want a buffersize of 0? Or do you want a min replicas of 0, with a buffersize of 0?

Which probably are different problems :smile:

Basically, are we talking about scaling to zero at this point?

The TL;DR either way is - I don't think it's worth the extra complexity to autoscale to 0, since there are many, many complex edge cases, and you are going to have at least 1 node running in your cluster regardless - so having just 1 game server in there is not going to cost you anything extra infrastructure wise.

But would love to hear more details!

markmandel avatar Sep 02 '20 17:09 markmandel

I have a use case around this :smile:

TL;DR: we want a fleet per engineer who regularly tests their branch against a server build.

I would like to be able to spin up an arbitrary number of fleets for internal testing purposes; game devs want to be able to spin up servers for their own branch and not affect other people.

So this would be the latter: I want replicas to be able to be 0, and therefore probably only want to start buffering once at least one instance is up.

For us this currently means that I am running a fleet per engineer who regularly tests their branch, so we do not need to buy or set up hardware. It also means there are a fair few extra game servers just sat in Ready; whilst this isn't too much of an issue, it often means spilling over onto another box, which means we pay extra (startup life is tough).

domgreen avatar Nov 06 '20 18:11 domgreen

The original use case I heard was also for development. I'm wondering if there is a different way to tackle this other than changing the fleet autoscaler....

You can currently create an arbitrary number of fleets set to 0 replicas without an autoscaler. So the question becomes what changes the size from 0 -> 1 and back to 0. The first answer seems to be to try and make the fleetautoscaler do it, but it's also something you could drive from your CI system.

Would something like this work:

  1. Trigger build, create image, push to registry
  2. Update fleet spec with new build, set replicas to 1
  3. Create cron job (in CI or in k8s) to set replicas back to 0 after N hours (would probably want to check a hash or create time here)

Depending on what the devs are doing (maybe they need a variable number of game servers) step 2 could be to insert a fleet autoscaler and step 3 could be to remove it and set the replicas back to zero.

This would mean that as long as the developer is actively pushing changes they would have a game server to test. And if they aren't then it gets reaped automatically.
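
For step 3, a minimal sketch of the "set replicas back to 0" job, assuming the Agones versioned Go clientset and a hypothetical namespace/fleet name (a kubectl call from CI would work just as well):

package main

import (
	"context"
	"flag"
	"log"

	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	agones, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	// "dev" and "branch-fleet" are placeholders for the per-developer fleet.
	fleet, err := agones.AgonesV1().Fleets("dev").Get(ctx, "branch-fleet", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fleet.Spec.Replicas = 0 // reap the idle dev game servers
	if _, err := agones.AgonesV1().Fleets("dev").Update(ctx, fleet, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("fleet scaled to 0")
}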

roberthbailey avatar Nov 06 '20 22:11 roberthbailey

:point_up: I like this idea.

This goes back to my original question:

Do you want a buffersize of 0? Or do you want a min replicas of 0, with a buffersize of 0?

And if you want min replicas of 0 -- what tells the system, "Hey, I'd like a Ready GameServer now, so I can do an allocation shortly" ?

I think @roberthbailey 's strategy above is a good one. Maybe even tie it into your dev matchmaker somehow?

markmandel avatar Nov 06 '20 23:11 markmandel

Given that this has been stagnant for a long time, I'm going to close it as "won't implement" (at least for now). We can always re-open to continue the discussion if there is anything more to add later.

roberthbailey avatar Jun 23 '21 20:06 roberthbailey

Hi!

Could we consider re-opening this discussion?

Allow the fleet autoscaler to have a buffersize of 0. During development, it is a cost savings to not be running idle game servers and the startup latency of spinning up a new one is not an issue.

This is exactly my use case. I'm setting up Agones for my open-source pet project, and I would definitely like to cut infrastructure costs while the game is in active development, and a player may appear maybe once a month.

Thank you.

mvlabat avatar Nov 10 '21 21:11 mvlabat

I'm happy to re-open, but I don't know if this is something we will be able to prioritize soon.

roberthbailey avatar Nov 10 '21 22:11 roberthbailey

I'll repeat my original question:

Do you want a buffersize of 0? Or do you want a min replicas of 0, with a buffersize of 0?

Which then leads into the questions that followed. Without answers to those questions, I'm not sure what more we can do here to automate this.

One solution several people have done is use the webhook autoscaler and coordinate that with your dev matchmaker to size up Fleets as needed based on the needs of the development system, since your system actually knows if you need new GameServers to scale up from 0 and Agones has no idea.
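
As a rough sketch of that approach: a webhook FleetAutoscaler backend is just an HTTP endpoint. The struct fields below mirror the FleetAutoscaleReview request/response JSON described in the Agones webhook autoscaler documentation, while desiredReplicas() stands in for whatever demand signal your dev matchmaker exposes:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Minimal mirror of the FleetAutoscaleReview JSON exchanged with the
// fleet autoscaler webhook policy.
type fleetStatus struct {
	Replicas          int32 `json:"replicas"`
	ReadyReplicas     int32 `json:"readyReplicas"`
	AllocatedReplicas int32 `json:"allocatedReplicas"`
}

type autoscaleRequest struct {
	UID       string      `json:"uid"`
	Name      string      `json:"name"`
	Namespace string      `json:"namespace"`
	Status    fleetStatus `json:"status"`
}

type autoscaleResponse struct {
	UID      string `json:"uid"`
	Scale    bool   `json:"scale"`
	Replicas int32  `json:"replicas"`
}

type review struct {
	Request  *autoscaleRequest  `json:"request"`
	Response *autoscaleResponse `json:"response"`
}

// desiredReplicas is a stand-in for "ask the dev matchmaker how many game
// servers are needed right now" - 0 when nobody is testing.
func desiredReplicas(status fleetStatus) int32 {
	return status.AllocatedReplicas // placeholder demand signal
}

func handle(w http.ResponseWriter, r *http.Request) {
	var rev review
	if err := json.NewDecoder(r.Body).Decode(&rev); err != nil || rev.Request == nil {
		http.Error(w, "bad FleetAutoscaleReview", http.StatusBadRequest)
		return
	}
	want := desiredReplicas(rev.Request.Status)
	rev.Response = &autoscaleResponse{
		UID:      rev.Request.UID,
		Scale:    want != rev.Request.Status.Replicas, // only scale when the size changes
		Replicas: want,
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(rev)
}

func main() {
	http.HandleFunc("/scale", handle)
	log.Fatal(http.ListenAndServe(":8000", nil))
}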

markmandel avatar Nov 10 '21 22:11 markmandel

I believe I want both min replicas and buffersize set to 0.

One solution several people have done is use the webhook autoscaler and coordinate that with your dev matchmaker to size up Fleets as needed based on the needs of the development system, since your system actually knows if you need new GameServers to scale up from 0.

That's an interesting idea. I'll dive into its documentation deeper, maybe it's indeed something I could use as a solution.

and Agones has no idea

Won't creating new game server allocations give Agones the idea to scale up the fleet? I was thinking about coding my matchmaker service so that it would ask the Kubernetes API to create new allocations.

mvlabat avatar Nov 10 '21 22:11 mvlabat

Won't creating new game server allocations give Agones the idea to scale up the fleet?

But we can't guarantee that a game server will spin up before the allocation request times out (I've lost track if it's 30s or a minute) - which is locked from the K8s API.

markmandel avatar Nov 10 '21 22:11 markmandel

Would be nice to have this:

apiVersion: "autoscaling.agones.dev/v1"
kind: FleetAutoscaler
metadata:
  name: fleet-autoscaler
spec:
  policy:
    buffer:
      bufferSize: 0
      minReplicas: 0
      maxReplicas: 2

I have many fleets; most of them sit idle and are only tested from time to time.

dzmitry-lahoda avatar Nov 15 '21 15:11 dzmitry-lahoda

@dzmitry-lahoda what would that do exactly? leave the Fleet at 0? And then what happens on allocation? I'm assuming nothing.

At which point, I'm wondering what is the point of the autoscaler at all? 🤔

markmandel avatar Nov 15 '21 20:11 markmandel

I see that a fleet needs to run at least 1 hot server. Our main game servers do run at least one, but testing and debug game servers are launched from time to time, not often, so I'm not sure these need 1 hot server. Why use a fleet? To reuse the same matchmaking and DevOps flows for these game servers. If an allocation is requested, no game server is Ready, and the limit is not reached, one game server could be launched. It's fine for the first allocation request to time out; the system will ask again in a loop. If the game server did not become Allocated but is stuck in Ready, shut it down after a timeout, assuming 2x the allocation timeout. I would not like to operate the fleet via the API, and developing a DevOps flow that scales to one and back depending on some activity seems complicated.
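
A sketch of the "ask again in a loop" part of that idea from the matchmaker's side, using the GameServerAllocation API from the Agones Go clientset (the namespace, omitted selector, and retry policy here are made up for illustration):

package main

import (
	"context"
	"log"
	"time"

	allocationv1 "agones.dev/agones/pkg/apis/allocation/v1"
	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	agones, err := versioned.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Retry a few times while the (scaled-from-zero) game server boots; an
	// attempt that finds nothing comes back UnAllocated rather than as an error.
	for attempt := 0; attempt < 5; attempt++ {
		// Selectors targeting the per-branch fleet are omitted for brevity;
		// an empty spec matches any Ready GameServer in the namespace.
		gsa := &allocationv1.GameServerAllocation{}
		result, err := agones.AllocationV1().GameServerAllocations("dev").Create(ctx, gsa, metav1.CreateOptions{})
		if err == nil && result.Status.State == allocationv1.GameServerAllocationAllocated {
			log.Printf("allocated %s at %s", result.Status.GameServerName, result.Status.Address)
			return
		}
		time.Sleep(10 * time.Second)
	}
	log.Fatal("no game server became Allocated in time")
}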

dzmitry-lahoda avatar Nov 15 '21 20:11 dzmitry-lahoda

So ultimately, you are asking for scale to zero with Fleet auto scaling, which I'm not against, but is a fair bit of a nightmare to handle all the edge cases. I also don't think we can do a "scale to zero, but only for development". As soon as it exists, it needs to work for production and development at all times.

What (I think I understood) from the above probably works for you and your game, but may not work for everyone, so it requires a pretty thorough design, with consideration for all the race conditions that can occur. This is also noting that this system is not (mostly) imperative. It's a declarative, self-healing system with a set of decoupled systems working in concert - so it's a little bit trickier than saying "on allocation, just spin up a game server". Who has that responsibility? Is the allocator service now changing replicas in the Fleet? (which is otherwise the autoscaler's responsibility). What happens when the autoscaler collides with the allocation creating a new GameServer and removes it? Maybe we should change the min buffer on the autoscaler? But then, what tells it to scale back down? Uuurg, it gets very messy very quickly.

developing dev ops flow which depending on some activity scales to one and back seems complicated.

I am amused by this 😁 yes, this is complicated, that's why we've never really done it.

But I'd love to hear if people have detailed designs in mind that cover both scale up and scale down across all the integrated components 👍🏻

markmandel avatar Nov 15 '21 22:11 markmandel

my naive attempt

code design

The spec state of the fleet becomes a sum type:

enum FleetSize {
    YamlSpec(buffer, replicas),
    LiftedSpec(YamlSpec, lease_timeout)
}

fleet is 1

  • GSA request coming
  • all works as before

fleet is 0

  • GSA request is coming
  • Fleet spec is swapped into LiftedSpec
  • The Agones allocator allocates as if the spec's buffer value were lifted to 1

Ready

  • If the GS did not change state, A (the Agones allocator) swaps the spec back to the basic YamlSpec.
  • Deallocates

Allocated

  • A behaves as if the YamlSpec was changed from N to N - 1, and deallocates by the rules of that
  • So it does not deallocate until it is in the Allocated state
  • If a shutdown happened but the lease timeout has not passed, it is allocated again

fleet is 1, but YAML spec says 0

  • follow the rules of N to N - 1

what if the spec was changed to 0, and an allocation came along to change the LiftedSpec to 1

  • it is fine to deallocate and allocate a new one; not sure how this differs from N to N - 1 followed almost immediately by N - 1 to N?

fleet is 1+ always

  • the algorithm is disabled, so if bugs exist, they will be isolated to near-zero fleets, which are already debug fleets

concerns

  • implementing this externally via DevOps is more brittle and complex than doing it from within
  • a simple sum type with a lease may really help to handle this

What happens when the autoscaler collides with the allocation creating a new GameServer and removes it? Maybe we should change the min buffer on the autoscaler?

Would a collision be the same as if I changed the YAML manually to +1, then -1, and then +1?

But then, what tells it to scale back down?

The LiftedSpec lease timeout. The lease timeout should be at least 2x the allocation timeout. I would prefer that it can be set dynamically via the K8s API of A.

production

enum FleetSize {
    YamlSpec(buffer, replicas),
    LiftedZeroSpec(YamlSpec, lease_timeout),
    LiftedProductionSpec(YamlSpec, LeasedAllocatorFunction)
}

So in production I can then provide a lease to grow the buffer from 13 to 42 if the LeasedAllocatorFunction tells it to do so. The default LeasedAllocatorFunction would be ignorant of allocation requests, with the option to swap in an allocator function by name - for example, some simple linear interpolation over the last 3 minutes of allocations (the time needed to spawn a new VM and Docker container).
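
Purely to illustrate that last idea (all names here are hypothetical, not an Agones API): such an allocator function could extrapolate the recent allocation trend over the VM boot time to pick a buffer:

package main

import (
	"fmt"
	"time"
)

// estimateBuffer extrapolates a per-minute allocation trend forward by
// leadTime (roughly the time to boot a VM and pull the image) and uses the
// result as the buffer size.
func estimateBuffer(allocationsPerMinute []int, leadTime time.Duration) int {
	n := len(allocationsPerMinute)
	if n < 2 {
		return 1
	}
	// Average allocations gained per minute across the observed window.
	slope := float64(allocationsPerMinute[n-1]-allocationsPerMinute[0]) / float64(n-1)
	predicted := float64(allocationsPerMinute[n-1]) + slope*leadTime.Minutes()
	if predicted < 1 {
		return 0
	}
	return int(predicted + 0.5) // round to the nearest whole game server
}

func main() {
	// Three per-minute samples and a 2 minute VM lead time.
	fmt.Println(estimateBuffer([]int{13, 20, 28}, 2*time.Minute))
}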

I'm not sure what the hook in Agones is for buffer customization other than changing the spec via YAML, but again - it would be nice to have some algorithm built in; it may be similar tech to that used for near zero.

dzmitry-lahoda avatar Nov 16 '21 09:11 dzmitry-lahoda

For the use case of development, we thought about doing something like this, but we actually settled on an internal command (in our case it's in game, but you could do a Slack command/something else all the same) that creates a one-off GameServer instance that is programmatically applied to Kubernetes. This enabled us to run the full Agones workflow without having to create and manage a fleet and fleetautoscaler for those special instances.

theminecoder avatar Nov 19 '21 10:11 theminecoder

Yeah, we can encode it like: if the fleet exists, create from the fleet; if it does not exist, create some named GS. But who controls how many of such GSs are around? So we will build some limit on such servers. We also have cleanup per fleet (just kill long-running servers), which is another piece of logic. So it's kind of imitating a fleet with arbitrary Docker images. But we somehow need to define the Docker version; we have GitOps for fleets, so we will need some GitOps for these too, or extend the GS client to do so (pass a specific Docker image). It is possible, but it raises a different set of questions around a fleet-like feature which can have 0 instances, run any Docker image ad hoc, and has some kind of limit.

dzmitry-lahoda avatar Nov 19 '21 14:11 dzmitry-lahoda

I would either have users who are responsible (have access to the cluster and can allocate as they please, then clean up) and others who go only via the formalism of the fleet setup.

dzmitry-lahoda avatar Nov 19 '21 14:11 dzmitry-lahoda

But we can't guarantee that a game server will spin up before the allocation request times out (I've lost track if it's 30s or a minute) - which is locked from the K8s API.

@markmandel oh, I didn't know about that. Is there a way to track allocations? I noticed that they always spawn a gameserver in case there wasn't an available one, but the response doesn't always include a server name (I believe it happens if a game server didn't exist and it's a newly created one). So I believe it makes it impossible to reliably correlate allocations to game servers and track their status in this case.

Also, what makes a game allocation time out? My current understanding is that it will time out if a game server doesn't get promoted to Allocated, and one of the cases when that might happen is that a node suitable for the game server's pod doesn't boot up in time. Please correct me if I'm wrong.

And the last question I have: if an allocation request times out, will a game server get deleted as well? Or will it remain in Starting state until it finds a node?

If a game server allocation always responded with a game server name and game servers outlived timed out allocations, I believe the problem you've mentioned could be mitigated by letting clients wait for game servers to finally boot up and re-sending game allocations if needed.

mvlabat avatar Nov 19 '21 18:11 mvlabat

enum FleetSize { 
    YamlSpec(buffer, replicas),
    LiftedSpec(YamlSpec, least_timeout)
}

So I'm struggling with this for a variety of reasons:

  1. Go doesn't have sum types, so this doesn't really translate to the language the controllers are all written in.
  2. We shouldn't be switching out a declarative spec at runtime. This seems extremely counter to how Kubernetes works.

The more I think about this too, I don't think there should be blocking operations in an allocation. We already have a limited set of retries, but allocation tends to be a hot path, and I'm not comfortable putting blockers in its way.

To make things even MORE complicated: an Allocation is not tied to a Fleet. It has a set of preferential selectors, which can easily be cross Fleet, based on arbitrary labels, or used with singular GameServers -- so we can't even tie a Fleet spec/replicas/autoscalers to an Allocation either.

Again, I come back to webhooks. Only you know how to scale to 0 based on your matchmaking criteria (especially if you are scaling your nodes to 0, and need to account for node scale up time), will likely want to do it well before allocation happens, and you are the one that knows which Fleets to scale and when.

I don't think Agones can do that for you.

markmandel avatar Nov 22 '21 22:11 markmandel

Go doesn't have sum types, so this doesn't really translate to the language the controllers are all written in.

A sum type can be a concept, encoded as:

struct StateUnion {
    is_state_a : bool
    is_state_b : bool
    state_a : *SomeStuff
    state_b : *OtherStuff
}

Unsafe, but it can be coded as a pattern. A typical sum type in Go is the return of an error and a result value; usually when there is an error, there is no result.

We shouldn't be switching out a declarative spec at runtime. This seems extremely counter to how Kubernetes works.

That can be an internal detail: each time the API is requested, it responds with the original spec. I am more concerned with the internal state and handling it safely everywhere.

The more I think about this too, I don't think there should be blocking operations in an allocation. We already have a limited set of retries, but allocation tends to be a hot path, and I'm not comfortable putting blockers in its way

Sure, there should not be. But could there be an extra speculative heuristic to over-allocate?

To make things even MORE complicated: an Allocation is not tied to a Fleet. It has a set of preferential selectors, which can easily be cross Fleet, based on arbitrary labels, or used with singular GameServers -- so we can't even tie a Fleet spec/replicas/autoscalers to an Allocation either.

So an allocation would warm up several fleets from zero? That may be fine if that was what was requested, with selectors covering several fleets?

Only you know how to scale to 0 based on your matchmaking criteria (especially if you are scaling your nodes to 0, and need to account for node scale up time), will likely want to do it well before allocation happens, and you are the one that knows which Fleets to scale and when

I can do that, but I'm not sure what the MM should know. For example:

I have a VM with 4 GSs Allocated. There is room for a 5th, which is Ready, with a buffer of one. So right after that 5th is allocated, a new VM should warm up. Imagine that the buffer of VMs is zero. So the allocation request will fail in a loop (rate-limited client requests, with Open Match between the user and the allocator) until the VM boots up. By induction this may happen with any buffer size.

Non-blocking allocation with a non-infinite buffer of everything will fail and time out regardless of fleet size. We need to account for node scale-up time.

But yeah, it is now clearer which options can be used to imitate a zero-size fleet.

Another option: a Fleet with a CPU limit of near zero. In this case the fleet will never be able to run an instance. When an AR comes via the MM (before the allocate call to Agones), call K8s to increase the CPU limit, and after a timeout reduce it back. Store the timeout, time, and limits in Fleet annotations, so these can always be set right. But this requires write access to K8s :( So that would need to be isolated with RBAC and even a namespace.

dzmitry-lahoda avatar Nov 23 '21 07:11 dzmitry-lahoda

My 2p ... I think this isn't an Agones problem and will likely add more complexity and confusion for the majority of users ... it is more of a workflow/testing issue that could be different for every studio.

Much like @theminecoder, our studio has got around this issue using a workflow that easily spins fleets/game servers up and down via CI/CD; this achieves the same goal and keeps it out of the Agones codebase.

Even running UE4 game servers does not add much cost or complexity. There is some logic that means (the majority of) dev servers have all maps loaded rather than requiring server travel to different containers, but this is adequate for them and can easily be altered if server travel needs to be tested.

domgreen avatar Nov 23 '21 08:11 domgreen

@domgreen so everybody solves the same problem and builds a workaround. Why shouldn't it make it into Agones?

Workarounds create an additional attack vector. What if I wanted to allow running test GSs on the live environment? Having some other job which changes fleet settings or allocates wild servers does not look like something that doesn't require special care.

dzmitry-lahoda avatar Nov 23 '21 10:11 dzmitry-lahoda

I wanted to share my results of building a webhook fleet autoscaler, as @markmandel has suggested in one of the previous posts. The implementation turned out to be fairly simple.

In my setup, I have a matchmaker service, which clients connect to via WebSocket when they want to find a server to play. In order to track servers, my matchmaker subscribes to a kubernetes namespace and watches GameServer resources. My logic of determining how many servers are needed reads as follows:

// players waiting in the matchmaker plus players already on servers
let active_players = websocket_subscribers + servers.iter().map(|s| s.player_count).sum();
// round up to the number of servers needed to hold them
let desired_replicas = active_players.unstable_div_ceil(PLAYER_CAPACITY);

I use the assumption that PLAYER_CAPACITY is static and is equal for every GameServer, but this can easily be changed if your case requires more complex logic.

So, basically, the only complexity introduced over the initially discussed variant (where the matchmaker just calls the Agones/Kubernetes API to create an allocation) is spinning up an HTTP listener that responds with the desired replica count, instead of making the API calls yourself.

With all the mentioned problems unsolved (allocation retries, blocking, etc.), the webhook indeed sounds like a more reliable solution, and it allows for more flexibility.

If anyone's interested in a ready recipe for a matchmaker service written in Rust, I can share my example:

Disclaimer: my DevOps knowledge is quite limited, so read the config with caution if you decide to take inspiration from it. I can't promise the setup is effective and secure enough. :)

Another important note is that my current setup assumes only 1 replica for the matchmaker service. Fixing this limitation would require a more complex solution that would support sharing the state between replicas. Otherwise, different replicas will respond with different numbers of active WebSocket subscribers, and that can affect the desired fleet replica count.

I hope this helps.

mvlabat avatar Nov 25 '21 08:11 mvlabat

In order to track servers, my matchmaker subscribes to a kubernetes namespace and watches GameServer resources. My logic of determining how many servers are needed reads as follows:

Looking at the FleetStatus - the number of GameServers, the total player count, and the capacity are available. Does it not come through in the JSON?

https://agones.dev/site/docs/reference/agones_crd_api_reference/#agones.dev/v1.FleetStatus

markmandel avatar Nov 26 '21 00:11 markmandel

@markmandel I didn't check it tbh, but I don't have any reason to believe it doesn't work. My use-case requires watching game servers anyway, as I want to list their names, IP addresses and player count.

mvlabat avatar Nov 26 '21 06:11 mvlabat

Adding another small use case to this: it could be a minor convenience when manually deploying servers. E.g. you have a playtest one morning and deploy a Fleet + Fleet Autoscaler for it. You know that you'll be testing again tomorrow with the same build, but you don't want to leave the servers up for a day (to save money, or to block access). Instead of tearing down the fleet you just scale to 0, then scale back up the next day - you don't have to keep the Fleet/Fleet Autoscaler configuration handy for the second day, which is particularly useful if it's done by a different person. The answer to

what tells the system, "Hey, I'd like a Ready GameServer now, so I can do an allocation shortly" ?

in this case is a manual operator.

pgilfillan avatar Dec 08 '21 07:12 pgilfillan

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions

github-actions[bot] avatar Aug 01 '23 10:08 github-actions[bot]

This issue is marked as obsolete due to inactivity for last 60 days. To avoid issue getting closed in next 30 days, please add a comment or add 'awaiting-maintainer' label. Thank you for your contributions

github-actions[bot] avatar Sep 01 '23 02:09 github-actions[bot]

Bumping to unstale

theminecoder avatar Sep 01 '23 08:09 theminecoder