
Player Tracking for each GameServer

Open markmandel opened this issue 5 years ago • 25 comments

Is your feature request related to a problem? Please describe.

It would be very useful to track player connection and disconnection against a specific GameServer.

This would allow us to do several things:

  • Track if a GameServer is full or not (we should also track GameServer capacity as well)
  • Be able to trace and debug a player lifecycle from matchmaker to gameserver connection (we should track an id or token on connection)
  • Autoscale Fleets based on how full gameservers are (useful for persistent worlds)

Describe the solution you'd like

(Full design TBD)

  • Need a way to specify the capacity of a GameServer, at creation time, but also editable from the SDK at runtime
  • SDK methods to track a player connection, and player disconnection with an id/player token
  • GameServer CRD Events for player connections and disconnections
  • GameServer CRD Status values that track player counts and capacity. Maybe should also have a list of current connected players?
  • May want to add some labels to allow for searching for full/empty/partially full game servers through the k8s api

Describe alternatives you've considered

Have a separate system for player tracking. However, based on feedback, this is a feature that almost all users should find useful, so it feels like part of Agones. It also allows some useful functionality down the road, such as autoscaling by player count, or automating backfill operations by searching for non-full game servers.

Additional context

I do have concerns that we are adding more API QPS with this, so we should track performance with these changes.

markmandel avatar Sep 04 '19 17:09 markmandel

I also do have a concern for etcd, it could potentially be a lot depending on the design.

A Kubernetes event each time a player joins?

cyriltovena avatar Sep 15 '19 12:09 cyriltovena

Yeah agreed - was chatting with some people the other day, talking through this.

The basic ideas we ended up with, which I think would work better long term (there are some implementation design decisions to work out too), are:

  • Only keep the player count and capacity on the GameServer CRD status value.
  • The SDK could be smart, and only sync the player count once every second (or configurable interval?), to avoid overloading the system when huge numbers come in. Backfill based on player counts will have race conditions anyway, so games will need to deal with this regardless.
  • The event might just be the update to the player count, not each player, on the same interval as above.
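
As a rough illustration of the batching idea in the list above, here is a minimal Go sketch. It is not the Agones implementation: `countSyncer`, `newCountSyncer`, `add`, and `demoInterval` are hypothetical names, and the `flush` callback stands in for whatever would write the count to the GameServer status. The interval is shortened from the proposed one second so the example finishes quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoInterval is shortened for the example; the proposal above suggests
// one second (or a configurable interval).
const demoInterval = 50 * time.Millisecond

// countSyncer keeps the player count locally and flushes it at most once
// per interval. All names are hypothetical, for illustration only.
type countSyncer struct {
	mu       sync.Mutex
	count    int64
	pending  bool
	interval time.Duration
	flush    func(count int64) // stands in for the GameServer status update
}

func newCountSyncer(interval time.Duration, flush func(int64)) *countSyncer {
	return &countSyncer{interval: interval, flush: flush}
}

// add changes the local count and schedules a flush, unless a flush is
// already pending, in which case this change simply joins that batch.
func (s *countSyncer) add(delta int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count += delta
	if s.pending {
		return
	}
	s.pending = true
	time.AfterFunc(s.interval, func() {
		s.mu.Lock()
		s.pending = false
		c := s.count
		s.mu.Unlock()
		s.flush(c) // called outside the lock so a slow flush does not block add
	})
}

func main() {
	flushed := make(chan int64, 1)
	s := newCountSyncer(demoInterval, func(c int64) { flushed <- c })
	for i := 0; i < 5; i++ {
		s.add(1) // five connections arrive in a burst
	}
	fmt.Println("synced count:", <-flushed)
}
```

However many players join within the interval, the API only sees one update, which is the point of the proposal; the cost is that the stored count can lag the real one by up to one interval.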

To track players - have a CRD configurable webhook that has player data sent to it on connection / disconnection. Something like the gameserver name, connect/disconnect and the token that is passed through. Then we can build a separate system(s) for storing and tracking connections, and/or respond to events as needed as well.

In theory the sidecar would need to send the webhook request (I think), but that's doable (and an implementation detail).

This requires a full write up, but I think it sounds like a reasonable approach at first pass. WDYT?

markmandel avatar Sep 16 '19 15:09 markmandel

Design ideas

The following is the implementation design for the configuration, status, and game server SDK.

Feedback is much appreciated.

Configuration & Status

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  generateName: "simple-udp-"
spec:
  # new configuration
  alpha:
    players:
      initialCapacity: 10 # sets the initial player capacity. Defaults to 0.
      webhook: # http(s) webhook to send player dis/connection events
        service:
          name: player-tracking-service
          namespace: default
          path: player
  ports:
  - name: default
    portPolicy: Dynamic
    containerPort: 7654
  template:
    spec:
      containers:
      - name: simple-udp
        image: gcr.io/agones-images/udp-server:0.17
status:
  alpha:
    # tracks players
    players:
      count: 6 # current number of players. Only updated once per second, so eventually consistent.
      capacity: 10 # current capacity. Set by "initialCapacity" and can be changed by the SDK

  • spec.alpha.players.initialCapacity - the initial player capacity as listed in the status value status.alpha.players.capacity
  • spec.alpha.players.webhook - a webhook configuration that is called whenever a player connects or disconnects. This sort of data seems ill-suited to CRDs, so Agones never stores these values, it’s the responsibility of another system to track this information as required.
    The webhook POSTs a JSON payload with the following details:
    • GameServer name
    • PlayerID (provided through the SDK)
    • Event: “connect” or “disconnect”
  • status.alpha.players.count - The current number of players in the system. This information will be eventually consistent as it is only updated once per second to avoid overloading the system. Since player count race conditions are inevitable, there should always be extra checking of player counts on player connection to a game server.
  • status.alpha.players.capacity - The current player capacity of this gameserver. Initially set by the initialCapacity value, but also able to be changed at runtime by the Game Server SDK.
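
To make the proposed webhook payload concrete, here is a hedged Go sketch of how the three listed fields could be encoded. The design above names the fields but not the JSON keys, so `gameServer`, `playerID`, and `event` are assumptions, as are the `playerEvent` type, the `encodeEvent` helper, and the example GameServer name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// playerEvent mirrors the three fields listed in the design. The JSON key
// names are assumptions; the design does not specify them.
type playerEvent struct {
	GameServer string `json:"gameServer"`
	PlayerID   string `json:"playerID"`
	Event      string `json:"event"` // "connect" or "disconnect"
}

// encodeEvent builds the JSON body that would be POSTed to the webhook.
func encodeEvent(gameServer, playerID, event string) ([]byte, error) {
	if event != "connect" && event != "disconnect" {
		return nil, fmt.Errorf("unknown event %q", event)
	}
	return json.Marshal(playerEvent{GameServer: gameServer, PlayerID: playerID, Event: event})
}

func main() {
	body, err := encodeEvent("simple-udp-7xk2p", "b7xy0", "connect")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// prints {"gameServer":"simple-udp-7xk2p","playerID":"b7xy0","event":"connect"}
}
```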

SDK Functionality

SDK.Alpha.PlayerConnect(playerID)

This increases the SDK’s stored player count by one, and passes the playerID to the system such that it can fire the webhook with a connection event to any configured receivers.

status.alpha.players.count is then set to update to the current player count a second from now, unless there is already an update pending.

Will throw an error if the player count is equal to or greater than the capacity.

SDK.Alpha.PlayerDisconnect(playerID) : bool

Decreases the SDK’s stored player count by one, and passes the playerID to the system such that it can fire the webhook with a disconnection event to any configured receivers.

status.alpha.players.count is then set to update to the current player count a second from now, unless there is already an update pending.

Does nothing, and returns false if the playerID was not previously added. Returns true otherwise.

SDK.Alpha.SetPlayerCapacity(int capacity)

Update the status.alpha.players.capacity value with a new capacity.

SDK.Alpha.GetPlayerCapacity() : int

Retrieves the current capacity. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

SDK.Alpha.GetPlayerCount() : int

Retrieves the current player count. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

markmandel avatar Jan 07 '20 04:01 markmandel

I have some questions regarding PlayerConnect:

SDK.Alpha.PlayerConnect(playerID) This increases the SDK’s stored player count by one, and passes the playerID to the system such that it can fire the webhook with a connection event to any configured receivers.

Should we wait for webhook response that this player is able to connect? Who would be judging the player if he is able or not able to connect to the session?

Would playerID be stored on GS SDK side or separately?

Then we can build a separate system(s) for storing and tracking connections, and/or respond to events as needed as well.

aLekSer avatar Jan 22 '20 14:01 aLekSer

Hi Mark -

Could this metric be utilized for autoscaling? I'm wondering for the relay server use case, in which multiple relay server processes would run inside a single GameServer container and we need a test for "fullness" for autoscaling up and down.

--Rob

cloudrobx avatar Jan 22 '20 15:01 cloudrobx

@aLekSer good questions!

Should we wait for webhook response that this player is able to connect?

Personally, I don't think so. This could potentially make things very slow in the game server binary. I feel like this should be an async operation that happens behind the scenes.

Who would be judging the player if he is able or not able to connect to the session?

This is up to the game server binary to make this decision. So it's game / transport layer specific.

Would playerID be stored on GS SDK side or separately?

In this design, we don't store the playerID at all (except transiently in memory as it gets passed to the system that sends the webhook request). playerID management is up to the game author to track and manage. This is the main impetus for the webhook, to make this easier.

While it could be useful to store playerIDs in a CRD, it seems like a performance concern for etcd. Maybe something to explore down the line if we find we have bandwidth in etcd performance.

@cloudrobx

Could this metric be utilized for autoscaling? I'm wondering for the relay server use case, in which multiple relay server processes would run inside a single GameServer container and we need a test for "fullness" for autoscaling up and down.

Like this? #1034 :smile: would that work for what you were thinking?

markmandel avatar Jan 22 '20 17:01 markmandel

Yeah something like that looks pretty good. For placement purposes is there an easy way to pull the most (or least) full game server by player count, other than iterating through all the game servers in the cluster?

cloudrobx avatar Jan 24 '20 00:01 cloudrobx

Yeah something like that looks pretty good. For placement purposes is there an easy way to pull the most (or least) full game server by player count, other than iterating through all the game servers in the cluster?

See: #1239 for current initial thoughts.

markmandel avatar Jan 24 '20 00:01 markmandel

For relays / high density would probably also want the ability to find the most empty or most full server as well. Since you're essentially adding another layer of bin-packing.
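
For reference, the selection described here can be sketched as a plain linear scan over player counts. This is only a baseline under stated assumptions: `serverLoad` and `pick` are hypothetical names, the data would come from the GameServer status values, and it does not avoid iterating over the game servers.

```go
package main

import "fmt"

// serverLoad is a hypothetical view of one GameServer's player status.
type serverLoad struct {
	Name     string
	Count    int64
	Capacity int64
}

// pick returns a server that still has room for one more player: the
// fullest one when packed is true, the emptiest one otherwise.
// ok is false when every server is at capacity.
func pick(servers []serverLoad, packed bool) (best serverLoad, ok bool) {
	for _, s := range servers {
		if s.Count >= s.Capacity {
			continue // full servers are never candidates
		}
		if !ok || (packed && s.Count > best.Count) || (!packed && s.Count < best.Count) {
			best, ok = s, true
		}
	}
	return best, ok
}

func main() {
	servers := []serverLoad{
		{Name: "relay-a", Count: 9, Capacity: 10},
		{Name: "relay-b", Count: 10, Capacity: 10},
		{Name: "relay-c", Count: 2, Capacity: 10},
	}
	fullest, _ := pick(servers, true)
	emptiest, _ := pick(servers, false)
	fmt.Println(fullest.Name, emptiest.Name) // prints relay-a relay-c
}
```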

cloudrobx avatar Jan 24 '20 01:01 cloudrobx

For relays / high density would probably also want the ability to find the most empty or most full server as well. Since you're essentially adding another layer of bin-packing.

This sounds like the current "Distributed" vs "Packed" strategies we already have in place. Also probably a topic to discuss more on #1239 rather than here, since this ticket isn't covering allocation - just player count tracking.

markmandel avatar Jan 24 '20 01:01 markmandel

Made a small tweak to SDK. Renamed GetCapacity and SetCapacity to GetPlayerCapacity and SetPlayerCapacity - just to be clearer.

markmandel avatar Feb 01 '20 19:02 markmandel

Some design updates that came out of the discussion on https://github.com/googleforgames/agones/pull/1447#discussion_r406637680.

And we came back around to storing the list of player id tokens on the CRD, to avoid any issues with storing player Ids in memory, and things that could go wrong if the Agones sidecar crashed.

Also the extra data stored wouldn't be that large, and we wouldn't take any extra API hits, since we would only update the CRD at the same time we update counts (which are batched).

So the following design changes will occur (I'll update the above design section shortly).

  1. Create a section on the CRD to store player ids
  2. Drop the webhook CRD section, as a Kubernetes GameServer update would have all the information about a player connecting/disconnecting, since it has both the before and after states (and would be a nice batch operation).
  3. Implement a SDK.IsPlayerConnected(token): bool method
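
To illustrate point 2, a consumer that sees both the old and the new GameServer could derive connect/disconnect events by diffing the two id lists. The following Go sketch assumes the lists have already been extracted from the before and after states; `diffPlayers` is a hypothetical helper, not part of Agones.

```go
package main

import "fmt"

// diffPlayers compares the player id lists from the old and new GameServer
// status and reports who connected and who disconnected in between.
func diffPlayers(before, after []string) (connected, disconnected []string) {
	old := make(map[string]bool, len(before))
	for _, id := range before {
		old[id] = true
	}
	cur := make(map[string]bool, len(after))
	for _, id := range after {
		cur[id] = true
		if !old[id] {
			connected = append(connected, id)
		}
	}
	for _, id := range before {
		if !cur[id] {
			disconnected = append(disconnected, id)
		}
	}
	return connected, disconnected
}

func main() {
	before := []string{"b7xy0", "mn87i"}
	after := []string{"mn87i", "p9un1"}
	connected, disconnected := diffPlayers(before, after)
	fmt.Println(connected, disconnected) // prints [p9un1] [b7xy0]
}
```

One limitation of this approach: because updates are batched, a player who connects and disconnects within the same interval never appears in either state, so no event can be derived for them.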

We are also discussing whether SDK.PlayerConnect() and/or SDK.PlayerDisconnect() should return a bool/exception if the player id did / didn't exist when expected. I just noticed that we had designed PlayerDisconnect() to return a bool with a success/failure, so unless anyone has objections, I'll update the design on SDK.PlayerConnect() to do the same thing. But if you think it should return an exception, please definitely say so.

markmandel avatar Apr 14 '20 01:04 markmandel

v2 Design

The following is the implementation design for the configuration, status, and game server SDK.

This includes feedback and decisions from several PRs and commentary.

Feedback is much appreciated.

Configuration & Status

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  generateName: "simple-udp-"
spec:
  # new configuration
  players:
    initialCapacity: 10 # sets the initial player capacity. Defaults to 0.
  ports:
  - name: default
    portPolicy: Dynamic
    containerPort: 7654
  template:
    spec:
      containers:
      - name: simple-udp
        image: gcr.io/agones-images/udp-server:0.17
status:
  # tracks players
  players:
    count: 6 # current number of players. Only updated once per second, so eventually consistent.
    capacity: 10 # current capacity. Set by "initialCapacity" and can be changed by the SDK
    ids: ["b7xy0", "mn87i", "p9un1"] # list of player ids currently connected

  • spec.players.initialCapacity - the initial player capacity as listed in the status value status.players.capacity
  • status.players.count - The current number of players in the system. This information will be eventually consistent as it is only updated once per second to avoid overloading the system. Since player count race conditions are inevitable, there should always be extra checking of player counts on player connection to a game server.
  • status.players.capacity - The current player capacity of this gameserver. Initially set by the initialCapacity value, but also able to be changed at runtime by the Game Server SDK.
  • status.players.ids - The set of connected player ids at this given moment.

SDK Functionality

All SDK functions that change the GameServer state will be asynchronous, in line with all other state-changing SDK commands. To quote the docs:

Calling any of state changing functions mentioned below does not guarantee that GameServer Custom Resource object would actually change its state right after the call ... You can verify the result of this call by waiting for the desired state in a callback to WatchGameServer() function.

SDK.Alpha.PlayerConnect(playerID) : bool

This increases the SDK’s stored player count by one, and appends this playerID to status.players.ids.

status.players.count and status.players.ids are then set to update to the player count and id list a second from now, unless there is already an update pending, in which case the update joins that batch operation.

PlayerConnect returns true and adds the playerID to the list of playerIDs if the playerID was not already in the list of connected playerIDs.

If the player exists within the list of connected playerIDs, PlayerConnect will return false, and the list of connected playerIDs will be left unchanged.

An error will be returned if the playerID was not already in the list of connected playerIDs but the player capacity for the server has been reached. The playerID will not be added to the list of playerIDs.

SDK.Alpha.PlayerDisconnect(playerID) : bool

Decreases the SDK’s stored player count by one, and removes the playerID from status.players.ids.

status.players.count and status.players.ids are then set to update to the player count and id list a second from now, unless there is already an update pending, in which case the update joins that batch operation.

PlayerDisconnect will return true and remove the supplied playerID from the list of connected playerIDs if the playerID value exists within the list.

If the playerID was not in the list of connected playerIDs, the call will return false, and the connected playerID list will be left unchanged.

SDK.Alpha.SetPlayerCapacity(int capacity)

Update the status.players.capacity value with a new capacity.

SDK.Alpha.GetPlayerCapacity() : int

Retrieves the current capacity. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

SDK.Alpha.GetPlayerCount() : int

Retrieves the current player count. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

SDK.Alpha.IsPlayerConnected(playerID) : bool

Returns whether the playerID is currently connected to the GameServer. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

SDK.Alpha.GetConnectedPlayers() : []string

Returns the list of the currently connected player ids. This is always accurate, even if the value hasn’t been updated to the GameServer status yet.

markmandel avatar Apr 14 '20 02:04 markmandel

I like this ... only thing that I do find strange is that the SDK.Alpha.PlayerConnect & SDK.Alpha.PlayerDisconnect are not idempotent.

Personally for SDK.Alpha.PlayerConnect would feel nicer to return true if you are connected already or connecting for the first time and false if you are unable to connect.

Similarly with SDK.Alpha.PlayerDisconnect I feel that it would be nicer to return true if the player was connected and you disconnect them and false otherwise.

It may also be worth having a GetConnectedPlayers API call as we expose the data in the CRD:

      ids: ["b7xy0", "mn87i", "p9un1"] # list of player ids currently connected

domgreen avatar Apr 14 '20 16:04 domgreen

I think we might be saying the same thing, although maybe my words are confusing:

I like this ... only thing that I do find strange is that the SDK.Alpha.PlayerConnect & SDK.Alpha.PlayerDisconnect are not idempotent.

Both methods should be idempotent - I think we're on the same page there.

Personally for SDK.Alpha.PlayerConnect would feel nicer to return true if you are connected already or connecting for the first time and false if you are unable to connect.

My design would return true on the first call of this method with a unique playerId. I think your wording might be better. It would return false only if the player is already marked as connected (hence the idempotency). The bool value is an indicator of a successful operation.

Similarly with SDK.Alpha.PlayerDisconnect I feel that it would be nicer to return true if the player was connected and you disconnect them and false otherwise.

I think we're saying the same things here too. Is there a better way it could be worded?

It may also be worth having a GetConnectedPlayers API call as we expose the data in the CRD:

I couldn't decide if we should have that, so figured I'd wait and see if someone requested it. Sounds like a good idea, I'll add it to the design now! :+1:

markmandel avatar Apr 14 '20 17:04 markmandel

Ah, I think I was using idempotent wrong in thinking we should have the same return value each call, but this is in fact not required. So yes, I think we are both on the same page and it just needs a slight update on wording. 👍

Have had an attempt at rewording (mainly for my own understanding) feel free to take bits or ignore. 😀

PlayerConnect From:

Returns false if the playerID is already marked as connected, true if not.

To:

PlayerConnect returns true and adds the playerid to the list of playerids if the playerid was not already marked as connected. If the player was already marked as connected, false will be returned, and the list of connected playerids will be left unchanged.

An error will be returned if the playerid was not already in the list but the player capacity for the server has been reached. The playerid will not be added to the list of playerids. 

PlayerDisconnect

From:

Returns false if the playerID was not previously marked as connected. Returns true otherwise.

To:

PlayerDisconnect will return true and remove the supplied playerid from the list of playerids. If the playerid was not in the list of playerids the call will return false and the list will be left unchanged.

domgreen avatar Apr 14 '20 18:04 domgreen

@domgreen nice! Much more explicit. I tweaked it a little in some places, but I think it reads much better (and is going to end up being our docs anyway).

Please take a look - I think this is much better :+1:

markmandel avatar Apr 14 '20 20:04 markmandel

An error will be returned if the playerid was not already in the list but the player capacity for the server has been reached. The playerid will not be added to the list of playerids.

In both connect and disconnect, will we return an error if the sidecar is unable to update the CRD? Or since the CRD updates are batched and async, is that not possible?

If the sidecar can't update the CRD, is there a way to inform the game that there is an issue? Or will it just be silently swallowed by the sidecar and only visible in the sidecar logs?

roberthbailey avatar Apr 15 '20 05:04 roberthbailey

@markmandel much better :) just typo playerDIs in Connect.

@roberthbailey why might we not be able to update the CRD?

I think as this would all be async and the source of truth would be in memory (actually probably in the game I think). If for whatever reason an error occurred, the gameplay code would need to handle it. The game itself doesn't really want an error if the CRD isn't updated (my 2c). I would wait until the next update time and try to set it again to the latest value.

domgreen avatar Apr 15 '20 07:04 domgreen

In both connect and disconnect, will we return an error if the sidecar is unable to update the CRD? Or since the CRD updates are batched and async, is that not possible?

This is a good point. I will make a note on the design doc -- but this SDK implementation would match the other SDK methods. The functions are async in that they go into our standard workerqueue, and will retry until completed. The only difference here is that they are enqueued one second in the future. So the execution may intermittently fail (say a master goes down for a short period), but will eventually pass on retry (unless something catastrophic occurs).

I think as this would all be async and the source of truth would be in memory.

Exactly. Which is how we handle all GameServer changes from the SDK, and also why we have SDK methods that say that "this will always be accurate, even if the details haven't been stored on the CRD". So SDK.GetGameServer() will watch what is stored in the CRD, whereas SDK.Alpha.GetPlayerCount() (and related) will look at the local, in memory version, which is updated to the CRD on change.

markmandel avatar Apr 15 '20 17:04 markmandel

This has been released in Alpha.

A heads up to @devjgm, @Reousa, @steven-supersolid, @mollstam, @dotcom, @drichardson who have been doing a lot of SDK work to include this. If you have cycles to work on adding these to your SDK of expertise, please let us know here.

markmandel avatar May 28 '20 16:05 markmandel

Hey Mark! Thanks for the heads up, would be great to be able to contribute again! :)

Gave the issue a read, I see these have already been updated to the relevant proto files, so all that's needed is the relevant language implementation, yeah?

If my understanding is correct, consider it done (hopefully in a timely manner)! 👌

Reousa avatar May 29 '20 20:05 Reousa

That's a good point - you will need to edit the gen.sh scripts such that they generate the code for the alpha.proto, and then yes -- implement the SDK over the top :+1:

Thanks!

markmandel avatar May 29 '20 21:05 markmandel

Working on Node.js now

steven-supersolid avatar Jun 28 '20 14:06 steven-supersolid

I'm going to suggest we close this ticket, since this has been released to alpha and is (mostly?) supported in the SDKs - and I think we can then track each of those as separate tickets as needed.

Also #1677 is going to break all the things (will likely aim to tackle in the next release), so we can track the ongoing work there as well.

Sound good?

markmandel avatar Aug 24 '21 18:08 markmandel

Closing as per the last comment (1.5 years ago).

roberthbailey avatar Feb 08 '23 08:02 roberthbailey