kruise-game
[proposal] Elegant update and offline of GameServers
Background
Game servers are strongly stateful, so they place high demands on graceful shutdown. A game server typically needs to wait until its data is fully persisted to disk and confirmed safe before it can be removed. Kubernetes natively provides the preStop hook, which lets a container run specific actions before it shuts down, but with one limitation: once the termination grace period (terminationGracePeriodSeconds) expires, the container is forcibly terminated whether or not data processing has finished. In some cases this is not truly graceful. We need a more flexible mechanism that lets game servers exit smoothly while protecting all critical state.
OpenKruise has introduced the Lifecycle Hook feature, which provides precise control and waiting mechanisms for game servers at critical lifecycle moments. This allows servers to execute the actual deletion or update operations only after meeting specific conditions. By providing a configurable Lifecycle field, combined with the ability to customize service quality, OKG ensures that the game server's shutdown process is both graceful and reliable. With this advanced feature, maintainers can ensure that all necessary data persistence and internal state synchronization are safely and correctly completed before the server is smoothly removed or updated.
Example
(This example will not run successfully yet, because the lifecycle field has not been exposed.)
apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: minecraft
  namespace: default
spec:
  replicas: 3
  lifecycle:
    preDelete:
      labelsHandler:
        gs-sync/delete-block: "true"
  gameServerTemplate:
    metadata:
      labels:
        gs-sync/delete-block: "true"
    spec:
      containers:
        - image: registry.cn-beijing.aliyuncs.com/chrisliu95/minecraft-demo:probe-v0
          name: minecraft
          volumeMounts:
            # Must match the volume name declared under "volumes" below.
            - name: gsinfo
              mountPath: /etc/gsinfo
      volumes:
        - name: gsinfo
          downwardAPI:
            items:
              - path: "state"
                fieldRef:
                  fieldPath: metadata.labels['game.kruise.io/gs-state']
  serviceQualities:
    - name: healthy
      containerName: minecraft
      permanent: false
      exec:
        command: ["bash", "./probe.sh"]
      serviceQualityAction:
        - state: true
          result: done
          labels:
            gs-sync/delete-block: "false"
        - state: true
          result: WaitToBeDeleted
          opsState: WaitToBeDeleted
        - state: false
          opsState: None
The corresponding script is shown below. It performs the following actions:
- Reads the current state of the gs from /etc/gsinfo/state and checks whether it is "PreDelete".
  - If it is PreDelete, the gs is in the offline phase. The script checks whether data flushing has completed (in this example, the presence of a file indicates completion).
    - If data flushing has not completed, it performs the flush (in this example, it creates a file).
    - If data flushing has completed, it outputs "done" and exits with 1.
  - If it is not PreDelete, the gs has not entered the offline phase. The script uses the number of players on the game server to decide whether it should go offline now.
    - If the number of players equals 0, it outputs "WaitToBeDeleted" and exits with 1.
    - If the number of players is not 0, it exits with 0.
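#!/bin/bash

# State exposed through the downwardAPI volume, and the marker file that
# stands in for "data has been flushed" in this example.
file_path="/etc/gsinfo/state"
data_flushed_file="/etc/gsinfo/data_flushed"

# No state file yet: nothing to decide.
if [[ ! -f "$file_path" ]]; then
  exit 0
fi

state_content=$(cat "$file_path")

if [[ "$state_content" == "PreDelete" ]]; then
  # Offline phase: report "done" once the flush marker exists.
  if [[ -f "$data_flushed_file" ]]; then
    echo "done"
    exit 1
  else
    # Perform the flush (here: create the marker file) and keep waiting.
    touch "$data_flushed_file"
    echo "WaitToBeDeleted"
    exit 1
  fi
else
  # Not in the offline phase: an empty server asks to be deleted.
  people_count_file="/etc/gsinfo/people_count"
  people_count=$(cat "$people_count_file")
  if [[ "$people_count" -eq 0 ]]; then
    echo "WaitToBeDeleted"
    exit 1
  else
    exit 0
  fi
fi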
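To exercise each branch without a cluster, the same logic can be wrapped in a function that takes the state directory as an argument. This is only a local sketch: the `probe` function name, the temporary directory, and the "Ready" state value (a placeholder for any state other than PreDelete) are assumptions for illustration and are not part of OKG.

```shell
#!/bin/bash
# Local harness (not OKG code): same branching as probe.sh, with the state
# directory passed as $1 so it can point at a temporary directory.
probe() {
  # No state file yet: nothing to decide.
  [ -f "$1/state" ] || return 0
  if [ "$(cat "$1/state")" = "PreDelete" ]; then
    if [ -f "$1/data_flushed" ]; then
      echo "done"                # flush finished
    else
      touch "$1/data_flushed"    # stand-in for the real flush
      echo "WaitToBeDeleted"
    fi
    return 1
  fi
  # Not in PreDelete: an empty server asks to be deleted.
  if [ "$(cat "$1/people_count")" -eq 0 ]; then
    echo "WaitToBeDeleted"
    return 1
  fi
  return 0
}

tmp=$(mktemp -d)

echo "Ready" > "$tmp/state"      # placeholder: any value other than PreDelete
echo "3" > "$tmp/people_count"
rc=0; probe "$tmp" || rc=$?
echo "players online: exit $rc"

echo "0" > "$tmp/people_count"
rc=0; probe "$tmp" || rc=$?
echo "server empty: exit $rc"

echo "PreDelete" > "$tmp/state"
rc=0; probe "$tmp" || rc=$?      # first call performs the flush
echo "flush started: exit $rc"
rc=0; probe "$tmp" || rc=$?      # second call sees the marker file
echo "flush finished: exit $rc"

rm -rf "$tmp"
```

The first call exits with 0 and prints no result. The remaining calls exit with 1 and print "WaitToBeDeleted", "WaitToBeDeleted", and "done" in turn, matching the branches described above.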
The graceful deletion process is as follows:
- The game server is running normally, and the number of players is not 0.
- When the number of players drops to 0, the custom service quality sets opsState to WaitToBeDeleted.
- Through the automatic scale-down policy, OKG deletes the GameServer whose opsState is WaitToBeDeleted. Because the lifecycle hook is configured and the delete-block label is set to true, the gs is not actually deleted; instead it enters the PreDelete state, and the custom service quality triggers the data flushing process.
- Once data flushing is complete, the custom service quality sets the delete-block label to false, which releases the checkpoint.
- After the checkpoint is released, the gs moves from the PreDelete phase to the Delete phase and is actually deleted.
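The interplay between the probe result, the service quality actions, and the hook label can be traced with a toy model. The sketch below is not OKG code: the variable names, the `tick` function, and the "Running" state value are assumptions chosen for illustration, and the real controller logic may differ once the Lifecycle field is implemented.

```shell
#!/bin/bash
# Toy model of the deletion flow described above; none of this is OKG code.
gs_state="Running"      # placeholder for any state other than PreDelete
ops_state="None"
delete_block="true"     # value of the gs-sync/delete-block label
flushed="no"
players=2

# Stand-in for probe.sh: sets $result instead of printing it.
run_probe() {
  result=""
  if [ "$gs_state" = "PreDelete" ]; then
    if [ "$flushed" = "yes" ]; then
      result="done"
    else
      flushed="yes"               # stand-in for the real flush
      result="WaitToBeDeleted"
    fi
  elif [ "$players" -eq 0 ]; then
    result="WaitToBeDeleted"
  fi
}

# One round: apply the matching serviceQualityAction, then the deletion logic.
tick() {
  run_probe
  case "$result" in
    "WaitToBeDeleted") ops_state="WaitToBeDeleted" ;;
    "done")            delete_block="false" ;;
  esac
  if [ "$gs_state" = "Running" ] && [ "$ops_state" = "WaitToBeDeleted" ]; then
    gs_state="PreDelete"          # scale-down requested, hook holds the object
  elif [ "$gs_state" = "PreDelete" ] && [ "$delete_block" = "false" ]; then
    gs_state="Delete"             # checkpoint released
  fi
  echo "players=$players opsState=$ops_state delete-block=$delete_block state=$gs_state"
}

tick          # players online, nothing happens
players=0
tick          # empty server is marked and held in PreDelete
tick          # data flush is performed while the hook blocks deletion
tick          # flush reported as done, label released, gs deleted
```

Each printed line shows the model after one round. The gs stays in PreDelete for as long as the delete-block label is "true", and only the "done" result moves it on to Delete.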
TODO
- Expose the Lifecycle field in GameServerSet.Spec.
- Add "PreDelete" / "PreUpdate" runtime states for GameServer.